Shyam Ratan, a PhD student in the Centre of Applied Linguistics and Translation Studies (CALTS), School of Humanities (SoH), University of Hyderabad (UoH), working with Prof. Selvaraj Arulmozi, attended the 2nd Annual Meeting of the Special Interest Group on Under-resourced Languages (SIGUL 2023), a Satellite Workshop of Interspeech 2023, International Speech Communication Association (ISCA)/European Language Resources Association (ELRA), held on 18-20 August 2023 in Dublin, Ireland.

Shyam Ratan and his co-authors presented a paper titled “Collecting Speech Data for Endangered and Under-resourced Indian Languages”. The paper observes that the preparation of speech corpora for languages un(der)represented on the web largely depends on manual methods of data collection and processing from different sources. The methods used in field linguistics and documentary linguistics for collecting data from speech communities provide a valuable set of resources and methodologies for such data collection, but these methods were not developed or optimised for large-scale data collection. This limitation could be overcome by combining linguistic field methods with crowdsourcing. The paper discusses two such ongoing projects, SpeeD-TB and SpeeD-IA, in which the authors are experimenting with different methods and developing software and other infrastructure to rapidly collect speech data in six Tibeto-Burman languages (Toto, Chokri, Nyishi, Kok Borok, Bodo and Meitei) and four Indo-Aryan languages (Awadhi, Bhojpuri, Braj and Magahi) in India. So far, the team has collected over 40 hours of speech data in these languages, and over the next year it plans to collect a total of approximately 1,200 hours.