Shyam Ratan, a PhD student at the Centre for Applied Linguistics and Translation Studies (CALTS), School of Humanities (SoH), University of Hyderabad (UoH), working with Prof. Selvaraj Arulmozi, published a research paper in the Springer journal Language Resources and Evaluation (LREV) on November 16, 2023.

Shyam Ratan and his co-authors published the paper “A multilingual, multimodal dataset of aggression and bias: the ComMA dataset”. In the paper, we discuss the development of a multilingual dataset annotated with a hierarchical, fine-grained tagset marking different types of aggression and the “context” in which they occur. The context, here, is defined by the conversational thread in which a specific comment occurs and also by the “type” of discursive role that the comment performs with respect to the previous comment(s). The dataset was developed as part of the ComMA Project and consists of a total of 57,363 annotated comments, 1,142 annotated memes, and around 70 hours of annotated audio (extracted from videos) in four languages: Meitei, Bangla, Hindi, and Indian English. The data was collected from various social media platforms such as YouTube, Facebook, Twitter, and Telegram. As is usual on social media websites, a large number of these comments are multilingual, and many are code-mixed with English.


This paper gives a detailed description of the tagset developed during the course of the project and elaborates on the process of developing and using a multi-label, fine-grained tagset for marking comments with aggression and bias of various kinds, including gender bias, religious intolerance (called communal bias in the tagset), class/caste bias, and ethnic/racial bias. We define and discuss the tags used for marking the different discursive roles performed through the comments, such as attack and defend. We also present a statistical analysis of the dataset as well as the results of our baseline experiments for developing an automatic aggression identification system using the dataset. Based on the results of the baseline experiments, we also argue that our dataset provides diverse and ‘hard’ sets of instances, which makes it a good dataset for training and testing new techniques for aggressive and abusive language classification.


https://link.springer.com/article/10.1007/s10579-023-09696-7