DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text

Chakravarthi, Bharathi Raja, Priyadharshini, Ruba, Muralidaran, Vigneshwaran, Jose, Navya, Suryawanshi, Shardul, Sherly, Elizabeth and McCrae, John P. 2022. DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text. Language Resources and Evaluation 56 (3), pp. 765-806. doi:10.1007/s10579-022-09583-7

PDF - Published Version

License URL: http://creativecommons.org/licenses/by/4.0/

Official URL: https://doi.org/10.1007/s10579-022-09583-7

Abstract

This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages, built from social media comments. The dataset, comprising more than 60,000 YouTube comments in total, was annotated for sentiment analysis and offensive language identification. It consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators and shows high inter-annotator agreement as measured by Krippendorff’s alpha. The dataset contains all types of code-mixing phenomena since it comprises user-generated content from a multilingual country. We also present baseline experiments to establish benchmarks on the dataset using machine learning and deep learning methods. The dataset is available on GitHub and Zenodo.
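
The abstract reports agreement in terms of Krippendorff's alpha. As a rough illustration of what that statistic measures, the following is a minimal sketch of the nominal-data variant, computed as one minus the ratio of observed to expected disagreement. The function name, the input layout (one list of labels per comment) and the example labels are assumptions made for this illustration; they are not taken from the paper or its released code.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list with one entry per annotated item; each entry is the
    list of labels that item received (missing annotations simply omitted).
    This layout is an assumption of the sketch.
    """
    # Coincidence matrix: every ordered pair of labels given to the same item
    # by different annotators contributes 1 / (m - 1), with m the number of
    # labels on that item. Items with fewer than two labels are not pairable.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label and overall number of pairable values.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least two pairable labels")

    # Both disagreements below omit the common factor 1 / n,
    # which cancels in the ratio.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected


# Hypothetical labels from two annotators on four comments.
example = [
    ["Positive", "Positive"],
    ["Positive", "Negative"],
    ["Negative", "Negative"],
    ["Negative", "Positive"],
]
alpha = krippendorff_alpha_nominal(example)
```

A value of 1 indicates perfect agreement and values near 0 indicate agreement no better than chance, so the toy data above, where the annotators disagree on half of the items, yields a low alpha.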

Item Type:	Article
Date Type:	Published Online
Status:	Published
Schools:	Schools > Computer Science & Informatics
Additional Information:	License information from Publisher: LICENSE 1: URL: http://creativecommons.org/licenses/by/4.0/, Type: open-access
Publisher:	Springer
ISSN:	1574-020X
Date of First Compliant Deposit:	19 August 2022
Date of Acceptance:	24 January 2022
Last Modified:	04 May 2023 18:36
URI:	https://orca.cardiff.ac.uk/id/eprint/152052

Citation Data

Cited 67 times in Scopus.
