Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text

Chakravarthi, Bharathi Raja, Priyadharshini, Ruba, Muralidaran, Vigneshwaran, Jose, Navya, Suryawanshi, Shardul, Sherly, Elizabeth and McCrae, John P. 2022. DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text. Language Resources and Evaluation 56 (3), pp. 765-806. doi: 10.1007/s10579-022-09583-7


Abstract

This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages, built from social media comments. The dataset was annotated for sentiment analysis and offensive language identification and covers a total of more than 60,000 YouTube comments. It consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators and shows high inter-annotator agreement as measured by Krippendorff’s alpha. The dataset contains all types of code-mixing phenomena since it comprises user-generated content from a multilingual country. We also present baseline experiments to establish benchmarks on the dataset using machine learning and deep learning methods. The dataset is available on GitHub and Zenodo.

Item Type: Article
Date Type: Published Online
Status: Published
Schools: Computer Science & Informatics
Additional Information: License information from Publisher: LICENSE 1: URL: http://creativecommons.org/licenses/by/4.0/, Type: open-access
Publisher: Springer
ISSN: 1574-020X
Date of First Compliant Deposit: 19 August 2022
Date of Acceptance: 24 January 2022
Last Modified: 19 Aug 2022 16:45
URI: https://orca.cardiff.ac.uk/id/eprint/152052

Citation Data

Cited 9 times in Scopus.
