AfriSenti: a Twitter sentiment analysis benchmark for African languages

Muhammad, Shamsuddeen, Abdulmumin, Idris, Ayele, Abinew, Ousidhoum, Nedjma, Adelani, David, Yimam, Seid, Ahmad, Ibrahim, Beloucif, Meriem, Mohammad, Saif, Ruder, Sebastian, Hourrane, Oumaima, Jorge, Alipio, Brazdil, Pavel, Ali, Felermino, David, Davis, Osei, Salomey, Shehu-Bello, Bello, Lawan, Falalu, Gwadabe, Tajuddeen, Rutunda, Samuel, Belay, Tadesse, Messelle, Wendimu, Balcha, Hailu, Chala, Sisay, Gebremichael, Hagos, Opoku, Bernard and Arthur, Stephen 2023. AfriSenti: a Twitter sentiment analysis benchmark for African languages. Presented at: The 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), Singapore, 6-10 December 2023. Published in: Bouamor, Houda, Pino, Juan and Bali, Kalika eds. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 13968–13981. 10.18653/v1/2023.emnlp-main.862

Full text not available from this repository.

Official URL: http://dx.doi.org/10.18653/v1/2023.emnlp-main.862

Abstract

Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents. These include 75 languages with at least one million speakers each. Yet, there is little NLP research conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets. In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains more than 110,000 tweets in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, and Yorùbá) from four language families. The tweets were annotated by native speakers and used in the AfriSenti-SemEval shared task. We describe the data collection methodology, annotation process, and the challenges we dealt with when curating each dataset. We further report baseline experiments conducted on the different datasets and discuss their usefulness.

Item Type:	Conference or Workshop Item - published (Paper)
Date Type:	Publication
Status:	Published
Schools:	Schools > Computer Science & Informatics
Publisher:	Association for Computational Linguistics
ISBN:	9781713885917
Last Modified:	12 Jun 2025 14:45
URI:	https://orca.cardiff.ac.uk/id/eprint/178470
