Muhammad, Shamsuddeen Hassan, Abdulmumin, Idris, Ayele, Abinew Ali, Adelani, David Ifeoluwa, Ahmad, Ibrahim Said, Aliyu, Saminu Mohammad, Röttger, Paul, Oppong, Abigail, Bukula, Andiswa, Chukwuneke, Chiamaka Ijeoma, Jibril, Ebrahim Chekol, Ismail, Elyas Abdi, Alemneh, Esubalew, Gebremichael, Hagos Tesfahun, Aliyu, Lukman Jibril, Beloucif, Meriem, Hourrane, Oumaima, Mabuya, Rooweither, Osei, Salomey, Rutunda, Samuel, Belay, Tadesse Destaw, Guge, Tadesse Kebede, Asfaw, Tesfa Tegegne, Wanzare, Lilian Diana Awuor, Onyango, Nelson Odhiambo, Yimam, Seid Muhie and Ousidhoum, Nedjma 2025. AfriHate: A multilingual collection of hate speech and abusive language datasets for African languages. Presented at: NAACL 2025, Albuquerque, New Mexico, 29 April - 4 May 2025. Published in: Chiruzzo, Luis, Ritter, Alan and Wang, Lu eds. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1. Association for Computational Linguistics, pp. 1854-1871. 10.18653/v1/2025.naacl-long.92
Abstract
Hate speech and abusive language are global phenomena that need socio-cultural background knowledge to be understood, identified, and moderated. However, in many regions of the Global South, there have been several documented occurrences of (1) absence of moderation and (2) censorship due to the reliance on keyword spotting out of context. Further, high-profile individuals have frequently been at the center of the moderation process, while large and targeted hate speech campaigns against minorities have been overlooked. These limitations are mainly due to the lack of high-quality data in the local languages and the failure to include local communities in the collection, annotation, and moderation processes. To address this issue, we present AfriHate: a multilingual collection of hate speech and abusive language datasets in 15 African languages. Each instance in AfriHate is a tweet annotated by native speakers familiar with the regional culture. We report the challenges related to the construction of the datasets and present various classification baseline results with and without using LLMs. We find that model performance highly depends on the language and that multilingual models can help boost performance in low-resource settings.
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Date Type: | Publication |
Status: | Published |
Schools: | Schools > Computer Science & Informatics |
Publisher: | Association for Computational Linguistics |
ISBN: | 979-8-89176-189-6 |
Last Modified: | 10 Jun 2025 09:32 |
URI: | https://orca.cardiff.ac.uk/id/eprint/178951 |