Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Combining BERT with static word embeddings for categorizing social media

Alghanmi, Israa, Espinosa-Anke, Luis ORCID: https://orcid.org/0000-0001-6830-9176 and Schockaert, Steven ORCID: https://orcid.org/0000-0002-9256-2881 2020. Combining BERT with static word embeddings for categorizing social media. Presented at: 6th Workshop on Noisy User-generated Text (W-NUT 2020), Virtual, 19 November 2020. Published in: Xu, W., Ritter, A., Baldwin, T. and Rahimi, A. eds. Proceedings of the Sixth Workshop on Noisy User-generated Text. Association for Computational Linguistics, pp. 28-33. 10.18653/v1/2020.wnut-1.5

PDF - Accepted Post-Print Version

Abstract

Pre-trained neural language models (LMs) have achieved impressive results in various natural language processing tasks, across different languages. Surprisingly, this extends to the social media genre, despite the fact that social media often has very different characteristics from the language that LMs have seen during training. A particularly striking example is the performance of AraBERT, an LM for the Arabic language, which is successful in categorizing social media posts in Arabic dialects, despite only having been trained on Modern Standard Arabic. Our hypothesis in this paper is that the performance of LMs for social media can nonetheless be improved by incorporating static word vectors that have been specifically trained on social media. We show that a simple method for incorporating such word vectors is indeed successful in several Arabic and English benchmarks. Curiously, however, we also find that similar improvements are possible with word vectors that have been trained on traditional text sources (e.g. Wikipedia).

Item Type: Conference or Workshop Item (Paper)
Date Type: Published Online
Status: Published
Schools: Professional Services > Advanced Research Computing @ Cardiff (ARCCA); Schools > Computer Science & Informatics
Publisher: Association for Computational Linguistics
Date of First Compliant Deposit: 20 October 2020
Date of Acceptance: 29 September 2020
Last Modified: 18 Dec 2025 12:27
URI: https://orca.cardiff.ac.uk/id/eprint/135741
