Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Go simple and pre-train on domain-specific corpora: on the role of training data for text classification

Edwards, Aleksandra, Camacho-Collados, Jose ORCID: https://orcid.org/0000-0003-1618-7239, De Ribaupierre, Hélène and Preece, Alun ORCID: https://orcid.org/0000-0003-0349-9057 2020. Go simple and pre-train on domain-specific corpora: on the role of training data for text classification. Presented at: 28th International Conference on Computational Linguistics, Barcelona, Spain, 8-13 December 2020. Published in: Scott, Donia, Bel, Nuria and Zong, Chengqing eds. Proceedings of the 28th International Conference on Computational Linguistics. International Committee on Computational Linguistics, 5522–5529. 10.18653/v1/2020.coling-main.481

PDF (Published Version), available under a Creative Commons Attribution license. Download (725kB)

Abstract

Pre-trained language models provide the foundations for state-of-the-art performance across a wide range of natural language processing tasks, including text classification. However, most classification datasets assume a large amount of labeled data, which is commonly not the case in practical settings. In particular, in this paper we compare the performance of a light-weight linear classifier based on word embeddings, i.e., fastText (Joulin et al., 2017), versus a pre-trained language model, i.e., BERT (Devlin et al., 2019), across a wide range of datasets and classification tasks. In general, results show the importance of domain-specific unlabeled data, both in the form of word embeddings and language models. As for the comparison, BERT outperforms all baselines on standard datasets with large training sets. However, in settings with small training datasets, a simple method like fastText coupled with domain-specific word embeddings performs as well as or better than BERT, even when BERT is pre-trained on domain-specific data.
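To illustrate the fastText-style baseline the abstract refers to, here is a minimal sketch of the underlying idea: represent a document as the average of its word embeddings and train a linear classifier on top. The 2-d embeddings, vocabulary, and training examples below are hypothetical toy values for illustration only, not the paper's domain-specific vectors or datasets.

```python
import math

# Toy 2-d "word embeddings" (hypothetical; real fastText vectors are far larger)
EMB = {
    "goal": (1.0, 0.1), "match": (0.9, 0.0), "team": (0.8, 0.2),
    "stock": (0.0, 1.0), "market": (0.1, 0.9), "shares": (0.2, 0.8),
}

def doc_vector(tokens):
    """Average the embeddings of a document's tokens (fastText-style)."""
    vecs = [EMB[t] for t in tokens if t in EMB]
    return tuple(sum(v[i] for v in vecs) / len(vecs) for i in range(2))

def train_logreg(data, epochs=200, lr=0.5):
    """Plain logistic regression by gradient descent on averaged vectors."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for tokens, label in data:
            x = doc_vector(tokens)
            z = w[0] * x[0] + w[1] * x[1] + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            g = p - label                    # gradient of the log loss
            w[0] -= lr * g * x[0]
            w[1] -= lr * g * x[1]
            b -= lr * g
    return w, b

def predict(model, tokens):
    w, b = model
    x = doc_vector(tokens)
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

# Toy training set: label 0 = sports, label 1 = finance
train = [
    (["goal", "match"], 0), (["team", "goal"], 0),
    (["stock", "market"], 1), (["shares", "market"], 1),
]
model = train_logreg(train)
print(predict(model, ["team", "match"]))    # → 0 (sports)
print(predict(model, ["shares", "stock"]))  # → 1 (finance)
```

The quality of the averaged vectors, and hence of the classifier, depends directly on the embeddings, which is why the paper stresses pre-training them on domain-specific unlabeled corpora.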

Item Type: Conference or Workshop Item (Paper)
Date Type: Publication
Status: Published
Schools: Computer Science & Informatics
Publisher: International Committee on Computational Linguistics
ISBN: 978-1-952148-27-9
Date of First Compliant Deposit: 25 September 2025
Last Modified: 25 Sep 2025 13:30
URI: https://orca.cardiff.ac.uk/id/eprint/181321
