ORCA
Online Research @ Cardiff

Topic modelling: Going beyond token outputs

Williams, Lowri, Anthi, Eirini, Arman, Laura and Burnap, Peter

2024. Topic modelling: Going beyond token outputs. Big Data and Cognitive Computing 8 (5) , 44. 10.3390/bdcc8050044

PDF - Published Version
Available under License Creative Commons Attribution.

License URL: http://creativecommons.org/licenses/by/4.0

License Start date: 25 April 2024

Official URL: https://doi.org/10.3390/bdcc8050044

Abstract

Topic modelling is a text mining technique for identifying salient themes from a collection of documents. The output is commonly a set of topics consisting of isolated tokens that often co-occur in such documents. Interpreting a topic's description from such tokens often requires manual effort. However, from a human's perspective, such outputs may not provide enough information to infer the meaning of the topics; thus, the topics are often interpreted inaccurately. Although several studies have attempted to automatically extend topic descriptions as a means of enhancing the interpretation of topic models, they rely on external language sources that may become unavailable, must be kept up to date to generate relevant results, and present privacy issues when training on or processing data. This paper presents a novel approach to extending the output of traditional topic modelling methods beyond a list of isolated tokens. This approach removes the dependence on external sources by using the textual data itself: high-scoring keywords are extracted from the data and mapped to the topic model's token outputs. To benchmark the proposed method against the state of the art, a comparative analysis against results produced by Large Language Models (LLMs) is presented. The results indicate that the proposed method resonates with the thematic coverage found in LLMs and often surpasses such models by bridging the gap between broad thematic elements and granular details. In addition, to demonstrate the generalisability of the proposed method, the approach was further evaluated with two other topic modelling methods as the underlying models and on a heterogeneous unseen dataset. To measure the interpretability of the proposed outputs against those of the traditional topic modelling approach, independent annotators manually scored each output on its quality and usefulness, as well as on the efficiency of the annotation task. The proposed approach achieved higher quality and usefulness, as well as higher efficiency in the annotation task, than the outputs of a traditional topic modelling method, indicating an increase in interpretability.
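
The core idea described in the abstract, extracting keywords from the corpus itself and mapping them onto a topic's isolated tokens, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the keyword extractor or the scoring function, so the stopword-delimited phrase extraction, the raw-frequency score, and the names `STOPWORDS`, `candidate_phrases` and `extend_topic` are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not the one used in the paper.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}


def candidate_phrases(text):
    """Return multi-word runs of consecutive non-stopword tokens.

    Stopwords and punctuation act as phrase boundaries; single-word runs
    are dropped because they add nothing beyond the topic token itself.
    """
    phrases, current = [], []
    for token in re.findall(r"[a-z]+|[^a-z\s]+", text.lower()):
        is_word = "a" <= token[0] <= "z"
        if is_word and token not in STOPWORDS:
            current.append(token)
        else:
            if len(current) > 1:
                phrases.append(" ".join(current))
            current = []
    if len(current) > 1:
        phrases.append(" ".join(current))
    return phrases


def extend_topic(topic_tokens, documents, top_n=3):
    """Map each topic token to the highest-scoring phrases that contain it.

    Assumption: the score is plain corpus frequency, standing in for
    whatever keyword score the paper actually uses.
    """
    counts = Counter(p for doc in documents for p in candidate_phrases(doc))
    extended = {}
    for token in topic_tokens:
        matches = [(p, c) for p, c in counts.items() if token in p.split()]
        # Highest count first; ties broken alphabetically for determinism.
        matches.sort(key=lambda pc: (-pc[1], pc[0]))
        extended[token] = [p for p, _ in matches[:top_n]]
    return extended


docs = [
    "Topic modelling is a text mining technique.",
    "Text mining of social media data is common.",
    "Topic modelling and text mining are related.",
]
result = extend_topic(["mining", "media"], docs)
```

Here the isolated tokens `mining` and `media` are each extended with phrases drawn only from `docs`, so no external language source is consulted. The topic tokens are supplied by hand in this sketch; in practice they would come from whichever topic model is used.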

Item Type:	Article
Date Type:	Publication
Status:	Published
Schools:	Schools > Computer Science & Informatics; Schools > Social Sciences (Includes Criminology and Education)
Publisher:	MDPI
ISSN:	2504-2289
Funders:	ESRC
Date of First Compliant Deposit:	23 April 2024
Date of Acceptance:	23 April 2024
Last Modified:	23 May 2024 09:50
URI:	https://orca.cardiff.ac.uk/id/eprint/168216
