Borkakoty, Hsuvas and Espinosa-Anke, Luis ORCID: https://orcid.org/0000-0001-6830-9176
2024.
HOAXPEDIA: A unified Wikipedia hoax articles dataset.
Presented at: 2024 Conference on Empirical Methods in Natural Language Processing,
Miami, Florida,
12-16 November 2024.
Published in: Lucie-Aimée, Lucie, Fan, Angela, Gwadabe, Tajuddeen, Johnson, Isaac, Petroni, Fabio and van Strien, Daniel eds.
Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia.
Association for Computational Linguistics,
53–66.
Item availability restricted.
PDF (Published Version): restricted to repository staff only until 27 December 2024 due to copyright restrictions.
Abstract
Hoaxes are a recognised form of deliberately created disinformation, with potentially serious implications for the credibility of reference knowledge resources such as Wikipedia. What makes detecting Wikipedia hoaxes hard is that they are often written according to the official style guidelines. In this work, we first provide a systematic analysis of similarities and discrepancies between legitimate and hoax Wikipedia articles, and introduce HOAXPEDIA, a collection of 311 hoax articles (from existing literature and official Wikipedia lists), together with semantically similar legitimate articles, which together form a binary text classification dataset aimed at fostering research in automated hoax detection. We report results for several language models, hoax-to-legit ratios, and amounts of text that classifiers are exposed to (full article vs. the article’s definition alone). Our results suggest that detecting deceitful content in Wikipedia based on content alone is hard but feasible. We complement our analysis with a study of the differences in the distributions of edit histories, and find that looking at this feature yields better classification results than context.
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Date Type: | Publication |
Status: | Published |
Schools: | Computer Science & Informatics |
Publisher: | Association for Computational Linguistics |
Date of First Compliant Deposit: | 27 November 2024 |
Date of Acceptance: | 2 October 2024 |
Last Modified: | 27 Nov 2024 10:16 |
URI: | https://orca.cardiff.ac.uk/id/eprint/173692 |