Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

HOAXPEDIA: A unified Wikipedia hoax articles dataset

Borkakoty, Hsuvas and Espinosa-Anke, Luis ORCID: https://orcid.org/0000-0001-6830-9176 2024. HOAXPEDIA: A unified Wikipedia hoax articles dataset. Presented at: 2024 Conference on Empirical Methods in Natural Language Processing, Miami, Florida, 12-16 November 2024. Published in: Lucie-Aimée, Lucie, Fan, Angela, Gwadabe, Tajuddeen, Johnson, Isaac, Petroni, Fabio and van Strien, Daniel eds. Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia. Association for Computational Linguistics, 53–66.
Item availability restricted.

PDF (2405.02175v3.pdf) - Published Version
Restricted to Repository staff only until 27 December 2024 due to copyright restrictions.


Abstract

Hoaxes are a recognised form of deliberately created disinformation, with potentially serious implications for the credibility of reference knowledge resources such as Wikipedia. What makes detecting Wikipedia hoaxes hard is that they are often written according to the official style guidelines. In this work, we first provide a systematic analysis of similarities and discrepancies between legitimate and hoax Wikipedia articles, and introduce HOAXPEDIA, a collection of 311 hoax articles (from existing literature and official Wikipedia lists) paired with semantically similar legitimate articles; the two sets form a binary text classification dataset aimed at fostering research in automated hoax detection. We report results from evaluating several language models, hoax-to-legit ratios, and the amount of text classifiers are exposed to (full article vs. the article’s definition alone). Our results suggest that detecting deceitful content in Wikipedia based on content alone is hard but feasible. We complement our analysis with a study of the differences in the distributions of edit histories, and find that looking at this feature yields better classification results than context.

Item Type: Conference or Workshop Item (Paper)
Date Type: Publication
Status: Published
Schools: Computer Science & Informatics
Publisher: Association for Computational Linguistics
Date of First Compliant Deposit: 27 November 2024
Date of Acceptance: 2 October 2024
Last Modified: 27 Nov 2024 10:16
URI: https://orca.cardiff.ac.uk/id/eprint/173692
