Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

What genre is Wikipedia?

Buerki, Andreas 2021. What genre is Wikipedia? Presented at: Corpus Linguistics International Conference 2021 (CL2021), Limerick, Ireland, 13-16 July 2021.

Video (MPEG) (presentation) - Published Version
Available under License Creative Commons Attribution No Derivatives.

Download (63MB)


Producing large language corpora is not only highly work-intensive but also increasingly difficult (and in some cases impossible) because of legal hurdles, from copyright to data protection legislation, as well as potentially extensive ethics approval procedures at research institutions. Meanwhile, a vast, topically diverse and multilingual resource is readily available for use and redistribution in the form of Wikipedia. An important determinant of the corpus-linguistic usefulness of this resource is the question of what generalisations can be drawn from linguistic observations derived from Wikipedia texts. One aspect of this, in turn, concerns the extent to which Wikipedia texts represent a unique genre, separate from the genres one might find in a reference corpus; in other words, it concerns the question of what genre Wikipedia is. In this presentation, I approach this question from the point of view of language-internal markers of genre that correlate with external designations such as those used by corpus compilers. Gries (2010) showed that one such marker, the Gravity measure of relative association strength in shared bigrams (Daudaravičius and Marcinkevičienė, 2004), is particularly promising: working from 19 sub-registers of the BNC Baby, Gries was able to perfectly re-create the four main register groupings of the BNC Baby using hierarchical clustering. Employing the same technique, I show that Wikipedia texts form a separately distinguishable genre grouping, on a par with the BNC Baby’s other genre categories. Wikipedia texts pattern nearest to journalistic prose and in the vicinity of the BNC Baby’s fiction and academic genres and, together with the other written genres, are clearly distinguished from the BNC Baby’s spoken language genres. This confirms that Wikipedia texts are not representative of language in the way a reference corpus is, but equally that Wikipedia texts group with other written genres of the BNC Baby.
Finally, I reflect on the methodological merits of the approach taken and its robustness.
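
As a rough illustration of the kind of measure the abstract refers to, the sketch below computes a bigram gravity score in the commonly cited form G(w1, w2) = log(f(w1, w2) · n_right(w1) / f(w1)) + log(f(w1, w2) · n_left(w2) / f(w2)), where n_right and n_left count the distinct word types seen immediately after w1 and immediately before w2. This is a minimal sketch under stated assumptions, not the implementation used in the presentation: tokenisation by whitespace, the function names, and in particular the correlation-based distance over shared bigrams are hypothetical choices made here for illustration. A pairwise distance matrix built this way could then be passed to any hierarchical clustering routine.

```python
import math
from collections import Counter, defaultdict


def bigram_gravity(tokens):
    """Return {(w1, w2): gravity} for every adjacent token pair in `tokens`."""
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    right_types = defaultdict(set)  # distinct words seen directly after w1
    left_types = defaultdict(set)   # distinct words seen directly before w2
    for w1, w2 in bigram:
        right_types[w1].add(w2)
        left_types[w2].add(w1)
    scores = {}
    for (w1, w2), freq in bigram.items():
        scores[(w1, w2)] = (
            math.log(freq * len(right_types[w1]) / unigram[w1])
            + math.log(freq * len(left_types[w2]) / unigram[w2])
        )
    return scores


def shared_bigram_distance(tokens_a, tokens_b):
    """Hypothetical distance: 1 - Pearson r of gravity over shared bigrams.

    This comparison step is an assumption made for illustration; it is not
    taken from the presentation or the works it cites.
    """
    grav_a = bigram_gravity(tokens_a)
    grav_b = bigram_gravity(tokens_b)
    shared = sorted(set(grav_a) & set(grav_b))
    if len(shared) < 2:
        raise ValueError("need at least two shared bigrams")
    xs = [grav_a[b] for b in shared]
    ys = [grav_b[b] for b in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("gravity values are constant; correlation undefined")
    return 1 - cov / (sd_x * sd_y)


if __name__ == "__main__":
    sample = "a b a b a c".split()
    for pair, score in sorted(bigram_gravity(sample).items()):
        print(pair, round(score, 4))
```

In practice each text sample (for example a sub-register or a set of Wikipedia articles) would be tokenised once, and the distances between all pairs of samples would form the input to the clustering step.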

Item Type: Conference or Workshop Item (Paper)
Status: In Press
Schools: English, Communication and Philosophy
Subjects: P Language and Literature > P Philology. Linguistics
Date of First Compliant Deposit: 27 June 2021
Last Modified: 09 Nov 2022 11:10
