Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Measuring and matching corpus size cross-linguistically

Buerki, Andreas 2017. Measuring and matching corpus size cross-linguistically. Presented at: Inter-Varietal Applied Corpus Studies Annual Symposium, Cardiff University, Cardiff, UK, 4 March 2017.

Full text not available from this repository.


Corpus size has traditionally been measured in number of words. When working with a single (European) language, this is adequate in most circumstances. Problems arise when corpora are compared cross-linguistically: accurately measuring the amount of linguistic material represented in a corpus is pivotal for many analyses, including such corpus-linguistic staples as keyword analyses, frequency measurements of linguistic items and normalisation of frequencies. Yet words in different languages are very different things. Isolating languages like English use a high number of words compared to synthetic and polysynthetic languages, which use drastically fewer words to express similar messages (e.g. German Donaudampfschifffahrtsgesellschaftskapitän, i.e. 'captain of the Danube steamship company'). Further, in many languages it is not immediately clear how the concept of the (orthographic) word itself should be operationalized. Parallel corpora circumvent this problem by using translation equivalents, but translated texts are known to differ from non-translated texts, not least in length, which makes this solution less than ideal for many applications. In this paper, I outline a number of ways in which the challenge of producing comparably sized corpora of different languages might be approached. I focus on the example of a trilingual Wikipedia corpus of German, Korean and English totalling 93 million words, produced to facilitate an investigation into universal aspects of formulaic language that required precisely measured corpus sizes. Drawing on recent work on linguistic complexity, I illustrate how the syllable level can be used as a level of comparison via an information density estimate based either on samples of mixed-direction translations or on the syllable inventory of a language. Finally, the magnitude of the differences in word count between otherwise comparably sized sub-corpora of this Wikipedia corpus illustrates the importance of measures of corpus size that are more accurate than word counts.
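
As a rough illustration of the syllable-based idea described in the abstract, the sketch below shows one way a relative information density per syllable might be estimated from translation samples and then used to express corpus size and normalised frequencies in comparable units. This is not the procedure used in the paper, whose details are not given here. The function names, the choice of English as the reference language and all figures are hypothetical, chosen only to make the arithmetic visible.

```python
# A minimal sketch, assuming that information density per syllable can be
# approximated from translation samples expressing the same content.
# All names and numbers are hypothetical and are not taken from the paper.

def relative_density(sample_syllables, reference):
    """Density per syllable of each language relative to a reference language.

    sample_syllables maps a language code to the number of syllables that
    language needed to express the same sample content.
    """
    ref = sample_syllables[reference]
    return {lang: ref / count for lang, count in sample_syllables.items()}


def adjusted_size(corpus_syllables, density):
    """Corpus size in reference-language syllable equivalents."""
    return {lang: corpus_syllables[lang] * density[lang] for lang in corpus_syllables}


def per_million(count, size):
    """Frequency of an item normalised per million size units."""
    return count / size * 1_000_000


# Hypothetical syllable counts for the same sample content in three languages.
samples = {"en": 1000, "de": 1250, "ko": 1600}
density = relative_density(samples, reference="en")

# Hypothetical raw syllable counts of three sub-corpora.
corpora = {"en": 2_000_000, "de": 2_500_000, "ko": 3_200_000}
sizes = adjusted_size(corpora, density)

# Normalised frequency of an item that occurs 50 times in the German sub-corpus.
freq_de = per_million(50, sizes["de"])
```

With these invented figures, the three sub-corpora differ considerably in raw syllable count but come out at the same adjusted size, which is the kind of comparability the abstract argues word counts alone cannot provide.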

Item Type: Conference or Workshop Item (Paper)
Date Type: Completion
Status: Unpublished
Schools: English, Communication and Philosophy
Subjects: P Language and Literature > P Philology. Linguistics
Last Modified: 21 Oct 2022 06:54
