Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Substring reduction and frequency consolidation among word N-Grams: suggestion for a procedure

Buerki, Andreas 2011. Substring reduction and frequency consolidation among word N-Grams: suggestion for a procedure. Presented at: 6th International Newcastle Postgraduate Conference in Linguistics, Newcastle, 7 April 2011.

Full text not available from this repository.


Multiword units (MWUs), also known as linguistic prefabs or chunks, are increasingly recognised not only as central to language, but as posing a significant challenge to the traditional conception of linguistic knowledge as consisting of a distinct lexicon on the one hand and a set of combinatory rules on the other. Identification of MWUs in texts is an important task which has attracted much recent research. A related, but much less investigated problem, however, is the substring reduction and frequency consolidation of MWU-candidates of different lengths (i.e. word n-grams). This paper proposes a simple algorithm enhanced by a preparatory stage which allows the consolidation of frequencies of substrings, even if these are not of the same frequency as the superstring. The proposed algorithm is able to deal with a theoretically unlimited number of levels (i.e. lengths of n-grams). While this procedure works best with contiguous word n-grams and requires the exclusion of n-grams that span sentence boundaries, it is able to keep track of frequencies and any errors in a manner that reduces inaccuracies to a minimum. An accurate substring reduction and consolidation of frequencies is useful in many respects: besides reducing the number of MWU-candidates that need to be analysed, such a procedure can establish the most common form (i.e. length) in an MWU-candidate cluster (thus assisting identification of MWUs) and help answer theoretical questions about the quantitative importance of MWUs as well as other questions of interest to corpus-based linguistic MWU research.
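To make the idea of substring reduction and frequency consolidation concrete, the following is a minimal illustrative sketch in Python. It is not the procedure proposed in the paper, whose details (including the preparatory stage) are not given in this record. The function name `consolidate`, the representation of n-grams as tuples of words, and the choice to clamp frequencies at zero are all assumptions made for illustration. The sketch assumes contiguous n-grams that do not span sentence boundaries.

```python
def consolidate(freqs):
    """Subtract each n-gram's consolidated frequency from its substrings.

    `freqs` maps n-grams (tuples of words) to raw corpus frequencies.
    Hypothetical helper for illustration only, not the paper's algorithm.
    """
    result = dict(freqs)
    # Process longest n-grams first, so that a superstring's own frequency
    # is already consolidated before it is used to reduce its substrings.
    for gram in sorted(freqs, key=len, reverse=True):
        f = result[gram]
        if f <= 0:
            continue
        n = len(gram)
        # Visit every contiguous substring at every shorter length,
        # position by position (a repeated substring is reduced once
        # per position at which it occurs inside the superstring).
        for k in range(n - 1, 0, -1):
            for i in range(n - k + 1):
                sub = gram[i:i + k]
                if sub in result:
                    # Clamping at zero is an assumption of this sketch.
                    result[sub] = max(0, result[sub] - f)
    return result


raw = {
    ("a", "b", "c", "d"): 2,
    ("a", "b", "c"): 5,
    ("b", "c", "d"): 2,
    ("a", "b"): 5,
    ("b", "c"): 5,
    ("c", "d"): 2,
}
consolidated = consolidate(raw)
```

In this toy input, "a b c" occurs five times, two of them inside "a b c d", so its consolidated frequency is 3 and all of its shorter substrings drop to 0. Note that when a longer superstring is missing from the list (for example because of a frequency cutoff), two overlapping n-grams can both reduce a shared substring for the same occurrence; the clamp at zero hides rather than corrects that over-reduction, which is the kind of inaccuracy a more careful procedure has to track.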

Item Type: Conference or Workshop Item (Paper)
Date Type: Completion
Status: Unpublished
Schools: English, Communication and Philosophy
Subjects: P Language and Literature > P Philology. Linguistics
Last Modified: 28 Oct 2022 10:21
