Buerki, Andreas ORCID: https://orcid.org/0000-0003-2151-3246 2011. Substring reduction and frequency consolidation among word N-Grams: suggestion for a procedure. Presented at: 6th International Newcastle Postgraduate Conference in Linguistics, Newcastle, 7 April 2011.
Abstract
Multiword units (MWUs), also known as linguistic prefabs or chunks, are increasingly recognised not only as central to language, but as posing a significant challenge to the traditional conception of linguistic knowledge as consisting of a distinct lexicon on the one hand and a set of combinatory rules on the other. Identification of MWUs in texts is an important task which has attracted much recent research. A related, but much less investigated problem, however, is the substring reduction and frequency consolidation of MWU-candidates of different lengths (i.e. word n-grams). This paper proposes a simple algorithm enhanced by a preparatory stage which allows the consolidation of frequencies of substrings, even if these are not of the same frequency as the superstring. The proposed algorithm is able to deal with a theoretically unlimited number of levels (i.e. lengths of n-grams). While this procedure works best with contiguous word n-grams and requires the exclusion of n-grams that span sentence boundaries, it is able to keep track of frequencies and any errors in a manner that reduces inaccuracies to a minimum. An accurate substring reduction and consolidation of frequencies is useful in many respects: besides reducing the number of MWU-candidates that need to be analysed, such a procedure can establish the most common form (i.e. length) in an MWU-candidate cluster (thus assisting identification of MWUs) and help answer theoretical questions about the quantitative importance of MWUs as well as other questions of interest to corpus-based linguistic MWU research.
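The abstract does not spell out the procedure itself, so the following is only a rough sketch of the general idea of substring reduction with frequency consolidation, not the algorithm proposed in the paper. It assumes contiguous word n-grams stored as tuples with their raw frequencies, and it works from the longest n-grams down, deducting each superstring's remaining frequency from every shorter n-gram it contains. The function names (`substrings`, `consolidate`), the `min_len` parameter and the example counts are all invented for illustration. Overlapping superstrings can cause too much to be deducted from a shared substring; this naive version simply clamps at zero instead of doing the more careful error tracking the abstract mentions.

```python
from collections import Counter


def substrings(ngram, min_len=2):
    """Yield every proper contiguous sub-n-gram of `ngram`, once per position."""
    n = len(ngram)
    for length in range(n - 1, min_len - 1, -1):
        for start in range(n - length + 1):
            yield ngram[start:start + length]


def consolidate(counts, min_len=2):
    """Naive substring reduction (illustrative assumption, not the paper's method).

    `counts` maps word n-grams (tuples of words) to raw frequencies.
    Returns the n-grams whose consolidated frequency is still above zero.
    """
    remaining = Counter(counts)
    # Longest n-grams first: a superstring's frequency is final before it is
    # deducted from the shorter n-grams it contains.
    for ngram in sorted(counts, key=len, reverse=True):
        freq = remaining[ngram]
        if freq <= 0:
            continue
        for sub in substrings(ngram, min_len):
            if sub in remaining:
                # Clamp at zero: overlapping superstrings may over-deduct.
                remaining[sub] = max(0, remaining[sub] - freq)
    return {g: f for g, f in remaining.items() if f > 0}


# Made-up frequencies, for illustration only.
counts = {
    ("in", "the", "middle", "of"): 5,
    ("in", "the", "middle"): 8,
    ("the", "middle", "of"): 5,
    ("in", "the"): 20,
    ("the", "middle"): 8,
    ("middle", "of"): 5,
}
result = consolidate(counts)
for ngram, freq in result.items():
    print(" ".join(ngram), freq)
```

With these invented counts, "in the middle" keeps a consolidated frequency of 3 even though its raw frequency (8) differs from that of its superstring "in the middle of" (5), which is the situation the abstract singles out. The fully absorbed candidates ("the middle of", "the middle", "middle of") drop out of the list.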
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Date Type: | Completion |
Status: | Unpublished |
Schools: | English, Communication and Philosophy |
Subjects: | P Language and Literature > P Philology. Linguistics |
Last Modified: | 28 Oct 2022 10:21 |
URI: | https://orca.cardiff.ac.uk/id/eprint/77966 |