Buerki, Andreas ORCID: https://orcid.org/0000-0003-2151-3246 2017. Frequency consolidation among word N-grams: a practical procedure. Mitkov, Ruslan, ed. Computational and Corpus-Based Phraseology, Lecture Notes in Computer Science, vol. 10596. Cham: Springer, pp. 432-446. (10.1007/978-3-319-69805-2)
Abstract
This paper considers the issue of frequency consolidation in lists of different length word n-grams (i.e. recurrent word sequences) extracted from the same underlying corpus. A simple algorithm – enhanced by a preparatory stage – is proposed which allows the consolidation of frequencies among lists of different length n-grams, from 2-grams to 6-grams and beyond. The consolidation adjusts the frequency count of each n-gram to the number of its occurrences minus its occurrences as part of longer n-grams. Among other uses, such a procedure aids linguistic analysis and allows the non-inflationary counting of word tokens that are part of frequent n-grams of various lengths, which in turn allows an assessment of the proportion of running text made up of recurring chunks. The proposed procedure delivers frequency consolidation and substring reduction among word n-grams and is independent of any particular method of n-gram extraction and filtering, making it applicable also in situations where full access to underlying corpora is unavailable.
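The definition of a consolidated frequency given in the abstract (occurrences minus occurrences as part of longer n-grams) can be illustrated with a small Python sketch. This is not the procedure proposed in the paper: that procedure works on n-gram lists alone, whereas this sketch assumes direct access to a toy token sequence and simply counts, for each listed n-gram, the occurrences that do not lie inside an occurrence of a longer listed n-gram. The function names `listed_occurrences` and `consolidate`, and the example data, are hypothetical and chosen only for illustration.

```python
def listed_occurrences(tokens, listed):
    """Return (start, end, ngram) for every occurrence of a listed n-gram."""
    spans = []
    for n in sorted({len(g) for g in listed}):
        for i in range(len(tokens) - n + 1):
            g = tuple(tokens[i:i + n])
            if g in listed:
                spans.append((i, i + n, g))
    return spans


def consolidate(tokens, listed):
    """Return (raw, consolidated) frequency dicts for the listed n-grams.

    An occurrence counts towards the consolidated frequency only if it is
    not contained in an occurrence of a strictly longer listed n-gram.
    """
    spans = listed_occurrences(tokens, listed)
    raw = {g: 0 for g in listed}
    consolidated = {g: 0 for g in listed}
    for start, end, g in spans:
        raw[g] += 1
        inside_longer = any(
            s <= start and end <= e and (e - s) > (end - start)
            for s, e, _ in spans
        )
        if not inside_longer:
            consolidated[g] += 1
    return raw, consolidated


# Toy data (assumed for illustration only).
tokens = "a b c d x a b c y b c".split()
listed = {("a", "b", "c", "d"), ("a", "b", "c"), ("b", "c")}
raw, consolidated = consolidate(tokens, listed)

# Tokens covered by listed n-grams, counted without inflation.
covered_tokens = sum(len(g) * f for g, f in consolidated.items())
```

In this toy case the raw frequencies of `a b c d`, `a b c` and `b c` are 1, 2 and 3, while the consolidated frequencies are 1 each. Summing n-gram length times consolidated frequency gives 9 of the 11 tokens, whereas the same sum over raw frequencies would give 16, more tokens than the text contains.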
| Item Type: | Book Section |
|---|---|
| Date Type: | Publication |
| Status: | Published |
| Schools: | Schools > English, Communication and Philosophy |
| Subjects: | P Language and Literature > P Philology. Linguistics |
| Publisher: | Springer |
| ISBN: | 9783319698045 |
| Date of First Compliant Deposit: | 9 November 2017 |
| Last Modified: | 03 Nov 2022 09:54 |
| URI: | https://orca.cardiff.ac.uk/id/eprint/106389 |