Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Semantically enhanced text stemmer (SETS) for cross-domain document clustering

Stankov, Ivan Dimitrov, Todorov, Diman and Setchi, Rossitza ORCID: https://orcid.org/0000-0002-7207-6544 2013. Semantically enhanced text stemmer (SETS) for cross-domain document clustering. Presented at: KES 2012: 16th International Conference on Knowledge-Based and Intelligent Information and Engineering Systems, San Sebastian, Spain, 10-12 September 2012. Published in: Grana, M., Toro, C., Howlett, R. J. and Jain, L. C. eds. Knowledge Engineering, Machine Learning and Lattice Computing with Applications: 16th International Conference, KES 2012, San Sebastian, Spain, September 10-12, 2012, Revised Selected Papers. Lecture Notes in Computer Science, vol. 7828. Berlin and Heidelberg: Springer, pp. 108-118. 10.1007/978-3-642-37343-5_12

Full text not available from this repository.

Abstract

This paper focuses on processing cross-domain document repositories, a task made difficult by word ambiguity and by the fact that monosemic words are more domain-oriented than polysemic ones. The paper describes a semantically enhanced text normalization algorithm (SETS) aimed at improving document clustering, and investigates the performance of the sk-means clustering algorithm across domains by comparing the cluster coherence produced with semantic-based and traditional (TF-IDF-based) document representations. The evaluation is conducted on 20 generic sub-domains of a thousand documents each, randomly selected from the Reuters-21578 corpus. The experimental results demonstrate improved coherence of the clusters produced by SETS compared to the text normalization obtained with the Porter stemmer. In addition, semantic-based text normalization is shown to be resistant to noise, which is often introduced in the index aggregation stage.
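
For readers unfamiliar with the baseline named in the abstract, the following is a minimal sketch of a TF-IDF document representation clustered with a cosine-based (spherical) k-means. It is not the SETS algorithm and not the authors' implementation: the function names, the toy documents, the choice of the first k documents as seeds, and the coherence measure (mean cosine similarity between each document and its centroid) are all assumptions made for illustration. Documents are assumed to be already tokenized and normalized, for example by a stemmer.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return unit-length sparse TF-IDF vectors (dicts) for tokenized documents."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: c * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # Documents whose terms all have zero weight become empty vectors.
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Dot product of two sparse unit vectors, i.e. their cosine similarity."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def spherical_kmeans(vectors, k, iterations=20):
    """Cluster unit vectors by cosine similarity.

    Assumption: the first k vectors are used as seeds, for determinism.
    """
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document goes to its most similar centroid.
        labels = [max(range(k), key=lambda j: cosine(v, centroids[j]))
                  for v in vectors]
        # Update step: each centroid is the normalized sum of its members.
        for j in range(k):
            total = Counter()
            for v, lab in zip(vectors, labels):
                if lab == j:
                    for t, w in v.items():
                        total[t] += w
            norm = math.sqrt(sum(w * w for w in total.values()))
            if norm:
                centroids[j] = {t: w / norm for t, w in total.items()}
    return labels, centroids


def coherence(vectors, labels, centroids):
    """Hypothetical coherence score: mean cosine of documents to their centroid."""
    return sum(cosine(v, centroids[l])
               for v, l in zip(vectors, labels)) / len(vectors)


# Toy documents (invented for illustration, not taken from Reuters-21578).
docs = [
    ["oil", "price", "barrel"],
    ["bank", "rate", "loan"],
    ["oil", "barrel", "crude"],
    ["bank", "loan", "credit"],
]
vectors = tfidf_vectors(docs)
labels, centroids = spherical_kmeans(vectors, k=2)
score = coherence(vectors, labels, centroids)
print(labels, round(score, 3))
```

Swapping the tokens passed to `tfidf_vectors` for the output of a different normalizer, while keeping the clustering and the score fixed, is one way to set up the kind of comparison the abstract describes.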

Item Type: Conference or Workshop Item (Paper)
Date Type: Publication
Status: Published
Schools: Engineering
Subjects: T Technology > TA Engineering (General). Civil engineering (General)
Additional Information: Knowledge Engineering, Machine Learning and Lattice Computing with Applications. 16th International Conference, KES 2012, San Sebastian, Spain, September 10-12, 2012, Revised Selected Papers Series: Lecture Notes in Computer Science, Vol. 7828 Subseries: Lecture Notes in Artificial Intelligence
Publisher: Springer
ISBN: 9783642373428
ISSN: 0302-9743
Last Modified: 06 Jul 2023 10:18
URI: https://orca.cardiff.ac.uk/id/eprint/47607
