Stankov, Ivan Dimitrov, Todorov, Diman and Setchi, Rossitza ORCID: https://orcid.org/0000-0002-7207-6544 2013. Enhanced Cross-Domain Document Clustering with a Semantically Enhanced Text Stemmer (SETS). International Journal of Knowledge-based and Intelligent Engineering Systems 17 (2), pp. 113-126. 10.3233/KES-130267
Abstract
The aim of document clustering is to produce coherent clusters of similar documents. Clustering algorithms rely on text normalisation techniques to represent and cluster documents. Although most document clustering algorithms perform well in specific knowledge domains, processing cross-domain document repositories is still a challenge. This paper attempts to address this challenge. It investigates the performance of the sk-means clustering algorithm across domains, by comparing the cluster coherence produced with semantic-based and traditional (TF-IDF-based) document representations. The evaluation is conducted on 20 different generic sub-domains of a thousand documents, each randomly selected from the Reuters21578 corpus. The experimental results obtained from the evaluation demonstrate improved coherence of clusters produced by using a semantically enhanced text stemmer (SETS), when compared to the text normalisation obtained with the Porter stemmer. In addition, semantic-based text normalisation is shown to be resistant to noise, which is often introduced in the index aggregation stage, a stage that acquires features to represent documents.
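As a rough illustration of the traditional baseline the abstract refers to, the sketch below builds TF-IDF vectors and clusters them by cosine similarity. It assumes that "sk-means" denotes spherical k-means, which the abstract does not spell out, and that the documents are already tokenised and stemmed (in the paper that step would be done by the Porter stemmer or SETS, neither of which is reproduced here). The function names, the toy documents and the fixed seed indices are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def normalise(vec):
    """Scale a sparse vector (dict term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else dict(vec)


def tfidf_vectors(docs):
    """Build unit-length TF-IDF vectors from lists of already-stemmed tokens."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(normalise({t: c * math.log(n / df[t]) for t, c in tf.items()}))
    return vectors


def cosine(a, b):
    """Dot product of two sparse unit vectors, which equals their cosine."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def spherical_kmeans(vectors, seeds, iterations=20):
    """Assumed reading of 'sk-means': k-means with cosine similarity
    and centroids re-normalised to unit length after every update."""
    centroids = [dict(vectors[i]) for i in seeds]
    labels = [-1] * len(vectors)
    for _ in range(iterations):
        new_labels = [
            max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for k in range(len(centroids)):
            total = defaultdict(float)
            for v, label in zip(vectors, labels):
                if label == k:
                    for t, w in v.items():
                        total[t] += w
            if total:  # keep the previous centroid if a cluster is empty
                centroids[k] = normalise(total)
    return labels


# Toy, hypothetical documents (not drawn from the Reuters corpus).
docs = [
    ["oil", "price", "crude", "barrel"],
    ["crude", "oil", "opec", "barrel"],
    ["wheat", "grain", "harvest", "crop"],
    ["grain", "crop", "wheat", "export"],
]
labels = spherical_kmeans(tfidf_vectors(docs), seeds=[0, 2])
print(labels)  # [0, 0, 1, 1]
```

Swapping the token lists for ones produced by a different normaliser, while keeping the rest fixed, is one simple way to compare document representations in the spirit of the evaluation described above.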
Item Type: | Article |
---|---|
Date Type: | Publication |
Status: | Published |
Schools: | Engineering |
Subjects: | T Technology > TA Engineering (General). Civil engineering (General) |
Uncontrolled Keywords: | Semantics, stemming, cluster coherency, partitional clustering |
Publisher: | IOS Press |
ISSN: | 1327-2314 |
Last Modified: | 06 Jul 2023 10:18 |
URI: | https://orca.cardiff.ac.uk/id/eprint/48251 |
Citation Data
Cited 2 times in Scopus.