Enhanced Cross-Domain Document Clustering with a Semantically Enhanced Text Stemmer (SETS)

Stankov, Ivan Dimitrov, Todorov, Diman and Setchi, Rossitza

2013. Enhanced Cross-Domain Document Clustering with a Semantically Enhanced Text Stemmer (SETS). International Journal of Knowledge-based and Intelligent Engineering Systems 17 (2), pp. 113-126. 10.3233/KES-130267

Full text not available from this repository.

Official URL: http://dx.doi.org/10.3233/KES-130267

Abstract

The aim of document clustering is to produce coherent clusters of similar documents. Clustering algorithms rely on text normalisation techniques to represent and cluster documents. Although most document clustering algorithms perform well in specific knowledge domains, processing cross-domain document repositories remains a challenge, which this paper attempts to address. It investigates the performance of the sk-means clustering algorithm across domains by comparing the cluster coherence produced with semantic-based and traditional (TF-IDF-based) document representations. The evaluation is conducted on 20 different generic sub-domains of a thousand documents, each randomly selected from the Reuters-21578 corpus. The experimental results demonstrate improved coherence of clusters produced using a semantically enhanced text stemmer (SETS), compared with the text normalisation obtained with the Porter stemmer. In addition, semantic-based text normalisation is shown to be resistant to noise, which is often introduced in the index aggregation stage, the stage that acquires features to represent documents.
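The pipeline the abstract describes (normalise text, build TF-IDF vectors, cluster, measure coherence) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that sk-means denotes spherical k-means (k-means on unit-length vectors with cosine similarity), uses a simple smoothed IDF, seeds the centroids with the first k documents, and scores coherence as the mean cosine similarity of each document to its centroid. All function names are hypothetical. The `normalise` parameter marks where a Porter stemmer or SETS would be plugged in; neither is implemented here.

```python
import math
from collections import Counter


def unit(vec):
    """Scale a sparse vector (dict term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def cosine(a, b):
    """Dot product of two unit-length sparse vectors, i.e. their cosine."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def tfidf_vectors(docs, normalise=str.lower):
    """Build unit-length TF-IDF vectors from whitespace-tokenised documents.

    `normalise` is the text normalisation hook (assumption: a stemmer
    would be passed here instead of plain lower-casing).
    """
    tokenised = [[normalise(t) for t in d.split()] for d in docs]
    n = len(tokenised)
    df = Counter(t for toks in tokenised for t in set(toks))
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)
        # Smoothed IDF (assumption): log(n / df) + 1 keeps every weight positive.
        vectors.append(unit({t: c * (math.log(n / df[t]) + 1.0) for t, c in tf.items()}))
    return vectors


def spherical_kmeans(vectors, k, iterations=20):
    """Cluster unit vectors by cosine similarity; returns (labels, centroids)."""
    centroids = [dict(v) for v in vectors[:k]]  # deterministic seeding (assumption)
    labels = [None] * len(vectors)
    for _ in range(iterations):
        new_labels = [
            max(range(k), key=lambda j: cosine(v, centroids[j])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            total = {}
            for v, lab in zip(vectors, labels):
                if lab == j:
                    for t, w in v.items():
                        total[t] = total.get(t, 0.0) + w
            if total:  # keep the old centroid if the cluster is empty
                centroids[j] = unit(total)
    return labels, centroids


def coherence(vectors, labels, centroids):
    """Mean cosine similarity between each document and its cluster centroid."""
    return sum(cosine(v, centroids[l]) for v, l in zip(vectors, labels)) / len(vectors)


docs = [
    "oil prices rise crude oil",
    "wheat harvest grain export",
    "crude oil barrel prices",
    "grain wheat crop export",
]
vectors = tfidf_vectors(docs)
labels, centroids = spherical_kmeans(vectors, k=2)
score = coherence(vectors, labels, centroids)
print(labels, round(score, 3))
```

Comparing two normalisation schemes in this sketch amounts to calling `tfidf_vectors` with two different `normalise` functions and comparing the resulting `coherence` scores on the same documents.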

Item Type:	Article
Date Type:	Publication
Status:	Published
Schools:	Schools > Engineering
Subjects:	T Technology > TA Engineering (General). Civil engineering (General)
Uncontrolled Keywords:	Semantics, stemming, cluster coherency, partitional clustering
Publisher:	IOS Press
ISSN:	1327-2314
Last Modified:	06 Jul 2023 10:18
URI:	https://orca.cardiff.ac.uk/id/eprint/48251

Citation Data

Cited 2 times in Scopus.
