Suarez-Alvarez, Maria M., Pham, Duc Truong, Prostov, Mikhail Y. and Prostov, Yuriy I. 2012. Statistical approach to normalization of feature vectors and clustering of mixed datasets. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 468 (2145), pp. 2630-2651. 10.1098/rspa.2011.0704
Abstract
Normalization of feature vectors of datasets is widely used in a number of fields of data mining, in particular in cluster analysis, where it is used to prevent features with large numerical values from dominating in distance-based objective functions. In this study, a unified statistical approach to normalization of all attributes of mixed databases, when different metrics are used for numerical and categorical data, is proposed. After the proposed normalization, the contributions of both numerical and categorical attributes to a specified objective function are statistically the same. Formulae for the statistically normalized Minkowski mixed p-metrics are given in an explicit way. It is shown that the classic z-score standardization and the min–max normalization are particular cases of the statistical normalization, when the objective function is, respectively, based on the Euclidean or the Tchebycheff (Chebyshev) metrics. Finally, clustering of several benchmark datasets is performed with non-normalized and introduced normalized mixed metrics using either the k-prototypes (for p=2) or another algorithm (for p≠2).
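As a rough illustration of the two classic normalizations named in the abstract, the sketch below applies z-score standardization and min–max normalization to two numerical features with very different scales. It does not reproduce the paper's statistical normalization or its mixed Minkowski metrics; the function names (`z_score`, `min_max`) and the sample columns (`ages`, `incomes`) are hypothetical and chosen only for this example.

```python
from math import sqrt
from statistics import mean, pstdev


def z_score(column):
    """Classic z-score: subtract the mean, divide by the population std."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


def min_max(column):
    """Classic min-max: rescale the feature linearly onto [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def euclidean(a, b):
    """Euclidean distance between two equal-length records."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical data: two features on very different numerical scales.
ages = [25.0, 35.0, 45.0, 55.0]
incomes = [20000.0, 40000.0, 60000.0, 80000.0]

# Raw distance between the first two records is dominated by income.
raw_distance = euclidean((ages[0], incomes[0]), (ages[1], incomes[1]))

# After z-scoring, each feature has zero mean and unit variance,
# so neither one dominates the squared Euclidean distance.
z_ages, z_incomes = z_score(ages), z_score(incomes)
z_distance = euclidean((z_ages[0], z_incomes[0]), (z_ages[1], z_incomes[1]))

# After min-max scaling, each feature spans exactly [0, 1].
m_ages, m_incomes = min_max(ages), min_max(incomes)
```

In this toy dataset both columns are evenly spaced, so their normalized values coincide and each feature contributes equally to the distance between any two records. Real data will not be this tidy, and categorical attributes need the separate treatment the paper describes.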
Item Type: | Article |
---|---|
Date Type: | Publication |
Status: | Published |
Schools: | Engineering |
Subjects: | T Technology > TA Engineering (General). Civil engineering (General) |
Uncontrolled Keywords: | clustering; normalization; standardization; Minkowski metrics; statistics |
Publisher: | Royal Society |
ISSN: | 1364-5021 |
Last Modified: | 10 Oct 2017 15:19 |
URI: | https://orca.cardiff.ac.uk/id/eprint/52925 |
Citation Data
Cited 63 times in Scopus.