Subclass-based semi-random data partitioning for improving sample representativeness

Liu, Han, Chen, Shyi-Ming and Cocea, Mihaela 2019. Subclass-based semi-random data partitioning for improving sample representativeness. Information Sciences 478, pp. 208-221. 10.1016/j.ins.2018.11.002

PDF - Accepted Post-Print Version

Official URL: http://dx.doi.org/10.1016/j.ins.2018.11.002

Abstract

In machine learning tasks, it is essential for a data set to be partitioned into a training set and a test set in a specific ratio. The training set is used to learn a model that makes predictions on new instances, whereas the test set is used to evaluate the prediction accuracy of that model on new instances. In the context of human learning, a training set can be viewed as learning material that covers knowledge, whereas a test set can be viewed as an exam paper that provides questions for students to answer. In practice, data partitioning has typically been done by randomly selecting 70% of the instances for training and the rest for testing. In this paper, we argue that random data partitioning is likely to result in a sample representativeness issue, i.e., training and test instances show very dissimilar characteristics, a situation similar to testing students on material that was not taught. To address this issue, we propose a subclass-based semi-random data partitioning approach. The experimental results show that the proposed data partitioning approach leads to significant advances in learning performance due to the improvement of sample representativeness.
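The contrast drawn in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not say how subclasses are derived, so the sketch simply assumes each instance already carries a group label (a class or subclass identifier supplied by the caller). The function names `random_split` and `per_group_split` and the `groups` argument are hypothetical, introduced only for illustration.

```python
import random
from collections import defaultdict


def random_split(n_instances, train_ratio=0.7, seed=0):
    """Purely random partition: shuffle all indices and cut once."""
    rng = random.Random(seed)
    indices = list(range(n_instances))
    rng.shuffle(indices)
    cut = round(train_ratio * n_instances)
    return sorted(indices[:cut]), sorted(indices[cut:])


def per_group_split(groups, train_ratio=0.7, seed=0):
    """Semi-random partition (illustrative assumption, not the paper's method):
    instances are shuffled only within their own group, and each group
    contributes the same proportion to the training set."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for index, group in enumerate(groups):
        by_group[group].append(index)

    train, test = [], []
    for members in by_group.values():
        rng.shuffle(members)
        cut = round(train_ratio * len(members))
        train.extend(members[:cut])
        test.extend(members[cut:])
    return sorted(train), sorted(test)


if __name__ == "__main__":
    # Hypothetical group labels: two groups of ten instances each.
    groups = ["a"] * 10 + ["b"] * 10
    train, test = per_group_split(groups)
    print("training indices:", train)
    print("test indices:", test)
```

With a purely random split, the share of each group in the training set varies from run to run; splitting within groups fixes that share, which is one simple way to keep training and test instances similar in composition.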

Item Type:	Article
Date Type:	Publication
Status:	Published
Schools:	Schools > Computer Science & Informatics
Publisher:	Elsevier
ISSN:	0020-0255
Date of First Compliant Deposit:	8 November 2018
Date of Acceptance:	4 November 2018
Last Modified:	02 Dec 2024 11:00
URI:	https://orca.cardiff.ac.uk/id/eprint/116524

Citation Data

Cited 3 times in Scopus.
