Liu, Han
PDF - Accepted Post-Print Version (479kB)
Abstract
In machine learning tasks, it is essential for a data set to be partitioned into a training set and a test set in a specific ratio. The training set is used for learning a model that makes predictions on new instances, whereas the test set is used for evaluating the prediction accuracy of that model on new instances. In the context of human learning, a training set can be viewed as learning material that covers knowledge, whereas a test set can be viewed as an exam paper that provides questions for students to answer. In practice, data partitioning has typically been done by randomly selecting 70% of the instances for training and using the rest for testing. In this paper, we argue that random data partitioning is likely to cause a sample representativeness issue, i.e., training and test instances show very dissimilar characteristics, which is similar to testing students on material that was not taught. To address this issue, we propose a subclass-based semi-random data partitioning approach. The experimental results show that the proposed approach leads to significant advances in learning performance due to the improvement of sample representativeness.
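To make the contrast in the abstract concrete, the following is a minimal Python sketch, not the paper's algorithm. It compares a purely random 70/30 split with a split that applies the same ratio separately within each group of instances. The abstract does not say how subclasses are derived, so the sketch assumes each instance already carries a class label and a subclass label; the names `random_partition`, `grouped_partition`, and `group_of` are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def random_partition(instances, train_ratio=0.7, seed=0):
    """Purely random split: shuffle everything, then cut at the ratio."""
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    cut = round(train_ratio * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def grouped_partition(instances, group_of, train_ratio=0.7, seed=0):
    """Semi-random split: the ratio is enforced inside each group, and
    only the choice of instances within a group is random.

    `group_of` is an assumed helper that maps an instance to its group
    key, e.g. a (class, subclass) pair.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for inst in instances:
        groups[group_of(inst)].append(inst)

    train, test = [], []
    for members in groups.values():
        rng.shuffle(members)
        cut = round(train_ratio * len(members))
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# Toy data: (id, class label, assumed subclass label), 10 per group.
data = [(i, cls, sub)
        for i, (cls, sub) in enumerate(
            [(c, s) for c in ("pos", "neg") for s in ("a", "b")] * 10)]

train, test = grouped_partition(data, group_of=lambda x: (x[1], x[2]))
```

With the grouped split, every (class, subclass) group contributes the same proportion of its instances to both sets, so the test set cannot be dominated by a group that is missing from the training set. A purely random split gives no such guarantee, especially on small or imbalanced data.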
Item Type: | Article |
---|---|
Date Type: | Publication |
Status: | Published |
Schools: | Computer Science & Informatics |
Publisher: | Elsevier |
ISSN: | 0020-0255 |
Date of First Compliant Deposit: | 8 November 2018 |
Date of Acceptance: | 4 November 2018 |
Last Modified: | 02 Dec 2024 11:00 |
URI: | https://orca.cardiff.ac.uk/id/eprint/116524 |
Citation Data
Cited 3 times in Scopus.