Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Core-set selection for data-efficient land cover segmentation

Nogueira, Keiller, Zaytar, Akram, Ma, Wanli, Roscher, Ribana, Hänsch, Ronny, Robinson, Caleb, Ortiz, Anthony, Nsutezo, Simone, Dodhia, Rahul, Lavista Ferres, Juan M., Karakus, Oktay ORCID: https://orcid.org/0000-0001-8009-9319 and Rosin, Paul L. ORCID: https://orcid.org/0000-0002-4965-3884 2026. Core-set selection for data-efficient land cover segmentation. IEEE Access 10.1109/access.2026.3659734

PDF - Accepted Post-Print Version
Available under License Creative Commons Attribution.


Abstract

The increasing accessibility of remotely sensed data and their potential to support large-scale decision-making have driven the development of deep learning models for many Earth Observation tasks. Traditionally, such models rely on large datasets. However, the common assumption that larger training datasets lead to better performance tends to overlook data redundancy, noise, and the computational cost of processing massive datasets. Effective solutions must therefore consider not only the quantity but also the quality of data. To this end, we introduce six basic core-set selection approaches, which rely on imagery only, labels only, or a combination of both, and investigate whether they can identify high-quality subsets of data that maintain, or even surpass, the performance achieved with the full datasets for remote sensing semantic segmentation. We benchmark these approaches against two traditional baselines on three widely used land-cover classification datasets (DFC2022, Vaihingen, and Potsdam) using two different architectures (SegFormer and U-Net), thus establishing a general baseline for future work. Our experiments show that all proposed methods consistently outperform the baselines across multiple subset sizes, with some approaches even selecting core sets that surpass training on all available data. Notably, on DFC2022, a selected subset comprising only 25% of the training data yields slightly higher SegFormer performance than training with the entire dataset. This result shows the importance and potential of data-centric learning for the remote sensing domain. The code is available at https://github.com/keillernogueira/data-centric-rs-classification/.

Item Type: Article
Date Type: Published Online
Status: In Press
Schools: Schools > Computer Science & Informatics
Publisher: Institute of Electrical and Electronics Engineers
ISSN: 2169-3536
Date of First Compliant Deposit: 9 February 2026
Last Modified: 09 Feb 2026 14:00
URI: https://orca.cardiff.ac.uk/id/eprint/184548
