Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

MaWGAN: a generative adversarial network to create synthetic data from datasets with missing data

Poudevigne-Durance, Thomas, Jones, Owen Dafydd ORCID: https://orcid.org/0000-0002-7300-5510 and Qin, Yipeng ORCID: https://orcid.org/0000-0002-1551-9126 2022. MaWGAN: a generative adversarial network to create synthetic data from datasets with missing data. Electronics 11 (6), 837. 10.3390/electronics11060837

PDF - Published Version
Available under License Creative Commons Attribution.

Abstract

The creation of synthetic data is important for a range of applications, for example, to anonymise sensitive datasets or to increase the volume of data in a dataset. When the target dataset has missing data, it is common simply to discard incomplete observations, even though this necessarily means some loss of information. However, when the proportion of missing data is large, discarding incomplete observations may not leave enough data to accurately estimate their joint distribution. Thus, there is a need for data synthesis methods capable of using datasets with missing data, to improve accuracy and, in more extreme cases, to make data synthesis possible. To achieve this, we propose a novel generative adversarial network (GAN) called MaWGAN (for masked Wasserstein GAN), which creates synthetic data directly from datasets with missing values. As with existing GAN approaches, the MaWGAN synthetic data generator generates samples from the full joint distribution. We introduce a novel methodology for comparing the generator output with the original data that does not require us to discard incomplete observations. It is based on a modification of the Wasserstein distance and is easily implemented using masks generated from the pattern of missing data in the original dataset. Numerical experiments are used to demonstrate the superior performance of MaWGAN compared to (a) discarding incomplete observations before using a GAN, and (b) imputing missing values (using the GAIN algorithm) before using a GAN.
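
The masking idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it only shows one plausible reading, in which a mask is built from the missing entries of a real batch and the same mask is applied to a generated batch before a critic scores both. The function names (`missing_mask`, `apply_mask`, `masked_critic_gap`) and the toy linear critic are assumptions made for illustration, and the gradient penalty and training loop of a Wasserstein GAN are omitted.

```python
import numpy as np


def missing_mask(real):
    """Return 1.0 where a value is observed and 0.0 where it is missing (NaN)."""
    return (~np.isnan(real)).astype(float)


def apply_mask(batch, mask):
    """Zero out masked positions; NaNs are replaced first so NaN * 0 cannot propagate."""
    filled = np.where(np.isnan(batch), 0.0, batch)
    return filled * mask


def masked_critic_gap(critic, real, fake):
    """Difference of mean critic scores on identically masked real and generated batches.

    Hypothetical helper: the mask comes from the real batch and is reused
    on the generated batch, so both are compared on observed entries only.
    """
    mask = missing_mask(real)
    return critic(apply_mask(real, mask)).mean() - critic(apply_mask(fake, mask)).mean()


# Toy stand-in for a critic network (an assumption, not a trained model).
def toy_critic(x):
    return x.sum(axis=1)


real = np.array([[1.0, np.nan], [2.0, 3.0]])  # one missing entry
fake = np.array([[1.0, 5.0], [2.0, 3.0]])     # fully generated rows
gap = masked_critic_gap(toy_critic, real, fake)  # 0.0: the 5.0 is masked out
```

In this toy case the generated value sitting in the position that is missing in the real data does not affect the comparison, which is the property the masks are meant to provide: incomplete observations still contribute through their observed entries.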

Item Type: Article
Date Type: Publication
Status: Published
Schools: Mathematics
Additional Information: This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/)
Publisher: MDPI
ISSN: 2079-9292
Date of First Compliant Deposit: 4 March 2022
Date of Acceptance: 4 March 2022
Last Modified: 22 Mar 2024 12:27
URI: https://orca.cardiff.ac.uk/id/eprint/148018
