Alali, Abdulazeez and Theodorakopoulos, George
PDF - Accepted Post-Print Version
Available under License Creative Commons Attribution.
Abstract
Recent advances in deep learning have enabled the creation of natural-sounding synthesised speech. However, attackers have also utilised these technologies to conduct attacks such as phishing. Numerous public datasets have been created to facilitate the development of effective detection models. However, available datasets contain only entirely fake audio; therefore, detection models may miss attacks that replace a short section of the real audio with fake audio. In recognition of this problem, the current paper presents the RFP dataset, which comprises five distinct audio types: partial fake (PF), audio with noise, voice conversion (VC), text-to-speech (TTS), and real. The data are then used to evaluate several detection models, revealing that the available detection models incur a markedly higher equal error rate (EER) when detecting PF audio instead of entirely fake audio. The lowest EER recorded was 25.42%. Therefore, we believe that creators of detection models must seriously consider using datasets like RFP that include PF and other types of fake audio.
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Date Type: | Published Online |
Status: | Published |
Schools: | Computer Science & Informatics |
Publisher: | Springer |
ISBN: | 978-981-97-3972-1 |
Date of First Compliant Deposit: | 24 April 2024 |
Last Modified: | 05 Dec 2024 12:15 |
URI: | https://orca.cardiff.ac.uk/id/eprint/167934 |