Bian, Pengxin, Theodorakopoulos, Georgios ![]() ![]() |
Preview |
PDF
- Accepted Post-Print Version
Download (2MB) | Preview |
Preview |
PDF (Supplementary Material)
- Supplemental Material
Download (1MB) | Preview |
Abstract
Strings (sequences of elements) are often disseminated to support applications, e.g., in bioinformatics, web analysis, and transportation. Unfortunately, this may expose sensitive patterns that model confidential knowledge. Concealing the occurrences of sensitive patterns in a string (e.g. by deleting some elements) while minimizing the associated quality loss has been the objective of several string sanitization methods. However, real-world attackers are likely to possess background knowledge about the string, e.g., an individual's genome sequence is almost identical to a reference genome sequence. In addition, it is good security practice to assume that the attacker will know the algorithm that has been used to sanitize the string (Kerckhoffs’ principle). Yet, all existing methods fail to protect strings against such attackers, risking privacy breaches in critical applications. In our work, we consider for the first time how to defend against strategic attackers who possess such knowledge. To achieve this, we propose a novel framework to sanitize a string by probabilistically replacing carefully selected patterns. As part of this framework, we design three mathematical programming algorithms which compute the optimal replacement probabilities under different objectives and constraints, offering different privacy gain / quality loss tradeoffs. Our algorithms protect against strategic attackers using new concepts and measures, protect sensitive patterns of any length, and can construct one or more optimally sanitized strings that can be used in applications such as frequent pattern mining. Our experiments using five real-world datasets from different domains show that all our algorithms are substantially more effective than a natural baseline (e.g., they offer up to 2 times more privacy when they are configured to incur the same quality loss, and up to 3 times lower quality loss when they are configured to offer the same privacy). They also show that two "hybrid" algorithms that we propose, based on combining elements of the above algorithms, inherit the advantages of their constituent algorithms. These results, coupled with the generality of our approach, make our algorithms practical and beneficial for deployment.
Item Type: | Article |
---|---|
Date Type: | Published Online |
Status: | In Press |
Schools: | Schools > Computer Science & Informatics |
Subjects: | Q Science > QA Mathematics > QA75 Electronic computers. Computer science |
Publisher: | Institute of Electrical and Electronics Engineers |
ISSN: | 1556-6013 |
Date of First Compliant Deposit: | 12 September 2025 |
Date of Acceptance: | 30 August 2025 |
Last Modified: | 22 Sep 2025 13:00 |
URI: | https://orca.cardiff.ac.uk/id/eprint/181065 |
Actions (repository staff only)
![]() |
Edit Item |