Vyas, Sanyam 2025. Adversarial backdoors in Deep Reinforcement Learning: Novel supply chain vulnerabilities and detection strategies. PhD thesis, Cardiff University. Item availability restricted.
PDF (Thesis) - Accepted Post-Print Version. Available under a Creative Commons Attribution Non-commercial No Derivatives licence. Download (2MB).
PDF (Cardiff University Electronic Publication Form) - Supplemental Material. Restricted to repository staff only. Download (299kB).
Abstract
The rapid integration of Deep Reinforcement Learning (DRL) into real-world infrastructure such as autonomous vehicles, cyber defence and healthcare has introduced critical vulnerabilities in the AI supply chain. This thesis investigates adversarial backdoors (also known as trojans), in which an adversary embeds hidden malicious behaviours that remain dormant until activated by a specific environmental trigger. Unlike traditional adversarial examples, backdoors exploit the overparameterisation of neural networks to bypass standard validation protocols while maintaining high performance on clean data.

This research challenges the unrealistically high-privilege assumptions of the existing literature, which typically grants the adversary full control of the training pipeline. Instead, it introduces three novel attack vectors that operate under substantially reduced adversarial privileges. TrojanentRL shows that backdoors can be injected into DRL agents by corrupting under-audited auxiliary components, specifically the Rollout Buffer, without modifying the main codebase. InfrectroRL demonstrates data-free, post-training attacks that manipulate DRL model weights after training but before deployment. InfRLhammer introduces the first inference-phase-exclusive backdoor in DRL, using hardware-level Rowhammer-induced bit-flips in DRAM to subvert an agent trained under perfectly benign conditions.

To counter these threats, the thesis proposes Neural Watchdog, a real-time detection framework that identifies backdoors by monitoring internal neural activation patterns. It shifts the defence from external observations, which can be bypassed by in-distribution triggers, to the agent's internal neural activations. Experiments in MiniGrid and Atari show that the attacks evade state-of-the-art sanitisation defences while maintaining high clean-data accuracy. In practice, a compromised RL agent could cause catastrophic failure, from ignoring traffic cues to triggering harm or data exfiltration. Although the evaluation was limited to benchmark environments such as MiniGrid and Atari, the work expands the backdoor threat landscape and provides a clear pathway for extending these defences to domains such as Autonomous Cyber Network Defence and Autonomous Vehicles.
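As a rough illustration of why a single Rowhammer-induced bit-flip in DRAM can be so damaging to a trained model (this sketch is not taken from the thesis; the function name and example weight value are purely illustrative), the snippet below flips one bit of an IEEE-754 float32 value, as might happen to a network weight stored in memory:

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of a float32 and return the result."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    bits ^= 1 << bit
    (flipped,) = struct.unpack("<f", struct.pack("<I", bits))
    return flipped

w = 0.5  # an illustrative trained weight

# Flipping the top exponent bit (bit 30) turns 0.5 into 2**127, about 1.7e38,
# enough to completely dominate any downstream activation.
catastrophic = flip_bit(w, 30)

# Flipping the lowest mantissa bit (bit 0) changes the value only by 2**-24,
# which is invisible to clean-data evaluation.
subtle = flip_bit(w, 0)
```

The asymmetry is the point: an adversary who can target exponent bits of selected weights can implant drastic trigger-conditional behaviour, while flips that land in low mantissa bits leave clean performance essentially untouched.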
| Item Type: | Thesis (PhD) |
|---|---|
| Date Type: | Completion |
| Status: | Unpublished |
| Schools: | Schools > Computer Science & Informatics |
| Subjects: | Q Science > QA Mathematics > QA75 Electronic computers. Computer science; Q Science > QA Mathematics > QA76 Computer software |
| Funders: | EPSRC, Airbus |
| Date of First Compliant Deposit: | 18 March 2026 |
| Date of Acceptance: | 11 March 2026 |
| Last Modified: | 23 Mar 2026 16:29 |
| URI: | https://orca.cardiff.ac.uk/id/eprint/185850 |