Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Adversarial backdoors in Deep Reinforcement Learning: Novel supply chain vulnerabilities and detection strategies

Vyas, Sanyam 2025. Adversarial backdoors in Deep Reinforcement Learning: Novel supply chain vulnerabilities and detection strategies. PhD Thesis, Cardiff University.
Item availability restricted.

PDF (Thesis) - Accepted Post-Print Version (2MB)
Available under License Creative Commons Attribution Non-commercial No Derivatives.

PDF (Cardiff University Electronic Publication Form) - Supplemental Material (299kB)
Restricted to Repository staff only

Abstract

The rapid integration of Deep Reinforcement Learning (DRL) into real-world infrastructure such as autonomous vehicles, cyber defence and healthcare has introduced critical vulnerabilities in the AI supply chain. This thesis investigates adversarial backdoors (also known as trojans), in which an adversary embeds hidden malicious behaviours that remain dormant until activated by a specific environmental trigger. Unlike traditional adversarial examples, backdoors exploit the overparameterisation of neural networks to bypass standard validation protocols while maintaining high performance on clean data.

This research challenges the unrealistic high-privilege assumptions in the existing literature, which typically grants the adversary full control of the training pipeline. Instead, it introduces three novel attack vectors that operate under substantially reduced adversarial privileges. TrojanentRL shows that backdoors can be injected into DRL agents by corrupting underaudited auxiliary components, specifically the Rollout Buffer, without modifying the main codebase. InfrectroRL demonstrates data-free, post-training attacks that manipulate DRL model weights after training but before deployment. InfRLhammer introduces the first inference-phase-exclusive backdoor in DRL, using hardware-level Rowhammer-induced bit-flips in DRAM to subvert an agent trained under perfectly benign conditions.

To counter these threats, this thesis proposes Neural Watchdog, a real-time detection framework that identifies backdoors by monitoring internal neural activation patterns. It shifts defence from external observations, which can be bypassed by in-distribution triggers, to the agent's internal neural activations. Experiments in MiniGrid and Atari show that the attacks evade state-of-the-art sanitisation defences while maintaining high clean-data accuracy. In practice, a compromised RL agent could cause catastrophic failure, from ignoring traffic cues to triggering harm or data exfiltration. Although evaluation was limited to benchmark environments such as MiniGrid and Atari, the work expands the backdoor threat landscape and provides a clear pathway for extending these defences to domains such as Autonomous Cyber Network Defence and Autonomous Vehicles.
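The abstract notes that InfRLhammer subverts an agent via Rowhammer-induced bit-flips in DRAM. The thesis's actual attack is not reproduced here; the following minimal sketch (function name and bit choice are illustrative assumptions) only shows why a single bit-flip in an IEEE-754 float32 weight can be so damaging: flipping one exponent bit can change a weight by tens of orders of magnitude while leaving every other parameter untouched.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a float32 representation and return the result.

    Illustrative only: models the effect of a single DRAM bit-flip on a
    stored neural-network weight, not an actual Rowhammer exploit.
    """
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    as_int ^= 1 << bit          # flip the chosen bit (0 = LSB of mantissa)
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int))
    return flipped

# Flipping the most significant exponent bit (bit 30) of the weight 0.5
# turns it into exactly 2**127, roughly 1.7e38 — enough to dominate any
# downstream activation while all other weights remain plausible.
corrupted = flip_bit(0.5, 30)
# Flipping the lowest mantissa bit instead changes the weight negligibly.
benign = flip_bit(0.5, 0)
```

Because the perturbation is a single physical bit, such an attack leaves no trace in training data or code, which is why the abstract describes it as an inference-phase-exclusive threat.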
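Neural Watchdog is described above only at a high level: it monitors internal neural activations rather than external observations. The toy sketch below (class name, Gaussian profiling, and the z-score threshold are all assumptions for illustration, not the thesis's implementation) conveys the general idea of fitting a per-neuron profile on clean-run activations and flagging inputs whose hidden activations deviate sharply.

```python
import numpy as np

class ActivationWatchdog:
    """Toy activation monitor illustrating activation-based backdoor detection.

    Fits a per-neuron mean/std profile on hidden activations recorded from
    clean episodes, then flags any activation vector whose per-neuron
    z-score exceeds a threshold. This is a simplified stand-in for the
    kind of internal monitoring the thesis's Neural Watchdog performs.
    """

    def __init__(self, threshold: float = 4.0):
        self.threshold = threshold
        self.mean = None
        self.std = None

    def fit(self, clean_activations: np.ndarray) -> None:
        # clean_activations: shape (n_samples, n_neurons)
        self.mean = clean_activations.mean(axis=0)
        self.std = clean_activations.std(axis=0) + 1e-8  # avoid div-by-zero

    def is_anomalous(self, activation: np.ndarray) -> bool:
        # Flag if any neuron deviates from its clean profile by > threshold sigmas.
        z = np.abs((activation - self.mean) / self.std)
        return float(z.max()) > self.threshold
```

The key design point, as the abstract argues, is that an in-distribution trigger can look benign in observation space while still driving the trojaned policy through atypical internal activations — which is exactly where this kind of monitor looks.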

Item Type: Thesis (PhD)
Date Type: Completion
Status: Unpublished
Schools: Schools > Computer Science & Informatics
Subjects: Q Science > QA Mathematics > QA75 Electronic computers. Computer science
Q Science > QA Mathematics > QA76 Computer software
Funders: EPSRC, Airbus
Date of First Compliant Deposit: 18 March 2026
Date of Acceptance: 11 March 2026
Last Modified: 23 Mar 2026 16:29
URI: https://orca.cardiff.ac.uk/id/eprint/185850
