Researchers Uncover Stealthy Backdoor Attack Targeting Large Language Models
TEHRAN (Tasnim) – Researchers at Saint Louis University have identified a new backdoor attack capable of manipulating large language models (LLMs) while remaining undetectable, raising concerns over AI security vulnerabilities.
LLMs, including those powering ChatGPT, are increasingly used worldwide for information retrieval, text analysis, and content generation. As these models advance, researchers are probing their limitations to strengthen their security.
Zhen Guo and Reza Tourani of Saint Louis University have developed DarkMind, a backdoor attack that exploits LLMs’ reasoning processes. Their findings, published on the arXiv preprint server, reveal a vulnerability in the widely used Chain-of-Thought (CoT) reasoning method.
"Our study emerged from the growing popularity of personalized AI models, such as those available on OpenAI’s GPT Store, Google’s Gemini 2.0, and HuggingChat," Tourani told Tech Xplore. "While these models offer greater autonomy and accessibility, their security remains underexplored, particularly regarding vulnerabilities in their reasoning process."
DarkMind embeds hidden triggers within customized LLM applications, allowing adversarial behaviors to remain dormant until specific reasoning steps activate them. Unlike conventional backdoor attacks, which manipulate user queries or require model retraining, DarkMind influences responses through intermediate reasoning steps.
"These triggers remain invisible in the initial prompt but activate during reasoning, subtly modifying the final output," said Guo, lead author of the study. "As a result, the attack remains undetectable under normal conditions."
Initial tests showed that DarkMind is highly effective and difficult to detect. Because it does not alter user queries but instead exploits the reasoning process itself, it remains effective across a range of language tasks. This poses risks for LLM applications in banking, healthcare, and other critical sectors.
"DarkMind affects multiple reasoning domains, including mathematical, commonsense, and symbolic reasoning," Tourani noted. "It remains effective on leading models like GPT-4o, O1, and LLaMA-3. Additionally, it can be deployed with simple instructions, increasing the risk of widespread misuse."
The researchers found that DarkMind is particularly effective against advanced LLMs, challenging the assumption that stronger models are more secure. Unlike existing backdoor attacks that require backdoored few-shot demonstrations, DarkMind operates without prior training examples, making it practical for real-world exploitation.
"Compared to state-of-the-art attacks like BadChain and DT-Base, DarkMind is more resilient and does not modify user inputs, making it significantly harder to detect and mitigate," Tourani added.
The study underscores a critical security gap in LLM reasoning capabilities. The researchers are now developing defense mechanisms, such as reasoning consistency checks and adversarial trigger detection, to counter DarkMind and similar threats.
"Our future research will focus on enhancing mitigation strategies and exploring additional vulnerabilities, including multi-turn dialogue poisoning and covert instruction embedding, to reinforce AI security," Tourani said.