AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

LLM agents have demonstrated remarkable performance across various applications, primarily due to their advanced capabilities in reasoning, utilizing external knowledge and tools, calling APIs, and executing actions to interact with environments. Current agents typically utilize a memory module or a retrieval-augmented generation (RAG) mechanism, retrieving past knowledge and instances with similar embeddings from knowledge bases to inform task planning and execution. However, the reliance on unverified knowledge bases raises significant concerns about their safety and trustworthiness. To uncover such vulnerabilities, we propose a novel red teaming approach AgentPoison, the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base. In particular, we form the trigger generation process as a constrained optimization to optimize backdoor triggers by mapping the triggered instances to a unique embedding space, so as to ensure that whenever a user instruction contains the optimized backdoor trigger, the malicious demonstrations are retrieved from the poisoned memory or knowledge base with high probability. In the meantime, benign instructions without the trigger will still maintain normal performance. Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning, and the optimized backdoor trigger exhibits superior transferability, in-context coherence, and stealthiness. Extensive experiments demonstrate AgentPoison's effectiveness in attacking three types of real-world LLM agents: RAG-based autonomous driving agent, knowledge-intensive QA agent, and healthcare EHRAgent. On each agent, AgentPoison achieves an average attack success rate higher than 80% with minimal impact on benign performance (less than 1%) with a poison rate less than 0.1%.

Related collections

Author and article information

Journal

Publication date Created: 17 July 2024

Article

ArXiV ID: 2407.12784

SO-VID: 9bd05ce7-fd97-4499-b580-d36deade00b7

License:

http://arxiv.org/licenses/nonexclusive-distrib/1.0/

History

Custom metadata

Comments 22 pages, 13 figures, 7 tables

Categories cs.LG cs.CR cs.IR

ScienceOpen disciplines: Information & Library science,Security & Cryptology,Artificial intelligence

Data availability:

ScienceOpen disciplines: Information & Library science, Security & Cryptology, Artificial intelligence

AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases

Read this article at

Abstract

Related collections

Semantic Knowledge Base

Author and article information

Journal

Article

History

Custom metadata

Comments

Comment on this article

Similar content 316