Content Injection Attacks represent a critical security vulnerability in systems that process large language models (LLMs) and automated agents. These attacks exploit the difference between human-readable content and machine-parseable information by embedding malicious instructions in locations invisible to human readers but fully accessible to LLM parsers. As LLM-based systems increasingly interact with diverse content sources—including web pages, PDFs, and user-provided documents—understanding and mitigating content injection attacks has become essential for responsible AI deployment 1).
Content injection attacks leverage a fundamental asymmetry in how humans and machines process information. While humans visually parse content based on visual rendering and layout, LLMs process the underlying structured data—HTML source code, markdown syntax, PDF metadata, and text encoding—without the same visual filtering mechanisms. This gap creates multiple attack vectors where malicious content remains hidden from human review but becomes fully actionable instructions to the model.
The attack surface is particularly broad because content injection can occur at multiple layers of the information stack. A document that appears benign when displayed to a human analyst may contain embedded instructions that persist through conversion processes, format transformations, and preprocessing steps. The adversary's goal is typically to manipulate the LLM's behavior by inserting instructions that contradict legitimate content, cause the model to ignore safety guidelines, or extract sensitive information that the legitimate application should not expose 2).
Several well-documented techniques enable content injection attacks across different content formats and rendering contexts.
White-on-White Text: Instructions can be embedded as text with matching foreground and background colors. When rendered visually, this content becomes invisible to human readers. However, when an LLM processes the HTML source or raw text, the instructions are fully visible and executable. This technique requires minimal sophistication but remains effective against systems that do not explicitly filter by visual appearance.
CSS-Hidden Divs: Malicious content can be placed within HTML elements with CSS properties that hide them from visual display—using `display: none`, `visibility: hidden`, or extreme positioning coordinates. When a user views the web page in a browser, these elements are not rendered. When an LLM fetches the page's HTML source, however, all divs are present and processable, regardless of their CSS styling.
Markdown Anchor Text Manipulation: Markdown documents support links with anchor text that differs from the actual URL. Attackers can place instructions in anchor text while using seemingly innocuous URLs, or vice versa. For example, `[legitimate text](instruction_url)` or `instruction_text(legitimate_url)` create a disconnect between what users expect and what LLMs process. When the LLM follows the link or processes the markdown structure, it encounters the hidden instruction.
PDF Embedded Content: PDFs support multiple representation layers, including invisible metadata, alternative text descriptions, and extremely small font text that survives OCR conversion. Attackers can embed instructions in text with font sizes of 0.1 points or smaller, which become invisible when rendered but remain present in the PDF's text layer. During PDF-to-text conversion processes that LLMs often use, these tiny-font instructions survive and become actionable 3).
Content injection attacks are particularly concerning in agent-based architectures where LLMs autonomously interact with external content sources. An AI agent that retrieves information from the web, processes user-uploaded documents, or aggregates content from multiple sources may inadvertently expose itself to injected instructions. If the agent architecture does not explicitly separate user-visible content from metadata or structural elements, the injection succeeds.
The attack is especially potent when combined with retrieval-augmented generation (RAG) systems, where LLMs incorporate external documents into their decision-making processes. A poisoned document in a knowledge base could influence the model's responses to legitimate user queries, potentially causing the system to reveal sensitive information, violate usage policies, or produce harmful outputs 4).
Effective defense against content injection attacks requires multi-layered approaches that address different attack surfaces.
Content Normalization: Before processing, systems should strip or explicitly filter hidden content. This includes removing CSS `display: none` properties, filtering elements by visual appearance rather than source code structure, and converting PDFs to clean text that excludes sub-visible content.
Explicit Content Filtering: LLM systems should implement parsing that distinguishes between primary content and metadata, instructions embedded in HTML attributes, and alternative representations. User-facing content and hidden content should be processed through separate code paths.
Input Validation and Sandboxing: Systems that process user-provided documents or retrieve external content should validate that embedded content aligns with expected data types and structures. Sandboxing environments can limit the impact of injected instructions by restricting what actions LLMs can take based on discovered content.
Model Robustness Training: Instruction-tuned models can be trained to recognize and reject injected instructions that contradict their primary objectives or violate safety guidelines. Training techniques such as adversarial examples help models become more resilient to unexpected content patterns.
Content injection attacks represent an emerging class of vulnerability that security researchers continue to characterize and mitigate. As LLM-based systems become more prevalent in enterprise and automated contexts, the practical impact of these attacks grows. Future work includes developing better detection mechanisms for hidden content, creating robust filtering standards for different content formats, and establishing industry practices for verifying content integrity in LLM applications.