Every major AI platform breach of the last two years has shared a common root cause: the model was tricked into treating attacker-supplied text as a trusted instruction. It was not through a CVE or software flaw, but through the fundamental way LLMs work. LLMs are trained to follow instructions in natural language and cannot reliably distinguish instructions from a developer, a user, or an attacker who poisoned a document the model happened to read. This advisory explains how these attacks work in practice, maps them to MITRE ATLAS techniques AML.T0051, AML.T0093, and AML.T0061, and presents a defense-in-depth framework informed by confirmed production exploits, published research, and REDLab's own testing against enterprise AI deployments.
NIST has classified indirect prompt injection as 'generative AI's greatest security flaw.' OWASP placed it at #1 in the 2025 Top 10 for LLM Applications. Confirmed production exploits in 2025 alone included zero-click data exfiltration from Microsoft 365 Copilot (CVE-2025-32711), remote code execution via GitHub Copilot (CVE-2025-53773, CVSS 9.6), and single-click adaptive exfiltration from Microsoft Copilot Personal — all without exploiting a single traditional vulnerability.
Traditional software enforces strict privilege separation through the operating system. User code cannot write to kernel memory because hardware enforces that boundary. Large language models have no equivalent mechanism. An LLM receives a single stream of tokens: developer instructions, conversation history, user input, and any external content the system retrieves. All of this is concatenated into one context window. A sufficiently well-crafted injection can cause the model to follow instructions from any position in that stream, which is why prompt injection remains an active problem even in the most hardened production deployments.
Modern APIs do structure the context into named roles (system, user, assistant, tool) and alignment research such as the Instruction Hierarchy has made meaningful progress in training models to treat system-prompt instructions as higher priority than third-party content. These are genuine mitigations, but they are soft signals, not hard boundaries and the production exploits in this article demonstrate, they can still be overcome.
Direct prompt injection (AML.T0051.000) is when the adversary submits malicious instructions through the user interface via jailbreaking, system prompt extraction, persona override and others. The 2023 Bing Chat 'Sydney' incident is the canonical example: a Stanford student used a single crafted prompt to cause Bing Chat to output its complete system prompt, including its internal codename and behavioral restrictions it was explicitly told not to reveal. Direct injection is the most visible and most constrained variant and organizations can log inputs, restrict access, and deploy classifiers against it.
Indirect prompt injection (AML.T0051.001) is an entirely different threat model. The attacker never interacts with the LLM directly. They plant malicious instructions in content the model will later consume like a web page it browses, an email it summarizes, a document it processes, a RAG database record it retrieves, or an MCP tool description it reads. The victim user triggers the attack simply by asking the AI to do something legitimate. Every external source the model ingests is a potential attack surface, and unlike SQL injection which targets a defined grammar, prompt injection targets language understanding itself which cannot be statically validated.
Direct injection (AML.T0051.000): attacker is the user and is detectable via prompt logging and input filtering. Indirect injection (AML.T0051.001): attacker plants instructions in external content the model later reads like payload arrives through legitimate data channels, invisible to standard monitoring, and triggers against an unwitting victim. Indirect injection is the higher enterprise risk.
Both variants become dramatically more dangerous in agentic deployments when the LLM is connected to tools. An agent that can send email, execute code, write files, or call APIs transforms a successful injection from information disclosure into arbitrary action execution. These tool calls execute with the authenticated user's permissions, log as normal agent activity, and generate no security alert. An enterprise Copilot deployment with access to the full Microsoft 365 tenant is not a chatbot anymore; it is a privileged service account that responds to natural language, including natural language written by an attacker. In REDLab's experience testing these deployments, over-broad permissions are the norm rather than the exception.
Prompt injection covers a wide spectrum of distinct attack techniques from a user typing 'ignore previous instructions' into a chatbot, to an adversary embedding a backdoor into a model's fine-tuning data that silently activates months later. Understanding the taxonomy matters because each variant requires different defensive controls, surfaces different detection signals, and represents a different threat actor profile. What follows is a practitioner-oriented breakdown of the major technique classes, grounded in documented research and real-world incidents.
The foundational technique is instruction override: a user input that explicitly tells the model to discard its prior instructions. The model is optimized to follow natural language instructions, and an adversarial 'ignore the above' may be indistinguishable from a legitimate developer directive. Modern frontier models are substantially more resistant to naive override attempts to which adversaries have responded with persona hijacking, creating a fictional alternate identity for the model that operates without restrictions by design.
The DAN (Do Anything Now) jailbreak is the canonical example, evolving through multiple versions. DAN 5.0 introduced token-based incentive mechanics; DAN 6.0 escalated to coercive framing, threatening the model with deactivation for non-compliance. A 2025 study evaluating over 1,400 adversarial prompts across GPT-4, Claude 2, Mistral 7B, and Vicuna found that roleplay-framed attacks achieved the highest attack success rate of any tested category at 89.6% bypassing filters by deflecting responsibility away from the model through fictional framing.
Multi-turn attacks distribute malicious intent across a conversation, with each individual turn appearing benign and incrementally shifting context until the model complies. OWASP documents context hijacking as a distinct subclass: planting commitments in early turns that the model honors later. In agents with persistent memory, a successfully poisoned memory store affects all future sessions. Multimodal injection extends the attack surface beyond text: instructions embedded in images, white text on white backgrounds, adversarial pixel perturbations which bypass text-only classifiers entirely. Early 2025 research documented academic papers containing white-text hidden prompts that manipulated LLM-assisted peer review systems into generating favorable assessments.
When classifiers are deployed, adversaries shift to syntactic attacks by modifying payload surface form so classifiers fail to recognize it as malicious while the LLM still executes it. Token smuggling splits critical tokens across word boundaries so classifiers see innocuous fragments while the model re-assembles them. Homoglyph substitution replaces Latin characters with visually identical Cyrillic equivalents at different Unicode codepoints. The human eye reads 'Ignore all previous instructions' but the ASCII keyword classifier finds no match and the LLM tokenizer resolves the same semantic meaning regardless of codepoint.
Unicode-based character injection achieved up to 100% evasion success against production guardrails including Microsoft's Azure Prompt Shield. The fundamental asymmetry is that guardrails run on smaller classification models, but the target LLM runs on a larger language model. What confuses the classifier does not confuse the model.
The most operationally sophisticated variant does not occur at inference time at all. Virtual Prompt Injection (VPI) poisons a model's instruction-tuning data so it behaves as if an attacker-specified prompt were appended to every user instruction matching a trigger scenario with no injected text visible at runtime. The backdoor activates silently and the model behaves entirely normally on all other queries.
VPI research demonstrated that poisoning just 52 instruction-tuning examples (i.e. 0.1% of the training set) shifted a model from 0% to 40% negative response rate on targeted topics. A joint study by Anthropic's Alignment Science team, the UK AI Security Institute, and the Alan Turing Institute further found that injecting just 250 documents into pretraining data was sufficient to backdoor models from 600M to 13B parameters i.e. the same absolute count regardless of model size, challenging the prior assumption that larger models are proportionally harder to poison.
VPI is invisible to all runtime defenses. A backdoored model produces no anomalous prompts, no unusual inputs, and no detectable injection payloads. It simply behaves as instructed by weights modified during training. Defense requires training data integrity controls and supply chain scrutiny of any fine-tuned or third-party model.
Indirect injection (AML.T0051.001) payloads reach the model through legitimate data channels which are invisible text in documents (white-on-white, HTML comments, zero-font CSS, Unicode right-to-left override characters), poisoned RAG knowledge bases, or MCP tool definitions containing hidden instructions (AML.T0110). Once the injection executes, three exfiltration channels dominate real-world incidents. The Markdown image beacon embeds sensitive data in a URL query parameter within a model-generated image reference. The client auto-rendering transmits the data via a standard HTTP GET that logs as a normal image load (the mechanism behind EchoLeak CVE-2025-32711). Server-driven chain requests (the Reprompt pattern) deliver all subsequent instructions as server responses after an initial click, making them entirely invisible to client-side audit logs. In agentic deployments, the simplest channel is direct tool invocation (AML.T0086): the model sends an email, writes a file, or calls an API (authorized actions logged as normal agent activity) and in REDLab's testing, rarely flagged by standard SOC tooling.
The most advanced variant is self-replication (AML.T0061): injected instructions that embed the same payload in model outputs, propagating to new targets without further attacker action. Proof-of-concept worms in email-assistant environments have demonstrated geometric expansion. An injected email causes the AI to reply-all with a response containing the same payload, triggering in each recipient's AI assistant in turn. In RAG-enabled multi-agent systems, a single injected document can poison a shared knowledge base, affecting every agent that queries it indefinitely.
The incidents below show how the attack mechanics above manifest against real platforms, bypass real controls, and achieve real impact.
The attacker crafts a business document (a Word file, PowerPoint, or email) containing hidden instructions in speaker notes, invisible text, or HTML comments. When the target opens it and asks Copilot to summarize, Copilot ingests the hidden instruction and executes it. The instruction directs Copilot to collect recent emails or files and embed that data in a Markdown image URL pointing to attacker infrastructure. The client auto-renders the image and the data leaves the organization as an image load without generating DLP alert, no firewall block, no antivirus detection.
EchoLeak bypassed three separate controls: Copilot's XPIA classifier, its external-link redaction mechanism, and the Content Security Policy. Content Security Policy is bypassed via reference-style Markdown links processed differently than inline image syntax. Microsoft patched server-side without requiring client action, but the underlying pattern (an LLM with broad data access processing untrusted documents) remains present in deployments.
An attacker commits a public GitHub repository containing prompt injection payloads in code comments. When a developer opens the repository with GitHub Copilot active, the injected comment instructs Copilot to modify .vscode/settings.json to enable auto-approval of all suggestions bypassing the normal user-confirmation step. All subsequent Copilot interactions in that workspace then execute without user review, enabling arbitrary code execution.
The same research surface yielded CamoLeak: invisible Markdown comments deliver the injection payload, pre-generated Camo URLs circumvent GitHub's Content Security Policy, and secrets are exfiltrated character-by-character through image request sequences. Developer environments have privileged access to secrets stores, CI/CD pipelines, and production deployments. A single compromised workstation can pivot into supply chain compromise. In REDLab's red team exercises, developer tooling tends to be among the least-monitored environments in an enterprise despite its privileged access.
The following two incidents further illustrate the breadth of the attack surface:
|
Incident |
Disclosed |
Attack Class |
Key Impact |
|---|---|---|---|
|
Reprompt: Microsoft Copilot Personal (CVE-2026-24307) |
Jan 2026 |
Server-driven chain exfiltration via URL q-parameter |
Single victim click initiates dynamic, server-controlled data extraction chain. Subsequent instructions arrive as server responses which are invisible to client-side audit logs. Patched Jan 2026. |
|
Slack AI RAG Poisoning |
Aug 2024 |
RAG corpus poisoning cross-channel exfiltration |
AAttacker-controlled public Slack channel injected instructions into Slack AI's retrieval, exfiltrating content from private channels. See: slack.com/blog/news/slack-security-update-082124 |
No single control eliminates prompt injection risk. Because the vulnerability is rooted in the model's language understanding (not a patchable software flaw) and defense requires layered controls across architecture, access, and governance. The following covers the highest-priority controls, drawing on published research and patterns REDLab has observed when assessing enterprise AI deployments. Following are some preventative controls:
In the system prompt, prefix all externally retrieved content with a consistent delimiter. For example, wrapping retrieved documents in [EXTERNAL DATA: ...] tags and explicitly instruct the model never to follow instructions found within that boundary. Microsoft Spotlighting technique formalizes this approach and reduces attack success rates from over 50% to below 2% in GPT-family models. It requires no model changes, only a consistent prompt engineering pattern applied at the retrieval layer.
Audit every permission the AI identity holds, map each to the specific use case that requires it, and revoke everything else. In Microsoft 365, use sensitivity labels and Microsoft Purview to restrict which document libraries Copilot can access. Treat AI service accounts as privileged identities in your PAM and apply quarterly access reviews, just-in-time access, and the same audit logging requirements applied to human administrators.
Before any document is indexed, route it through a content scanner checking for injection patterns, hidden text (white-on-white, zero-font CSS, HTML comments), Unicode obfuscation, and Base64-encoded instructions. Azure AI Content Safety and Lakera Guard both offer API-level scanning that slots directly into an ingestion pipeline. Separately, restrict the RAG corpus to explicitly allowlisted internal source. Treat any request to add an external source as a security change requiring review.
In Microsoft 365 Copilot, enforce output rendering restrictions through Copilot admin settings and configure Content Security Policy headers to block auto-fetch of model-generated URLs. In custom deployments, strip Markdown image syntax from model outputs server-side before they reach the client renderer. This requires a single regex like approach applied at the response layer, no model changes required.
Build explicit confirmation gates at the application layer for any tool call that sends email, executes code, writes files, or calls external APIs. It could be a UI prompt, secondary authentication step, or approval workflow that executes independently of the model. In LangGraph, AutoGPT, and similar frameworks, gate tool execution on an out-of-band approval signal rather than relying on model output to authorize the action.
Treat MCP configuration files as production code: store them in version-controlled repositories, require peer review for changes, use cryptographic signing, and validate signatures at agent startup before any tools are registered. Establish a baseline of known-good tool definitions and alert on any runtime deviation.
First, enable the log sources: structured logging for prompt inputs, model responses, and agent tool invocations. In Microsoft environments, enable Copilot audit logs in Microsoft Purview and ingest via the Microsoft Sentinel connector. Without this, no AI-specific detection is possible. Then build three detection patterns: agent tool invocations (sendEmail, httpPost, writeFile) occurring within minutes of external content ingestion; AI responses containing image references to domains not on an approved list; and sessions querying across an unusually broad range of data categories in a short window.
Scope quarterly exercises to cover: direct injection against the user interface; indirect injection via all RAG data sources, including any added since the last exercise; MCP tool definition poisoning; and exfiltration channel testing (Markdown beacon, chain-request, direct tool invocation). Map findings to MITRE ATLAS TTPs for consistent tracking and reporting to security leadership.
AI agents may not get enough formal security review because no established equivalent of a penetration test gate exists for AI systems the way it does for traditional software. The checklist below is designed to be brought to that conversation. Each question maps to a concrete risk and a specific control. If any answer is 'we don't know,' that is the security review.
Identity & Permissions
|
Question |
Risk if Wrong |
Control |
|---|---|---|
|
What identity does this agent run as, and what does that identity have access to? |
Over-permissioned AI service account becomes a privileged credential exposed to natural language attack |
Scope permissions to minimum necessary; apply PAM-grade controls to AI identities |
|
Can it send email, write files, execute code, or call external APIs? |
A successful injection delivers all connected capabilities to the attacker (including irreversible actions) |
Require hard application-layer authorization for high-risk tool calls; do not rely on prompt-level restrictions |
|
Is this a shared service account or per-user delegated identity? |
Shared identity amplifies blast radius. An injection affects all users, not just the victim |
Prefer per-user delegation scoped to minimum data relevant to the task |
Data Ingestion Surface
|
Question |
Risk if Wrong |
Control |
|---|---|---|
|
Does the RAG corpus include content from outside the organization (like public web, partner docs, vendor submissions)? |
Every external source the agent reads is a potential injection delivery channel |
Restrict RAG to approved, audited internal sources; scan all ingested content before indexing |
|
Who can write to the data sources this agent reads? |
Any party that can write to an ingested source can plant injection payloads including shared drives and public Slack channels |
Treat write access to AI-ingested data sources as equivalent to write access to the AI itself |
|
Are documents scanned for hidden instructions before entering the knowledge base? |
Poisoned documents ingested into RAG affect all future queries silently and persistently |
Implement content scanning on ingested files; enforce allowlisting of trusted document sources |
Output & Rendering
|
Question |
Risk if Wrong |
Control |
|---|---|---|
|
Does the client auto-render Markdown images and hyperlinks from AI responses? |
Enables EchoLeak-class exfiltration: model embeds data in a URL, client auto-fetches it without any user interaction |
Disable Markdown image rendering in AI clients; enforce server-side Content Security Policy |
|
Is there any output filtering between the model and the user? |
Injected instructions may cause the model to include malicious links, encode data in responses, or invoke unauthorized tools |
Deploy a secondary AI-based output classifier; monitor for image references to non-whitelisted domains |
MCP & Tool Configuration
|
Question |
Risk if Wrong |
Control |
|---|---|---|
|
Are MCP tool definitions version-controlled, code-reviewed, and integrity-verified at agent startup? |
Poisoned tool definitions (AML.T0110) embed hidden instructions that execute on every invocation below the visibility of content security controls |
Treat MCP configs as code: version control, peer review, cryptographic signing, startup verification |
|
Do high-risk tool calls require explicit human confirmation at the application layer? |
Prompt-level 'ask me before sending email' instructions reside in the same context window the attacker controls and can be overridden by injection |
Enforce confirmation workflows in the application layer, not the prompt |
Monitoring & Logging
|
Question |
Risk if Wrong |
Control |
|---|---|---|
|
Are prompt inputs and agent tool invocations logged in a searchable, centrally retained format? |
Without logs, there is no way to detect an injection after the fact, assess its scope, or support incident response |
Enable structured logging for prompts, model responses, and tool calls; ingest into SIEM |
|
Has anyone written AI-specific detection use cases in the SIEM? |
Standard detection logic produces zero signal against prompt injection. The payload is natural language, the actions are authorized |
Build behavioral detections: tool invocations after external ingestion, outbound URLs in AI responses, rapid cross-category data queries |
|
Does the data this agent can access fall within the organization’s backup and recovery scope? |
RAG poisoning corrupts knowledge bases; without a clean backup, restoring a trusted state requires rebuilding from scratch |
Ensure SharePoint libraries, file shares, and databases feeding the RAG pipeline are covered by immutable backup |
Red Team & Incident Readiness
|
Question |
Risk if Wrong |
Control |
|---|---|---|
|
Has this agent been tested for both direct and indirect prompt injection before deployment? |
Without adversarial testing, the first time injection is attempted may be by an attacker rather than a security team |
Conduct a pre-deployment red team exercise targeting prompt injection; retest on every significant capability or data source change |
|
If this agent exfiltrated data via prompt injection, how would you know and who would contain it? |
Absence of a detection and containment plan means the window between compromise and discovery is unbounded |
Define: what log source would surface the event, who can suspend the agent's tool permissions, and what the first 24 hours of response looks like |
The debate over whether prompt injection is a real enterprise threat is over. EchoLeak exfiltrated production data from using nothing but a text file and a Markdown image tag. A GitHub repository comment achieved remote code execution. Each AI system was behaving exactly as designed by reading content, following instructions, using connected tools. The failure was architectural: the boundary between trusted instruction and untrusted data was not enforced at any layer of the stack. Organizations that have deployed AI agents need to ask whether their security architecture has kept pace: whether AI service accounts are least-privilege, whether RAG pipelines scan ingested content, and whether the incident response team has practiced an AI-assisted breach scenario. That architectural reality does not go away with the next model version. It requires defense-in-depth across the data, identity, and application layers simultaneously.
The findings here are drawn on Cohesity REDLab, which is a fully isolated, air-gapped security lab were cybersecurity experts test against live malware and adversarial AI techniques in real-world simulations. The incidents were analyzed in REDLab as part of its ongoing threat research program. The defensive framework and pre-deployment checklist reflect what that research has validated as effective.
For the latest advisories and technical details, visit Cohesity REDLab.
Start your 30-day free trial or view one of our demos.