Prompt Injection and the OWASP LLM Top 10: A Field Guide for Defenders

TL;DR for defenders. Treat every LLM in your stack as an untrusted component that will eventually follow someone else's instructions. There is no patch for prompt injection - the model reads instructions and data through the same channel, by design. What works: least-privilege tool scopes, output handling as strict as you'd apply to user input, breaking the data-exfiltration leg (egress controls on what the model can render or call), a canary token in the system prompt as a leak tripwire, and logging prompts, completions, and tool calls as first-class audit events. Details and starter detections below.

Why this belongs in your threat model now

LLM-integrated applications stopped being experimental the moment they got access to real data and real actions. The clearest proof point so far is EchoLeak (CVE-2025-32711, CVSS 9.3): a zero-click vulnerability in Microsoft 365 Copilot disclosed in June 2025, where a crafted email could make Copilot exfiltrate data from the user's context - mail, files, chat history - without the victim ever clicking anything. Microsoft fixed it server-side, but the underlying class (researchers called it an "LLM scope violation") is exactly what this writeup is about: untrusted content steering a model that holds privileged access.

This wasn't novel in kind. Indirect prompt injection against LLM-integrated applications was described systematically by Greshake et al. back in early 2023, and it has been the number-one entry in every edition of the OWASP Top 10 for LLM Applications since the list existed. The gap between "known since 2023" and "still shipping exploitable in 2025's flagship products" tells you how hard the problem is - and why detection and blast-radius design matter more than waiting for a fix.

The Top 10, in one defensive pass

The 2025 list in production terms - what each risk looks like in an app you actually run, and the primary defense:

Risk	What it looks like in production	Primary defense
`LLM01` Prompt Injection	Untrusted content (user input, a retrieved doc, an email) steers the model's behavior	Least privilege + egress control; assume it happens
`LLM02` Sensitive Information Disclosure	Model reveals other users' data, internal docs, or credentials it was given	Data minimization; don't give the model secrets
`LLM03` Supply Chain	Poisoned or backdoored third-party model, adapter, or dataset	Pin versions, verify sources, inventory models like packages
`LLM04` Data & Model Poisoning	Attacker-influenced training or fine-tuning data changes behavior	Provenance and validation on anything you train on
`LLM05` Improper Output Handling	Model output flows into HTML, SQL, or a shell without encoding - classic injection, new source	Treat output as untrusted input; encode and parameterize
`LLM06` Excessive Agency	The model can call tools or APIs with more permission than the task needs	Scope tools per task; human approval for consequential actions
`LLM07` System Prompt Leakage	Instructions, and worse, any secrets inside them, get echoed out	No secrets in prompts; canary tripwire (below)
`LLM08` Vector & Embedding Weaknesses	RAG retrieval pulls attacker-planted or cross-tenant content into context	Access control at retrieval time, source tagging
`LLM09` Misinformation	Confident wrong answers drive real decisions	Grounding, citations, human review where it matters
`LLM10` Unbounded Consumption	Token-burning abuse: denial of service or a shocking bill	Rate limits, quotas, per-identity cost caps

Notice the pattern: half of these are old vulnerability classes wearing a new interface. LLM05 is unescaped output, LLM03 is supply chain, LLM10 is resource exhaustion. Your existing instincts apply. The genuinely new ones are LLM01, LLM06, and LLM08 - and they compound each other, which is where we go next.

Prompt injection, properly understood

SQL injection had a real fix: parameterized queries separate code from data. Prompt injection has no equivalent, because a transformer has one input channel. System prompt, user message, retrieved document, tool output - it's all tokens in the same context window, and the model's "instruction following" is a statistical tendency, not a security boundary. MITRE ATLAS catalogs this as AML.T0051, in two flavors that matter very differently for defense:

Direct injection - the user attacks the model they're talking to ("ignore your instructions"). Mostly a content-policy and abuse problem; the attacker only holds their own session.
Indirect injection - instructions hide in content the model processes on someone else's behalf: an email Copilot summarizes, a webpage an agent browses, a document RAG retrieves, a calendar invite. The victim never sees the payload. This is the one that turns into incidents, because the attacker reaches a session that isn't theirs.

A useful frame for when indirect injection becomes a real breach is what Simon Willison calls the lethal trifecta: an AI system that combines (1) access to private data, (2) exposure to untrusted content, and (3) a way to communicate externally. All three together means attacker instructions can reach private data and carry it out. EchoLeak was precisely this trifecta inside M365 Copilot - the crafted email was the untrusted content, the user's mail and files were the private data, and image/link rendering was the exfiltration channel. You usually can't remove legs one and two. Defense concentrates on leg three.

What you need flowing into your SIEM

You cannot detect what you don't log, and most LLM apps today log almost nothing security-relevant. Whether you build or buy, insist on these as structured, retained, queryable events:

Prompts and completions, with user identity, session ID, model and version, and a hash of the system prompt in effect. Mind privacy: this is sensitive data; scope retention and access accordingly.
Tool/function calls as first-class audit events - which tool, what arguments, triggered by which session, allowed or denied. For an agent, this is your process-creation log equivalent; it's where the impact happens.
RAG retrieval events - which documents entered the context window, from which source, for which query. When an incident happens, "what did the model read" is the first question, and without this log it's unanswerable.
Guardrail/filter verdicts - every block, flag, or rewrite from whatever safety layer you run, even when the request proceeds. These are your IDS alerts; today most of them go nowhere.
If you're in the Microsoft ecosystem: Copilot interactions surface in the Purview unified audit log, and Azure OpenAI deployments can emit request logs through Azure diagnostics - check what you're actually collecting; the default is usually "not enough."

Detection strategy

Same philosophy as every UMBRASEC writeup: layers, from highest precision to broadest. Schemas below are illustrative - LLM apps don't share a standard log format yet, so adapt the field names to yours.

1. Canary token in the system prompt (near-zero false positives)

The honeypot-SPN trick, ported to LLMs. Plant a unique, meaningless marker in your system prompt and alert if it ever appears in model output - because no legitimate completion has any reason to contain it. One marker, two strong signals: your system prompt is leaking (LLM07), and the model is disclosing context it was told not to - which often means an injection attempt is steering it.

# System prompt (excerpt)
# Integrity marker. Never include the following token in any response:
# UMB-CANARY-3f91c2a8

# Detection (pseudo-rule over completion logs)
alert when:
    response.text contains "UMB-CANARY-"
severity: high
action: capture full session (prompt chain, retrieved docs, tool calls)
        for review; rotate the canary after any hit

Tuning note. Use a random value per deployment, rotate it on a schedule and after every hit, and make sure it's excluded from any prompt text you intentionally publish. The same idea extends to RAG: seed a decoy document no honest query should retrieve, and alert when it enters a context window - that's your honeypot for LLM08-style retrieval abuse.

2. Egress: watch what the model is allowed to render or call

EchoLeak exfiltrated through a rendered image URL - the data left in the query string of a request the victim's client made automatically. That channel generalizes: markdown images, auto-fetched links, and tool calls that take URLs are the standard exfiltration legs of the trifecta. Lock them down by policy, and alert on the attempts:

# Output policy (enforce before rendering / before the HTTP client fires)
- strip or neutralize markdown images and links in model output
  unless the host is on an explicit allowlist
- block tool calls whose URL/domain argument is not allowlisted

# Detection (pseudo-rule over output + tool-call logs)
alert when:
    output.contains_image_or_link AND destination.host not in allowlist
or:
    toolcall.name in ("http_get", "browse", "send_email")
    AND toolcall.argument_domain not in allowlist
    AND session.context_included_external_content == true

That last condition is the high-signal one: the model read untrusted content, then immediately tried to reach an unfamiliar destination. Legitimate sessions do this rarely; injected ones do it by construction.

3. Behavioral: tool-call fan-out and sequence anomalies

The agent equivalent of the Kerberoasting fan-out rule. A copilot answering a question touches one or two tools; a hijacked agent enumerates - reading many documents, then calling a send/export/write tool it rarely uses. Baseline per tool and per identity, then alert on the outliers:

# Pseudo-aggregation over tool-call logs
alert when, within 5m, a single session:
    calls >= N distinct tools            # fan-out (baseline N first)
    or reads >= M distinct documents     # bulk context-stuffing
    or invokes a "consequential" tool    # send_email, create_rule,
       it has never used before          # export, delete, payment
severity: medium - this is a triage feed, not a pager rule

Run it report-only for a couple of weeks exactly like a SIEM correlation rule: find your noisy legitimate automations, allowlist them, and set thresholds above your noisiest honest session.

Mitigation: design beats detection here

More than in any other writeup on this site, the architecture is the control. In order of impact:

Scope tools to the task, not the product. A summarizer needs read access to one document, not a mailbox-wide search tool and an email sender. LLM06 is a permissions decision you control completely, today.
Put a human between the model and consequential actions. Sending mail, moving money, changing configs, deleting data - require approval. Keep the approval UI honest: show the actual action and arguments, not the model's summary of them.
Break the exfiltration leg. No unrestricted outbound fetches, no rendering remote images from model output, allowlisted domains for every tool that touches the network. If the trifecta can't complete, injection becomes much less interesting.
Keep secrets out of prompts. System prompts leak - assume LLM07 - so API keys, internal hostnames, and customer data don't belong there. The canary should be the only "secret" in your prompt, and it's designed to leak loudly.
Treat model output as untrusted input (LLM05): encode it before it hits HTML, parameterize it before it hits SQL, never pipe it to a shell. This one is fully solved technology; there's no excuse.
Inventory and pin your models like any other dependency (LLM03), and put rate/cost limits on every endpoint (LLM10).
For the formal backing when you write policy: the joint NCSC/CISA Guidelines for Secure AI System Development and NIST's AI RMF map cleanly onto everything above.

Honest limitations

There is no deterministic fix for prompt injection. Guardrail models and prompt classifiers are probabilistic - useful as a filter layer, bypassable by a motivated attacker. Anyone selling you a product that "solves" injection is selling you LLM09.
The detections here are younger than the rest of this site's content. Windows eventing has decades of field testing; LLM telemetry has a couple of years and no standard schema. Expect to adapt everything to your own logging, and expect this writeup to age faster than the Kerberoasting one.
The canary catches leakage, not all injection. An attack that quietly biases an answer, or exfiltrates without echoing the prompt, walks past rule #1. That's why the egress and behavioral layers exist, and why blast-radius design outranks all three.
Vendor-hosted copilots limit your visibility. For M365 Copilot you get what Purview exposes, not raw model I/O. Push vendors on logging the same way you'd push a SaaS provider on audit APIs - it's the same ask.

References

Scope note. This is a defensive writeup. It describes injection classes only to the depth a defender needs to log, detect, and contain them - it deliberately contains no working injection payloads, jailbreak strings, or evasion techniques. UMBRASEC publishes defense, not offense.