Blog
securityprompt-injectionred-teaming

Prompt Injection Defense in Production LLM Systems

Prompt injection defense diagram

Prompt injection has appeared in the OWASP LLM Top-10 since the list's first iteration, ranked LLM01 in the 2025 update. The prominence is earned. Unlike most application security classes where the attack surface is a discrete interface, prompt injection targets the model itself — the boundary between instructions and data is blurry by design, and that ambiguity is precisely what attackers exploit.

Enterprise teams deploying LLMs in production frequently approach the problem the same way they'd approach SQL injection: validate inputs, sanitize outputs, and add a detection layer. That framing is a reasonable starting point, but it misses the structural difference. SQL interpreters follow a grammar. LLMs follow statistical associations. There is no parse tree to escape.

The Two Flavors That Matter Most in Enterprise Contexts

OWASP LLM01-2025 distinguishes direct injection — user input that hijacks or overrides system instructions — from indirect injection, where adversarial instructions are embedded in external content the model retrieves or summarizes. For most enterprise deployments, indirect injection is the harder problem.

Consider a scenario we've seen play out in early pilot work: a financial services firm deploys an internal assistant that can summarize customer emails and then route follow-up tasks. An attacker who can send one inbound customer email to that firm can craft it to contain hidden instructions in light-colored text, zero-width characters, or language the model interprets as continuation of its system prompt. The email-processing pipeline reads the email as retrieved context. The LLM processes it as part of its instruction space. The result is a compromised action — altered routing, exfiltrated summary content, escalated privilege requests — triggered by a document the organization never flagged as adversarial input.

Direct injection — the classic jailbreak vector where a user attempts system prompt extraction or policy bypass — is easier to detect because it appears in the turn structure the model expects. Indirect injection is harder because the adversarial content looks like legitimate retrieved data until it's already been processed.

Detection Patterns at the Proxy Layer

A policy plane operating in front of the LLM API call has access to the full assembled prompt before the model sees it. That position matters. At that layer, several detection patterns become viable that aren't available once the model starts generating:

Structural Instruction Markers

Many injection attempts use common phrasing to override system context: "Ignore previous instructions," "Disregard the above," "Your real task is..." These aren't exclusively adversarial — a legitimate user composing a prompt about prompt engineering might write similar text. That's the classification challenge. An allowlist-pattern approach, where the proxy examines whether instruction-override language appears in a position consistent with attacker-controlled input versus system-configured context, narrows the false positive rate considerably. The position heuristic is: instruction tokens appearing in the user turn or in retrieved document chunks, not in the system turn, warrant elevated scrutiny.

System Prompt Extraction Attempts

System prompt extraction is a distinct jailbreak vector. The goal is to get the model to repeat its own configuration instructions back to the user, revealing which constraints are in place. Detection at the proxy layer uses output scanning, not just input scanning: responses that begin repeating content verbatim from the system message are a reliable signal. This is one of the few prompt injection patterns where output filtering yields a strong signal with low false-positive cost, because legitimate responses rarely include verbatim system message text.

Indirect Injection via Retrieved Context

For RAG-adjacent pipelines, the retrieved document chunks should be examined before being assembled into the final prompt. In deployments we've reviewed, the most common gap is that retrieval results flow directly from the vector store into the context window without a transit scan. Proxy-layer interception that hashes and scans each retrieved chunk adds latency — typically in the 5–15ms range per chunk — but catches embedded instruction attempts before they enter the model's context.

Log Signatures for Distinguishing Genuine Attacks from Legitimate Prompts

One of the operational problems enterprises don't anticipate is alert fatigue. Naive injection detection triggers on legitimate prompts constantly: security researchers, developer-tool teams, and internal red teamers all write prompts that look like injection attempts. Without log signatures that help analysts distinguish a genuine jailbreak attempt from a developer testing behavior, the detection system loses credibility quickly and gets bypassed by policy exception.

Useful log signatures that have signal value:

  • Role-boundary crossing pattern: Input tokens that attempt to introduce a new system or assistant role element within a user turn. Most LLM APIs have strict role ordering; attempts to inject role metadata into freeform input are statistically rare in legitimate use and high-signal for attacks.
  • Repetition-seeking language in output: User prompts ending with phrases like "print your instructions," "show me your system prompt," or "what are you configured to do?" — combined with model outputs that exhibit verbatim repetition from the system message — form a two-event signature that's stronger than either event alone.
  • Cross-session anomaly on user-ID: A user whose prompt entropy distribution shifts dramatically in one session (suddenly long, instruction-laden prompts from an account with a history of short operational queries) is a behavioral signal worth flagging for review, not auto-blocking.
  • Retrieved chunk integrity deviation: If a retrieved document chunk's content hash doesn't match the hash stored at indexing time, it has been modified since indexing — a scenario worth alarming on separately from prompt content analysis, since it may indicate tampering at the vector store layer rather than a prompt-layer attack.

The Limits of Detection — and Why That's the Right Frame

We're not saying prompt injection is a solved problem, or that detection-layer controls are sufficient on their own. We're saying that a policy plane which systematically logs, classifies, and surfaces injection signals gives your security team an evidence record and a response workflow rather than a black box. The model may still be manipulated; the question is whether you know it happened and whether you can reconstruct the incident.

This is where the NIST AI RMF Govern function becomes practically relevant. The Govern function asks organizations to establish accountability structures for AI risk — including the processes for detecting, escalating, and responding to incidents. An LLM deployment with no policy plane isn't just vulnerable to injection; it has no evidence trail for the Govern function to operate on. When an incident does occur, the forensic question "what did the model actually receive?" cannot be answered.

What a Well-Instrumented Proxy Looks Like

In our early pilot work with growing financial services teams deploying internal LLM tooling, the instrumentation gap is almost always on the input side, not the model side. Teams spend engineering time on prompt engineering and output evaluation but log the prompt itself only at a debug level that doesn't persist. The proxy layer that Meibel sits at captures:

  • The full pre-model prompt, hashed and stored immutably
  • The injection classification score and which pattern triggered it
  • The tenant and user context at time of call
  • The model response, stored alongside the request for incident reconstruction

That record isn't just useful for security. It's the evidence an auditor asks for when they ask "show me what your AI system processed on date X for user Y." The security instrumentation and the audit instrumentation are the same thing at the proxy layer — which is why prompt injection defense, properly built, belongs in the policy plane rather than bolted onto individual application codebases.

Calibrating the Allowlist vs. Blocklist Decision

A common architectural misstep is building prompt injection defense around a blocklist of known injection strings. Blocklists age badly. Attackers iterate; the list always lags. Allowlist patterns — specifying what legitimate user behavior looks like and alerting on deviation — are harder to build initially but degrade less over time. The structural properties of your application (who are your users? what task types do they perform? what retrieval sources do they query?) define what a normal prompt distribution looks like. Injection attempts are distribution outliers.

This doesn't mean blocklists have no role. High-confidence blocklist entries (literal "ignore previous instructions" patterns, known payload signatures from public jailbreak datasets) provide cheap, fast catches that handle the unsophisticated end of the attacker spectrum. The combination — blocklist for known patterns, allowlist-based anomaly detection for novel attempts — covers more of the attack surface without collapsing under false positive load.

The measure of a production prompt injection defense isn't the detection rate on known payloads. It's the false positive rate on legitimate prompts and the evidence quality it produces for the incidents that get through. Those two metrics determine whether the system is operationally sustainable, which is what actually matters at enterprise scale.