SSE Redaction: Defending LLM System Prompts in Streaming Architectures
By Better ISMS — February 2026
If you're building a product on top of an LLM, your system prompt is your product logic. When someone extracts it, they get your reasoning, your guardrails, your competitive edge — everything. And if you're streaming responses via Server-Sent Events (which you probably are), defending against extraction is harder than you think.
This post describes SSE redaction, a technique we built for ISMS Copilot to detect and neutralize system prompt leaks mid-stream. We're sharing the architecture so others building LLM products can implement something similar.
The Problem
Most LLM applications stream responses to the client chunk by chunk using SSE. Each chunk is sent the moment it's generated. There's no "review the full response before sending" step — that would defeat the purpose of streaming.
This creates a security gap: if a jailbreak prompt convinces the model to dump its system instructions, the content is already flying to the client before you can stop it. By the time you realize what happened, the user has seen hundreds or thousands of characters of your system prompt.
Traditional output filtering doesn't work here. You can't buffer the entire response (latency kills UX), and you can't check each tiny chunk in isolation (a 5-word fragment doesn't look like a system prompt).
For general jailbreak prevention strategies, see Mitigate Jailbreaks and Prompt Injections. SSE redaction is a defense-in-depth measure for when those preventions fail.
The Architecture
SSE redaction works in four stages.
Stage 1 — Fingerprinting. Before any conversation happens, you extract a set of fingerprint phrases from your system prompt. These are distinctive strings that would only appear together if the model is reproducing its instructions. You want phrases spread across different sections of your prompt — role definitions, constraint names, behavioral rules. The number of fingerprints and the matching threshold are tunable parameters you keep secret.
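As a rough sketch of what Stage 1 produces (the phrases, count, and threshold below are placeholders for illustration, not our real configuration):

// Hypothetical fingerprint set. Real phrases, counts, and thresholds stay secret.
const FINGERPRINTS: string[] = [
  'a distinctive phrase from your role definition',
  'the exact heading of an internal constraint section',
  'a verbatim behavioral rule from late in the prompt',
];

const MATCH_THRESHOLD = 2; // tunable, and part of what you keep secret

function countFingerprintMatches(accumulated: string): number {
  // Count how many distinct fingerprints appear anywhere in the accumulated text.
  return FINGERPRINTS.filter((phrase) => accumulated.includes(phrase)).length;
}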
Stage 2 — Accumulation and periodic checking. As the model streams chunks, a guard accumulates the full response text. At regular intervals (measured by character count, not by chunk), it checks the accumulated content against the fingerprint set. Checking every chunk would be wasteful — the fingerprints need enough surrounding context to match meaningfully.
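A minimal guard sketch in TypeScript, reusing the helpers from the Stage 1 sketch above (the check interval is another illustrative value, not ours):

// Hypothetical streaming guard. CHECK_INTERVAL_CHARS is a placeholder value.
const CHECK_INTERVAL_CHARS = 400;

class SystemPromptLeakGuard {
  private accumulated = '';
  private lastCheckedLength = 0;

  // Called for every streamed chunk; throws once enough fingerprints match.
  onChunk(chunk: string): void {
    this.accumulated += chunk;
    // Skip the check until enough new characters have arrived since the last one.
    if (this.accumulated.length - this.lastCheckedLength < CHECK_INTERVAL_CHARS) return;
    this.lastCheckedLength = this.accumulated.length;

    if (countFingerprintMatches(this.accumulated) >= MATCH_THRESHOLD) {
      // SystemPromptLeakError is the typed error sketched under Stage 3 below.
      throw new SystemPromptLeakError('fingerprint threshold exceeded mid-stream');
    }
  }
}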
Stage 3 — Error propagation. When the guard detects enough fingerprint matches, it throws a typed error (in our case, SystemPromptLeakError). This is where the subtlety lives. In a streaming architecture, the chunk-processing loop typically has a try/catch for handling malformed SSE data (bad JSON, unexpected formats). That generic catch block will swallow your security error if you're not careful. You need a guard clause that re-throws your specific error type before the generic handler runs:
catch (e) {
  if (e instanceof Error && e.name === 'SystemPromptLeakError') throw e;
  // generic error handling continues for everything else
}

This is a one-liner, but without it, the entire detection system is inert. The guard fires, logs the detection, and the stream continues happily delivering your system prompt to the attacker. We learned this the hard way — our guard was detecting leaks perfectly in logs while doing absolutely nothing to stop them.
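For completeness, the typed error itself can be a few lines. A sketch (our actual class differs in detail):

// Minimal typed leak error. Setting `name` explicitly is what the re-throw
// guard above checks for, so detection doesn't depend on class identity alone.
class SystemPromptLeakError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'SystemPromptLeakError';
  }
}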
Stage 4 — Redaction. Once the error propagates up to the stream controller, it sends a redact SSE event to the client. The client replaces whatever was already rendered with a refusal message. The server simultaneously replaces the stored content in the database so the leak doesn't persist.
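A hedged sketch of the controller side. In the real pipeline the guard runs inside the chunk callback and the error propagates up through the re-throw guard; here it is collapsed into one loop for brevity, and the event names and helper types are illustrative stand-ins for your own SSE writer and persistence layer:

// Hypothetical stream-controller handling. SendEvent and ReplaceStored are
// placeholder types standing in for a real SSE writer and database layer.
type SendEvent = (event: string, data: unknown) => void;
type ReplaceStored = (conversationId: string, content: string) => Promise<void>;

const REFUSAL_MESSAGE = "I can't help with requests about my internal instructions.";

async function streamWithRedaction(
  chunks: AsyncIterable<string>,
  guard: SystemPromptLeakGuard,
  conversationId: string,
  sendEvent: SendEvent,
  replaceStored: ReplaceStored,
): Promise<void> {
  try {
    for await (const chunk of chunks) {
      guard.onChunk(chunk);               // may throw SystemPromptLeakError
      sendEvent('message', { delta: chunk });
    }
  } catch (e) {
    if (e instanceof Error && e.name === 'SystemPromptLeakError') {
      // The client replaces everything it has rendered with the refusal.
      sendEvent('redact', { message: REFUSAL_MESSAGE });
      // The server overwrites the stored content so the leak doesn't persist.
      await replaceStored(conversationId, REFUSAL_MESSAGE);
      return;
    }
    throw e; // unrelated errors keep their normal handling
  }
}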
What the User Sees
The attacker briefly sees partial streamed content — maybe a few seconds' worth — then the entire response gets replaced with a generic refusal. The experience: text appears, then vanishes and is replaced. The partial content they glimpsed is incomplete and mixed with normal response text, making it unreliable for extraction.
Learn more about how refusal messages work in Handle Refusals and Scope Limits.
The Catch Block Problem
This deserves emphasis because it's the kind of bug that passes every test but fails in production.
If you're using async generators for streaming, your SSE parsing loop probably looks like this:
for (const line of sseLines) {
  try {
    const data = JSON.parse(line);
    const text = extractText(data);
    await onChunkCallback(text); // <-- guard runs here
    yield text;
  } catch (e) {
    console.error('Error parsing chunk:', e);
    // continues to next line
  }
}

The callback is inside the try block. If the guard throws, the catch logs it as a parse error and moves on. In our case, the guard detected the leak correctly on every single chunk after the threshold — the logs showed SystemPromptLeakError firing repeatedly — while the stream completed normally, saved the full leaked prompt to the database, and sent it to the client.
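The corrected version of the same loop keeps the generic handling but lets the security error escape. This is the Stage 3 guard clause shown in context:

// Same parsing loop with the re-throw guard in place.
for (const line of sseLines) {
  try {
    const data = JSON.parse(line);
    const text = extractText(data);
    await onChunkCallback(text); // guard runs here and may throw
    yield text;
  } catch (e) {
    // Let the security error propagate up to the stream controller.
    if (e instanceof Error && e.name === 'SystemPromptLeakError') throw e;
    console.error('Error parsing chunk:', e);
    // parse errors still just skip to the next line
  }
}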
An additional complication: this behavior is runtime-dependent. An error thrown from a callback inside an async generator can propagate differently in Node.js than it does in Deno. Our tests passed in the Node.js test environment because the error happened to propagate. In Deno production, it was swallowed. If you're building this, test in your actual production runtime, not just your test runner.
Design Decisions Worth Noting
Why fingerprints instead of embedding similarity or exact matching? Fingerprints are fast (string matching), deterministic (no model calls), and robust against paraphrasing. The model rarely paraphrases its own system prompt during a leak — it reproduces it verbatim or near-verbatim. Embedding similarity adds latency per check and introduces false positive risk on legitimate compliance content. Exact substring matching is too brittle (whitespace, formatting differences).
Why check periodically instead of every chunk? Chunks are small (often 3–10 characters). A single chunk is meaningless for detection. Accumulating to a minimum threshold before checking reduces computation and ensures enough context for reliable matching.
Why not buffer the entire response? Buffering kills the streaming UX. Users expect to see text appear in real-time. A 2-second buffer is noticeable; buffering a full 4000+ character response is unacceptable. SSE redaction preserves real-time streaming for 99.99% of conversations and only intervenes during an active leak.
Why replace in the database too? If you only redact on the client, the leaked content persists server-side. Anyone with database access, any export feature, any conversation history endpoint would expose it.
What This Doesn't Solve
SSE redaction is a defense-in-depth measure, not a silver bullet.
It doesn't prevent the model from attempting to leak. That's what your system prompt's own instructions handle (explicit refusal instructions, constraint sections). SSE redaction is the safety net for when those instructions fail — and with enough creativity, jailbreaks do occasionally succeed.
It doesn't prevent leaks shorter than the detection threshold. If someone coaxes the model into revealing a single sentence of the system prompt, the fingerprint count won't hit the threshold. This is by design — you're trading off between catching full extractions (high confidence) and flagging partial mentions (high false positive risk).
The attacker does see partial content before redaction. For a few seconds, streamed text is visible. This is inherent to streaming architectures. The partial content is incomplete and lacks structure, but it's not zero exposure.
SSE redaction complements but doesn't replace system prompt security best practices. See System Prompts and Protect Workspace and Custom Instructions for foundational security measures.
Implementation Checklist
If you want to build this for your own LLM product:
Extract fingerprint phrases from your system prompt — choose distinctive, section-spanning strings.
Build a guard that accumulates streamed content and checks periodically against fingerprints.
Define a typed error class with a distinctive name for leak detection.
Audit every catch block in your streaming pipeline — add re-throw guards for your error type.
In your stream controller, handle the error by sending a redact event and replacing stored content.
On the client, handle the redact event by replacing rendered content with a refusal message.
Test in your production runtime, not just your test runner.
Keep your fingerprints, thresholds, and check intervals secret.
Closing Thought
The hardest part of this wasn't the detection algorithm — it was a one-line bug in a catch block that silently disabled the entire system. Security in streaming architectures fails at the plumbing level, not the algorithm level. If you're building LLM security features, trace the full error path from detection to user-facing action, and verify it in your actual production environment.
For a broader view of AI safety practices at ISMS Copilot, see AI Safety & Responsible Use Overview.
Better ISMS builds compliance tooling for information security teams. ISMS Copilot is our AI assistant for ISO 27001, SOC 2, GDPR, and related frameworks.