PII Handling in Retrieval-Augmented Generation Pipelines

RAG pipelines are frequently described as a way to give LLMs access to private organizational knowledge without fine-tuning. That description is accurate, and it partially explains why RAG architectures have spread rapidly into regulated-industry deployments. What the description omits is that inserting a retrieval step between the user's query and the model's response introduces a PII surface that doesn't exist in stateless LLM interactions: the retrieved document chunk.

The challenge isn't only that documents in the retrieval corpus may contain personal data. It's that the assembly process — taking a user query, retrieving k relevant chunks, constructing the final prompt — creates a composite input where PII from the corpus and PII from the user's own input arrive at the model together, often without clear demarcation. Standard redaction approaches designed for simple user-input scanning don't handle this composite correctly.

The Three PII Surfaces in a RAG Pipeline

A complete PII risk analysis for a RAG deployment needs to address three distinct surfaces, each requiring different controls:

Surface 1: The User Query

The user's input may contain PII directly: a name, an account number, or a date of birth entered to look up a record. This is the surface most teams are already thinking about. Detection tools such as Microsoft Presidio, which pairs Named Entity Recognition (NER) models with pattern-based recognizers, can scan for person names, phone numbers, email addresses, US SSNs, and a range of financial identifiers. The limitation of the NER component is that it is probabilistic, not deterministic. A person named "June March" may have their name classified as a month-month sequence rather than a PERSON entity. Partial names, such as "Smith from the Chicago branch", often pass NER without triggering redaction.

Surface 2: The Retrieved Chunks

This is the surface most teams underweight. Documents in an enterprise knowledge base — policy documents, case files, client records, meeting notes — frequently contain personal data about people other than the user making the query. A customer service representative asking "what's the return policy for electronics?" retrieves a chunk that happens to contain a verbatim example with a real customer's name from a historical case. That name travels to the LLM as part of the assembled context. The model didn't receive PII from the user's query — it received it from the retrieval corpus.

For HIPAA-covered entities, this is the PHI surface that clinical document RAG deployments need to handle carefully. A 12-hospital health system deploying an LLM over clinical documentation for administrative staff (a scenario common in deployments we've reviewed) surfaces PHI not because clinical records were the target of the query, but because PHI is incidentally present in retrieved context across a large corpus. HIPAA Safe Harbor requires removing 18 categories of direct and indirect identifiers before information can be treated as de-identified. A retrieved chunk that contains a patient's date of service, zip code, and age may not trigger any of the person-centric NER entity classes, yet service dates and full zip codes are themselves listed Safe Harbor identifiers, and together with age they form a quasi-identifier combination that can single out an individual. Such a chunk fails Safe Harbor even though no name appears in it.
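To make the combination problem concrete, here is a minimal sketch of a per-chunk check that runs after entity detection. The entity labels and the single rule are our own illustrative assumptions, not a standard taxonomy or a complete Safe Harbor rule set.

```python
# Hypothetical entity labels; a real detector's taxonomy will differ.
HEALTHCARE_COMBINATIONS = [
    {"AGE", "ZIP_CODE", "DATE_OF_SERVICE"},
]

def flag_combinations(detected_types, combinations=HEALTHCARE_COMBINATIONS):
    """Return each risky combination that is fully present in one chunk."""
    present = set(detected_types)
    return [combo for combo in combinations if combo <= present]
```

A chunk whose detected types include all three labels is flagged; a chunk with only an age and a zip code is not. Whether a flagged chunk is masked, dropped, or sent to review is a policy decision this sketch leaves open.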

Surface 3: The Assembled Prompt

The assembled prompt that reaches the model is the synthesis of surfaces 1 and 2 plus the system instructions. PII from the user query and PII from retrieved chunks may reinforce each other — a retrieved chunk about "J. Smith, account #4521-xxx" combined with a user query mentioning "John Smith's account" creates a combined context that's more privacy-invasive than either piece alone, because the model can now link the two. This cross-surface linkage is not detectable by scanning each surface independently; it requires scanning the final assembled prompt as a unit before dispatch to the model.
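As a rough illustration of why the surfaces have to be compared with each other, the sketch below flags capitalized tokens that appear in both the query and a retrieved chunk. This is a deliberately crude stand-in for real entity detection (it will also catch ordinary sentence-initial words), and the function name and pattern are assumptions of ours.

```python
import re

# Crude proxy for "looks like a name": a capitalized word of two or more letters.
NAME_TOKEN = re.compile(r"\b[A-Z][a-z]+\b")

def cross_surface_links(query, chunks):
    """Return capitalized tokens present in both the query and any chunk."""
    query_tokens = set(NAME_TOKEN.findall(query))
    chunk_tokens = set()
    for chunk in chunks:
        chunk_tokens.update(NAME_TOKEN.findall(chunk))
    return sorted(query_tokens & chunk_tokens)
```

With the example above, the query "Summarize John Smith's account" and the chunk "J. Smith, account #4521-xxx" share the token "Smith". Neither surface scanned alone would reveal that link.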

Deterministic vs. Probabilistic Redaction: The Tradeoff

The fundamental tension in PII redaction for RAG is between deterministic and probabilistic detection methods. Deterministic approaches (pattern matching, allowlist/blocklist rules, structured field recognition) have high precision on the entity types they cover and fully predictable behavior: a regex for a 9-digit SSN in ###-##-#### format either matches a string or it does not. A match can still be wrong, since an order number in the same format is not an SSN, but the same input always produces the same decision, which makes the layer straightforward to test and audit. Their coverage is limited to the entity types explicitly enumerated.
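A deterministic layer can be as small as a table of patterns. The sketch below is a minimal example under our own assumptions: the two patterns and the placeholder labels are illustrative, and a production rule set would be considerably broader.

```python
import re

# Hypothetical patterns for illustration only.
PATTERNS = {
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}

def redact_deterministic(text):
    """Replace every pattern match with a typed placeholder and count hits."""
    counts = {}
    for label, pattern in PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        if n:
            counts[label] = n
    return text, counts
```

Calling `redact_deterministic("SSN 123-45-6789, mail a.b@example.com")` returns the text `"SSN [US_SSN], mail [EMAIL]"` together with one count per entity type. The pattern matches the format, not the meaning, so any string shaped like an SSN is masked.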

Probabilistic approaches (NER models, transformer-based classifiers) have broader coverage across free-text entity types but nonzero false positive and false negative rates. In our early pilot work, NER miss rates on healthcare-relevant indirect identifiers (job titles in combination with departments, geographic sub-regions with age ranges) ran meaningfully higher than on canonical entity types like phone numbers and email addresses. The miss rate isn't a model quality problem. It reflects the open-ended nature of indirect identifiers under the GDPR data minimization principle (Art. 5(1)(c), applied through Art. 25 data protection by design) and HIPAA Safe Harbor category 18 (any other unique identifying number, characteristic, or code).

We're not saying probabilistic redaction is inadequate. We're saying that probabilistic redaction alone, without a deterministic layer for high-confidence entity types and a review process for the residual, is not a defensible PII handling architecture for regulated deployments.
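One way to picture the layering is sketched below. The deterministic rule always masks, high-confidence probabilistic findings are masked automatically, and the rest go to a review queue. The `Finding` type, the 0.85 threshold, and the `ner_detector` callable are assumptions for illustration. Routing low-confidence findings to review is only one interpretation of a review process, and it does not cover entities the detector misses entirely.

```python
import re
from dataclasses import dataclass

@dataclass
class Finding:
    start: int
    end: int
    label: str
    score: float  # 1.0 for deterministic matches

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def layered_scan(text, ner_detector, auto_threshold=0.85):
    """Split findings into spans to mask now and spans to send for review."""
    # Deterministic layer: every pattern match is masked.
    mask = [Finding(m.start(), m.end(), "US_SSN", 1.0) for m in SSN.finditer(text)]
    review = []
    # Probabilistic layer: ner_detector is any callable returning Finding objects.
    for finding in ner_detector(text):
        (mask if finding.score >= auto_threshold else review).append(finding)
    return mask, review
```

The threshold is a tuning knob, not a recommendation. Lowering it masks more text automatically at the cost of more false positives in the prompt.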

Embedding-Stage Redaction vs. Retrieval-Stage Redaction

There are two places in a RAG pipeline where PII redaction can occur: at indexing time (when documents are chunked and embedded into the vector store) and at retrieval time (when chunks are retrieved and assembled into the prompt). Both have merits; neither is sufficient alone.

Embedding-stage redaction (indexing time) removes or masks PII before the chunk enters the vector store. The benefit is that redacted chunks are embedded in a way that reflects the redacted content — the vector representation is of the scrubbed text, not the original. The limitation is that redaction decisions made at indexing time are fixed; if your redaction policy changes (new entity types added, Safe Harbor interpretation updated), you need to re-index. More importantly, embedding-stage redaction can change the semantic content of the chunk in ways that reduce retrieval accuracy — a medical record with patient identifiers redacted may no longer retrieve correctly for queries about that patient type, even when the query contains no identifying information.

Retrieval-stage redaction (at prompt assembly) applies redaction to chunks after they're retrieved, before they enter the assembled prompt. This preserves the original semantics in the vector store and allows policy changes without re-indexing. The limitation is that the original PII is present in the vector store and could be retrieved by a sufficiently direct query or, in poorly isolated multi-tenant deployments, by a cross-tenant query. Retrieval-stage redaction must therefore be combined with corpus access controls that prevent the retrieval step itself from becoming a PII exfiltration vector.

The architecture that handles regulated industry requirements most reliably combines both: embedding-stage redaction for the highest-sensitivity categories (SSNs, account numbers, PHI direct identifiers), and retrieval-stage redaction applied to all retrieved chunks before prompt assembly to catch residual PII that passed the indexing-stage scan.
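The split between the two stages might look like the following sketch. The rule sets are hypothetical: an SSN pattern stands in for the highest-sensitivity categories handled at indexing time, and an email pattern stands in for residual PII caught at retrieval time.

```python
import re

# Hypothetical rule sets: the high-sensitivity subset runs at indexing time,
# the full set runs again on every retrieved chunk.
HIGH_SENSITIVITY = {"US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b")}
ALL_RULES = {
    **HIGH_SENSITIVITY,
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}

def mask(text, rules):
    """Replace each rule's matches with a typed placeholder."""
    for label, pattern in rules.items():
        text = pattern.sub(f"[{label}]", text)
    return text

def redact_for_index(chunk):
    """Text that gets embedded and stored in the vector store."""
    return mask(chunk, HIGH_SENSITIVITY)

def redact_for_prompt(retrieved_chunks):
    """Second pass over retrieved chunks, just before prompt assembly."""
    return [mask(chunk, ALL_RULES) for chunk in retrieved_chunks]
```

Because the retrieval-stage rules live outside the index, adding a new entity type there takes effect on the next query. Only changes to the indexing-time subset force a re-index.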

Citation Provenance and the Privacy Problem It Creates

Enterprise RAG systems frequently include citation provenance — references to the source document and chunk that supported a generated answer. Citation provenance is valuable for auditability and for allowing users to verify AI-generated summaries. But it creates a secondary PII surface: if a citation points to a document chunk that contains personal data, the citation itself becomes a pathway to that data.

Citation handling in regulated deployments should redact source document identifiers that are themselves personal data (case file numbers that identify individuals, for example), limit citation depth to document-level rather than chunk-level when chunks contain sensitive data, and log citation provenance in the audit trail so that the full retrieval context can be reconstructed for incident investigation without being re-exposed in the user-facing response.
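These three rules can be sketched as two views over the same retrieval result, one for the user and one for the audit trail. The `Chunk` fields and the `[REDACTED_SOURCE]` placeholder are hypothetical names chosen for this example.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    chunk_id: str
    sensitive: bool                   # set when the redaction layer found PII
    doc_id_is_personal: bool = False  # e.g. a case file number tied to a person

def user_citation(chunk):
    """Citation shown to the user: document-level only for sensitive chunks."""
    doc = "[REDACTED_SOURCE]" if chunk.doc_id_is_personal else chunk.doc_id
    if chunk.sensitive:
        return {"doc": doc}
    return {"doc": doc, "chunk": chunk.chunk_id}

def audit_citation(chunk):
    """Full provenance, written to the audit trail and never to the response."""
    return {"doc": chunk.doc_id, "chunk": chunk.chunk_id}
```

The audit view keeps everything needed to reconstruct the retrieval context, so it needs access controls at least as strict as those on the corpus.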

Vector Store Leakage and k-Anonymity Considerations

A less frequently discussed PII risk in RAG is vector store leakage through embedding inversion. While inverting a modern embedding to recover the original text is computationally expensive and, for many enterprise threat models, not the primary concern, the vector store itself is a data asset that can contain PHI or PII-adjacent information in the indexed corpus. NIST SP 800-53 SC-4 (Information in Shared System Resources) applies: data from one organizational unit should not be accessible via the vector index to another unit. Namespace isolation per tenant in vector stores (covered in more depth in our separate post on tenant isolation patterns) addresses this directly.
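The idea behind namespace isolation can be shown with a toy in-memory store. This is a sketch, not any vector database's API, and the class and method names are ours. The point is that the tenant identifier selects the search space before any similarity is computed, rather than filtering results afterwards.

```python
import math
from collections import defaultdict

class NamespacedStore:
    """Toy vector store with one isolated namespace per tenant."""

    def __init__(self):
        self._data = defaultdict(list)  # tenant_id -> [(vector, text)]

    def add(self, tenant_id, vector, text):
        self._data[tenant_id].append((vector, text))

    def query(self, tenant_id, vector, k=3):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        # Only the caller's namespace is ever searched.
        candidates = self._data.get(tenant_id, [])
        ranked = sorted(candidates, key=lambda item: cosine(vector, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]
```

In a real deployment the `tenant_id` would come from the authenticated session, never from the request body.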

Beyond cross-tenant leakage, k-anonymity and l-diversity concepts from the de-identification literature are relevant for evaluating whether your retrieved-chunk corpus satisfies Safe Harbor. A chunk that alone doesn't identify an individual but that, combined with k-1 other retrievable chunks, becomes identifying, represents a quasi-identifier risk that single-chunk analysis won't catch. This is a research-level problem for full automated analysis, but for high-sensitivity corpora (clinical notes, HR records, financial case files), periodic human review of the retrieval results for common query patterns is a practical risk mitigation approach.
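Full automated analysis is hard, but the basic k-anonymity measurement is simple enough to support that human review. The sketch below assumes quasi-identifier values have already been extracted from the chunks retrievable for a query pattern. The field names are hypothetical.

```python
from collections import Counter

def violates_k_anonymity(records, quasi_identifiers, k):
    """Return quasi-identifier value combinations shared by fewer than k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return [combo for combo, size in groups.items() if size < k]
```

For example, given three records where two share `("30-39", "606")` for age band and three-digit zip and one has `("80-89", "606")`, calling the function with `k=2` returns only the second combination. That is the group a reviewer should look at first.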

What the Final Assembled Prompt Should Look Like

After retrieval-stage redaction, the assembled prompt that reaches the model should have the following properties:

  • No direct identifiers: names, email addresses, phone numbers, SSNs, account numbers replaced with typed placeholders ([PERSON], [ACCOUNT_NUMBER])
  • Quasi-identifiers reviewed for combination risk: age + zip + date-of-service combinations flagged for the healthcare context; job title + department + salary band flagged in HR contexts
  • Redaction metadata captured in the audit log: entity types and counts detected and masked, so compliance teams can verify the layer was functioning without needing to access the original content
  • Prompt hash logged at the policy plane before dispatch: this is the evidence that the scrubbed prompt — not the original — was what the model received
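The last two properties can be sketched as a single audit record. The record layout is an assumption for illustration. It stores only entity types, counts, and a SHA-256 digest of the scrubbed prompt, never the prompt text.

```python
import hashlib
from collections import Counter

def audit_record(scrubbed_prompt, masked_labels):
    """Build a log entry with redaction counts and a hash of the scrubbed prompt."""
    return {
        "prompt_sha256": hashlib.sha256(scrubbed_prompt.encode("utf-8")).hexdigest(),
        "redactions": dict(Counter(masked_labels)),
    }
```

If the dispatched prompt is hashed the same way on the model-facing side, matching digests show that the scrubbed version is the one that left the policy plane.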

For an Am Law 50 firm deploying a matter research assistant over client documents, the assembled prompt that reaches the model should contain no client-identifying information unless the lawyer has explicitly included it in their query and the policy configuration for that use case permits client PII in the model context. The retrieval corpus may reference those clients extensively; the redaction layer is what ensures the distinction between "this document discusses a client" and "this prompt sends that client's data to a third-party model API."

Building this correctly requires treating the policy plane (the layer that assembles, scans, and dispatches prompts) as the enforcement point, not the individual application that calls it. When PII handling is implemented application-by-application, it drifts. Policy gaps appear when a new application is added that doesn't inherit the same redaction configuration. A centralized proxy layer through which all LLM calls must pass enforces consistent PII handling across the full set of applications and use cases without requiring each application team to implement it independently.
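In code, the enforcement point reduces to one class that every application calls instead of the model API. The sketch below is a minimal outline under our own assumptions: `redact` and `call_model` are injected callables, and the class and method names are hypothetical.

```python
import hashlib

class PolicyPlane:
    """Single enforcement point: assemble, scan, log, then dispatch."""

    def __init__(self, redact, call_model):
        self._redact = redact          # callable: text -> (scrubbed_text, counts)
        self._call_model = call_model  # callable: prompt -> response
        self.audit_log = []

    def complete(self, app_id, system, chunks, query):
        assembled = "\n\n".join([system, *chunks, query])
        # Scan the assembled prompt as one unit, not each surface separately.
        scrubbed, counts = self._redact(assembled)
        self.audit_log.append({
            "app": app_id,
            "prompt_sha256": hashlib.sha256(scrubbed.encode("utf-8")).hexdigest(),
            "redactions": counts,
        })
        return self._call_model(scrubbed)
```

Applications never hold model credentials in this design, so a new application cannot skip redaction by accident. It either goes through `complete` or it cannot reach the model at all.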