Audit Log Design for AI Systems: What Auditors Actually Ask

Kevin McGrath, June 12, 2025

When an enterprise compliance team asks for documentation of your AI system's behavior, the first question isn't "does your model produce good outputs?" It's "can you show me what it processed, when, for whom, and under which version of your configuration?" That's a logging and evidence-preservation question, not an AI quality question. And in our experience working through compliance reviews with data-governance teams, most LLM deployments fail it before anyone looks at model behavior at all.

This post catalogs the log fields that auditors actually ask for — not what engineering teams assume auditors want, but what surfaces in practice when a compliance inquiry hits an AI deployment in a regulated environment.

What Auditors Are Looking For (And Why It Differs from Engineering Logs)

Engineering logs are designed for debugging: stack traces, latency metrics, error codes. Compliance audit logs are designed for evidence preservation: they need to answer who, what, when, and whether the system was operating within approved parameters at the time. These are different design goals.

Consider a scenario that reflects patterns we've reviewed: a growing financial institution deploys an LLM summarization tool for wealth advisors. Six months after go-live, an OCC examination team conducting a horizontal review under Heightened Standards asks for records of what client data the model processed during a specific two-week window. The engineering logs contain request IDs and latency percentiles. They don't contain the assembled prompt (scrubbed of the actual data before logging for "privacy reasons"). The tenant-scoped user identifier wasn't captured. There is no way to reconstruct what the model received, so there is no way to demonstrate that PII handling controls were operating correctly at the time. The gap is not a model problem. It's an audit log design problem.

The Core Schema: Fields Auditors Ask For

Based on what we've catalogued across financial services, healthcare, and public sector compliance reviews, these are the fields an LLM audit log must contain to survive scrutiny:

Immutable Identity Fields

request_id: A globally unique identifier for the LLM call. This should not be an application-level ID: it must be generated at the proxy layer before the call is made, so that the same request ID appears across all log systems.
tenant_id: The organizational or departmental tenant making the call. In multi-department deployments, this is the field that lets auditors scope their review to the relevant organizational unit.
user_id: The authenticated user, not the application service account. Auditors reviewing financial services deployments under FINRA Reg Notice guidance on electronic communications frequently ask for user-level traceability, not just application-level.
session_id: Groups multiple turns of a conversation. An individual request_id doesn't reconstruct a multi-turn exchange; session_id does.
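
As a minimal sketch of stamping these four fields at the proxy layer, assuming the proxy already knows the authenticated tenant, user, and session: the CallIdentity type, the stamp_identity function, and the "req_" prefix are hypothetical names chosen for illustration, not part of any particular product.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class CallIdentity:
    """Identity fields attached to an LLM call before it leaves the proxy."""
    request_id: str
    tenant_id: str
    user_id: str
    session_id: str


def stamp_identity(tenant_id: str, user_id: str, session_id: str) -> CallIdentity:
    # Refuse to log a call with a missing identifier: silently falling back
    # to a service account would defeat user-level traceability.
    for name, value in (
        ("tenant_id", tenant_id),
        ("user_id", user_id),
        ("session_id", session_id),
    ):
        if not value:
            raise ValueError(f"{name} is required for audit logging")
    # uuid4 yields a random 128-bit identifier; the "req_" prefix is an
    # illustrative convention only.
    return CallIdentity(
        request_id=f"req_{uuid.uuid4().hex}",
        tenant_id=tenant_id,
        user_id=user_id,
        session_id=session_id,
    )
```

Freezing the dataclass means downstream code cannot quietly overwrite the identity once the proxy has assigned it.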

Temporal and Version Fields

timestamp_utc: Unix epoch time in UTC, not application-server local time. Auditors working across timezones need a single reference frame.
model_version: The exact model identifier (including provider version string). Model behavior changes between versions; knowing which model was running at the time of a particular output is essential for incident reconstruction.
system_prompt_version_hash: A SHA-256 hash of the system prompt configuration active at the time of the call. This is one of the most commonly missing fields. Without it, you cannot demonstrate that the approved policy configuration was in effect when a specific call was processed.
policy_config_version: Separate from the system prompt hash — the version tag of the redaction and routing policy configuration.
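
The temporal and version fields can be assembled along these lines. This is a sketch under one assumption: the system prompt configuration is a single text string. If yours also includes tools or parameters, you would hash a canonical serialization of the whole configuration instead. The function names are hypothetical.

```python
import hashlib
import time


def sha256_tag(text: str) -> str:
    """Return a 'sha256:<hex digest>' tag for a UTF-8 string."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def version_fields(system_prompt: str, model_version: str,
                   policy_config_version: str) -> dict:
    return {
        # time.time() counts seconds since the Unix epoch, which is defined
        # in UTC and does not depend on the server's local timezone.
        "timestamp_utc": int(time.time()),
        # Exact provider version string, passed through unchanged.
        "model_version": model_version,
        # Hash of the prompt text actually in effect for this call.
        "system_prompt_version_hash": sha256_tag(system_prompt),
        "policy_config_version": policy_config_version,
    }
```

Computing the hash from the prompt in use at call time, rather than copying a version label from a config file, is what lets the field serve as evidence of which configuration was really active.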

Payload Fields (with Redaction)

prompt_hash: SHA-256 of the pre-model assembled prompt (post-redaction). Auditors don't necessarily need to read the prompt; they need to be able to verify that a specific prompt was processed. The hash provides that without re-exposing the data.
redaction_entities_detected: A count and type list of PII/PHI entities detected and redacted before the call was made. This is what demonstrates to HIPAA compliance teams that the redaction layer was functioning.
response_hash: SHA-256 of the model response. Same logic as prompt_hash — proves the response was captured without necessarily re-exposing content.
output_filter_result: Whether the output was flagged, modified, or blocked by any post-generation policy. This is the field that demonstrates the control was active, not just configured.
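
A sketch of how the payload fields might be produced. The regex-based redact function below is a toy stand-in for a real PII/PHI detection layer, and the entity names, patterns, and placeholder format are assumptions made for illustration only.

```python
import hashlib
import re

# Toy detectors for illustration only. A production redaction layer would
# use a dedicated PII/PHI detection service, not two regular expressions.
PATTERNS = {
    "ACCOUNT_NUMBER": re.compile(r"\b\d{8,12}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def redact(text: str) -> tuple:
    """Replace detected entities with placeholders and count them by type."""
    counts = {}
    for entity_type, pattern in PATTERNS.items():
        text, n = pattern.subn(f"[{entity_type}]", text)
        if n:
            counts[entity_type] = n
    return text, counts


def sha256_tag(text: str) -> str:
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def payload_fields(assembled_prompt: str, response: str,
                   output_filter_result: str = "PASS") -> dict:
    redacted_prompt, entities = redact(assembled_prompt)
    return {
        # Hash the post-redaction prompt, i.e. what the model receives.
        "prompt_hash": sha256_tag(redacted_prompt),
        "redaction_entities_detected": entities,
        "response_hash": sha256_tag(response),
        "output_filter_result": output_filter_result,
    }
```

Because the hash is taken after redaction, anyone holding the redacted prompt can confirm it matches the log entry without the original client data being stored alongside it.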

Immutability: The Non-Negotiable Requirement

ISO 27001 A.12.4 (Logging and Monitoring, in the 2013 edition's Annex A; the 2022 revision covers logging under A.8.15) requires that event logs be protected against tampering and unauthorized access. Where AI systems are in scope for financial reporting, SOX 404 IT General Controls (ITGCs) require that the integrity of audit evidence be demonstrable. The field auditors ask for, and that most teams haven't implemented, is an immutability proof: not just that the log exists, but that it hasn't been modified since the event was written.

In practice, this means one of two architectural patterns:

Append-only Postgres with S3 Object Lock: Write the log entry to an append-only Postgres table (row-level insert, no updates, no deletes), then archive a chained-hash snapshot to S3 with Object Lock (WORM, Write Once Read Many) storage enabled. The chain works as follows: each log entry includes a prev_hash field containing the SHA-256 of the previous entry. Auditors can verify that no entry has been deleted or modified by re-computing the chain forward from any known-good starting point.

Dedicated WORM log storage: Some regulated industries prefer a purpose-built append-only log store rather than a general-purpose database. The key requirement is the same: write-once semantics, cryptographic proof of non-tampering, and a retention period that matches the regulatory mandate. For financial services under FINRA Reg Notice guidance, a three-year minimum is standard; for healthcare under HIPAA, six years for covered entity records.

We're not saying that a structured application log in a standard database is worthless. We're saying that without immutability proof, it doesn't satisfy the evidence-preservation requirement that compliance teams will ask about — and retrofitting immutability onto an existing logging architecture after a compliance inquiry begins is expensive and sometimes impossible.
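
The "insert only, no updates, no deletes" rule from the first pattern can be enforced in the database itself. The sketch below uses Python's built-in sqlite3 module as a stand-in so that it runs anywhere; it is not Postgres, where the same effect is usually achieved with triggers or by revoking UPDATE and DELETE privileges from the writing role. Table, column, and trigger names are hypothetical, and the S3 Object Lock archival step is not shown.

```python
import sqlite3

SCHEMA = """
CREATE TABLE audit_log (
    seq        INTEGER PRIMARY KEY AUTOINCREMENT,
    request_id TEXT NOT NULL UNIQUE,
    entry_json TEXT NOT NULL,
    prev_hash  TEXT NOT NULL
);
CREATE TRIGGER audit_log_no_update BEFORE UPDATE ON audit_log
BEGIN
    SELECT RAISE(ABORT, 'audit_log is append-only');
END;
CREATE TRIGGER audit_log_no_delete BEFORE DELETE ON audit_log
BEGIN
    SELECT RAISE(ABORT, 'audit_log is append-only');
END;
"""


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def append(conn: sqlite3.Connection, request_id: str,
           entry_json: str, prev_hash: str) -> None:
    # "with conn" commits on success and rolls back on error.
    with conn:
        conn.execute(
            "INSERT INTO audit_log (request_id, entry_json, prev_hash) "
            "VALUES (?, ?, ?)",
            (request_id, entry_json, prev_hash),
        )


def is_rejected(conn: sqlite3.Connection, sql: str) -> bool:
    """Return True if the database refuses to run the statement."""
    try:
        with conn:
            conn.execute(sql)
    except sqlite3.DatabaseError:
        return True
    return False
```

Triggers like these stop accidental or application-level edits, but a database administrator can still drop them, which is why the pattern pairs the table with WORM storage and a hash chain rather than relying on the database alone.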

What a Chain-Hash Entry Looks Like in Practice

A minimal chained-hash audit record for an LLM call, in JSON form:

{
  "request_id": "req_01j2k3m4n5p6q7",
  "tenant_id": "wealth-advisory-east",
  "user_id": "usr_b3c9d12",
  "session_id": "sess_x7y8z9",
  "timestamp_utc": 1747286400,
  "model_version": "gpt-4o-2024-08-06",
  "system_prompt_version_hash": "sha256:a1b2c3d4...",
  "policy_config_version": "v2.3.1",
  "prompt_hash": "sha256:e5f6a7b8...",
  "redaction_entities_detected": {"PERSON": 2, "ACCOUNT_NUMBER": 1},
  "response_hash": "sha256:c9d0e1f2...",
  "output_filter_result": "PASS",
  "prev_hash": "sha256:9a0b1c2d..."
}

The prev_hash field chains each entry to the previous one. Any deletion or modification of a historical record breaks the chain, which is detectable by re-computing the SHA-256 forward from any known-good starting point. Auditors can verify the chain without accessing the underlying payload content — which means you can give them chain-verification access without giving them access to user data.
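
The chaining and verification described above can be sketched as follows. Two details are assumptions that the record format does not specify: each entry is hashed as canonical JSON (sorted keys, compact separators), and the first entry points at an all-zero GENESIS anchor. Function names are hypothetical.

```python
import hashlib
import json
from typing import Optional

# Hypothetical anchor value used as prev_hash for the very first entry.
GENESIS = "sha256:" + "0" * 64


def entry_hash(entry: dict) -> str:
    # Canonical JSON so the same record always produces the same bytes.
    canonical = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_entry(chain: list, fields: dict) -> dict:
    prev = entry_hash(chain[-1]) if chain else GENESIS
    entry = {**fields, "prev_hash": prev}
    chain.append(entry)
    return entry


def verify_chain(chain: list, anchor: str = GENESIS,
                 expected_head: Optional[str] = None) -> Optional[int]:
    """Return None if the chain is intact, else the index where it breaks.

    A break at index i means entry i does not point at the hash of the
    entry before it. If expected_head is given and the hash of the final
    entry differs from it, len(chain) is returned.
    """
    expected = anchor
    for i, entry in enumerate(chain):
        if entry["prev_hash"] != expected:
            return i
        expected = entry_hash(entry)
    if expected_head is not None and expected != expected_head:
        return len(chain)
    return None
```

One property worth noting: a change to entry i surfaces at entry i + 1, so a change to the newest entry is only detectable when its hash has been recorded somewhere else. In this sketch that is the expected_head argument; in the architecture above, the snapshot archived to WORM storage can play that role.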

Retention, Queryability, and the Auditor's Actual Workflow

A log that can't be queried by tenant_id + date_range in under 30 seconds during a live audit is effectively useless, even if it's immutable. The most common operational failure we see after the schema problem is fixed is that the log exists in S3 but there's no query interface. An auditor who asks "show me all AI calls made by the wealth-advisory-east team between March 1 and March 15" cannot wait for a data engineer to write a one-off Athena query.

The log schema should have indexes on tenant_id, user_id, session_id, and timestamp_utc from day one. The query interface — whether it's a web UI, a CLI, or an API endpoint — should be available to compliance personnel without requiring engineering support for standard audit queries.
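
A minimal sketch of the standard audit query, again using Python's built-in sqlite3 as a stand-in for whatever store actually backs the log. The composite indexes (tenant or user plus timestamp) are one reasonable way to cover the fields listed above, not the only one; table, index, and function names are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE audit_log (
    request_id    TEXT PRIMARY KEY,
    tenant_id     TEXT NOT NULL,
    user_id       TEXT NOT NULL,
    session_id    TEXT NOT NULL,
    timestamp_utc INTEGER NOT NULL
);
CREATE INDEX idx_tenant_time ON audit_log (tenant_id, timestamp_utc);
CREATE INDEX idx_user_time   ON audit_log (user_id, timestamp_utc);
CREATE INDEX idx_session     ON audit_log (session_id);
"""


def open_log(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def utc_epoch(year: int, month: int, day: int) -> int:
    """Midnight UTC on the given date, as Unix epoch seconds."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def calls_for_tenant(conn: sqlite3.Connection, tenant_id: str,
                     start_utc: int, end_utc: int) -> list:
    """Request IDs for one tenant in the half-open window [start, end)."""
    rows = conn.execute(
        "SELECT request_id FROM audit_log "
        "WHERE tenant_id = ? AND timestamp_utc >= ? AND timestamp_utc < ? "
        "ORDER BY timestamp_utc",
        (tenant_id, start_utc, end_utc),
    )
    return [row[0] for row in rows]
```

With a half-open window, "between March 1 and March 15" becomes a start of midnight March 1 and an end of midnight March 16, so calls made late on the 15th are included without any end-of-day arithmetic.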

The Audit Trail as Operational Infrastructure

Teams that build LLM audit logs solely to satisfy a compliance checklist tend to build them too late, too thin, and in a way that's disconnected from the operational monitoring their engineers actually use. The more sustainable posture is to treat the audit trail as primary infrastructure — the record from which both engineering observability and compliance evidence are derived.

That design decision (single source of truth, dual consumers: engineering + compliance) avoids the most common failure mode we've seen: two separate log systems that drift out of sync, one for engineering dashboards and one for compliance submissions, with no reliable way to prove they describe the same events. When an auditor asks for evidence of a specific incident, the engineering log says one thing and the compliance log says another. That is a worse outcome than having only one log.

Meibel's audit trail captures the exact fields described in this post by default.
