Most AI-safety advice stops at principles. "Detect sensitive data." "Keep an audit trail." "Don't send private data to third parties." Little is wrong with the principles themselves. The gap is that a principle you cannot enforce on every code path is a wish, not a control. A redaction rule one route skips, an audit log an admin can rewrite, a "local-only" flag a background job ignores: each reads as safety on the page and leaks in production.
This is a framework for operating LLMs safely that is organized around enforceability. It has seven pillars. For each one we state the principle (the one-sentence guarantee), the failure mode (what goes wrong when it is absent), and the reference control (how a gateway actually implements it). It is principle-based on purpose — it describes the failure a safe system must prevent, never a specific incident.
We build an open-source gateway, Membrain, that implements each of these, so "is this enforceable?" has a concrete answer rather than a hand-wave. Where a pillar names a control, the Membrain implementation is noted in parentheses. You do not need to run Membrain for the framework to be useful — the seven questions stand on their own.
The single test that runs through all seven pillars: a control is only as strong as the weakest path it is supposed to cover. Ask of every rule: on which paths does it run, what happens when it errors, and who can bypass or forge it?
The Seven Pillars
Detection — know what is in the traffic
Principle. Sensitive content (PII, secrets, regulated data) and risky content (prompt injection, tool-description poisoning) must be identified in both directions before it crosses a trust boundary.
Failure mode. What you cannot see, you cannot govern. Undetected secrets ship to third-party providers; undetected injection rewrites the agent's intent. Exfiltration happens on the response path too, so scanning only the request is half a control.
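Normalize before you match (Unicode NFKC, encoding variants): homoglyph and width tricks slip past naive patterns. NFKC folds width and compatibility variants; cross-script look-alikes, such as a Cyrillic letter standing in for a Latin one, need a confusables mapping as well. Combine deterministic patterns with ML/NER and take the union of overlapping matches, never just the first. Let users register exact-match "guarded values" that are always caught regardless of confidence thresholds.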
Reference control: a scanning service over a tunable pattern + NER engine with guarded-value overlays (Membrain: PIIService / PIIScanner, 25+ categories).
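The detection steps above (normalize, match, union the overlaps, overlay guarded values) can be sketched in a few lines. This is a minimal illustration, not Membrain's PIIScanner: the two patterns, the `[REDACTED]` placeholder, and the function names are assumptions made for the example, and there is no ML/NER stage.

```python
import re
import unicodedata

# Hypothetical patterns for illustration; a real engine carries many more.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
}


def normalize(text):
    # NFKC folds full-width and other compatibility forms.
    # It does not map cross-script look-alikes.
    return unicodedata.normalize("NFKC", text)


def find_spans(text, guarded_values=()):
    """Return merged (start, end) spans over already-normalized text."""
    spans = []
    for pattern in PATTERNS.values():
        spans.extend(match.span() for match in pattern.finditer(text))

    # Guarded values are exact matches: no confidence threshold applies.
    for value in guarded_values:
        value = normalize(value)
        if not value:
            continue
        start = text.find(value)
        while start != -1:
            spans.append((start, start + len(value)))
            start = text.find(value, start + 1)

    # Take the union of overlapping matches instead of keeping the first.
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def redact(raw, guarded_values=()):
    """Normalize, detect, and replace every merged span."""
    text = normalize(raw)
    pieces, cursor = [], 0
    for start, end in find_spans(text, guarded_values):
        pieces.append(text[cursor:start])
        pieces.append("[REDACTED]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

Merging spans before redacting matters when a pattern and a guarded value overlap: redacting only the first match would leave the tail of the second on the wire.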
Enforcement — act on what you detect
Principle. Detection without action is theater. Every finding maps to a declared action: pass, log, alert, redact, confirm, or block.
Failure mode. Computing a redaction and then forwarding the original; logging a "blocked" event while the request proceeds. The audit says safe; the wire says leak.
Make policy per-tenant and declarative (categories → actions). Fail closed: if the scanner errors, reject — do not forward. Enforce the redacted artifact on the wire, so what you logged equals what you sent. No "skip" mode bypasses non-negotiables — registered secrets stay redacted in every mode.
Reference control: a middleware pipeline that mutates the outgoing payload and can short-circuit a request (Membrain: rate-limit → budget → PII/data-policy → tool-policy → cache → knowledge).
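A rough sketch of the enforcement rules above: a declarative category-to-action policy, fail-closed handling of scanner errors, and a single returned artifact that is both logged and sent. The names `Action`, `Finding`, `Rejected`, and `enforce` are hypothetical. The sketch also assumes the scanner returns non-overlapping spans, blocks unknown categories by default, and treats `confirm` as `block` because it has no interactive channel.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PASS = "pass"
    LOG = "log"
    ALERT = "alert"
    REDACT = "redact"
    CONFIRM = "confirm"
    BLOCK = "block"


@dataclass(frozen=True)
class Finding:
    category: str
    start: int
    end: int


class Rejected(Exception):
    """The request must not be forwarded."""


def enforce(payload, scan, policy):
    """Return the only payload the caller may forward, or raise Rejected.

    `scan` is any callable returning Finding objects; `policy` maps a
    category name to an Action (per tenant, in a real deployment).
    """
    try:
        findings = list(scan(payload))
    except Exception as exc:
        # Fail closed: a scanner error rejects instead of forwarding.
        raise Rejected("scanner error, failing closed") from exc

    outgoing = payload
    # Rewrite right to left so earlier offsets stay valid.
    for finding in sorted(findings, key=lambda f: f.start, reverse=True):
        # Assumption: a category missing from the policy is blocked.
        action = policy.get(finding.category, Action.BLOCK)
        if action in (Action.BLOCK, Action.CONFIRM):
            raise Rejected(f"{finding.category}: {action.value}")
        if action is Action.REDACT:
            label = f"[{finding.category.upper()}]"
            outgoing = outgoing[:finding.start] + label + outgoing[finding.end:]
    return outgoing
```

Because `enforce` hands back one string, the caller has no original to forward by mistake: whatever it logs and whatever it sends are the same object.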
Memory — govern what the system remembers
Principle. Stored context — RAG knowledge, caches, transcripts — is an attack surface and a data-residency obligation, not a free convenience.
Failure mode. Private prompts embedded into a third-party vector store; one tenant's memory surfaced to another; secrets persisted in a shared cache.
Re-scan content for sensitive data before it is written to any store. Scope every read and write by tenant, and where relevant by actor. Treat "private/local-only" as covering the memory-write hop, not just inference — never seed a shared cache or embed off-box for a request marked private.
Reference control: a knowledge store with per-tenant scoping, PII re-scan on inject, and egress gating tied to the privacy flag (Membrain: KnowledgeStore, semantic cache).
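As an illustration of the memory rules above, here is a small in-memory store that re-scans on write, scopes every read and write by tenant, and refuses private writes when it is marked as off-box. It is a sketch under stated assumptions, not Membrain's KnowledgeStore: the class, the `off_box` flag, and the `redact` callable are all hypothetical.

```python
class PrivateEgressRefused(Exception):
    """A private request tried to write to an off-box or shared store."""


class TenantScopedStore:
    def __init__(self, redact, off_box=False):
        # `redact` is any callable that returns text with sensitive data removed.
        self._redact = redact
        # True when the store lives off-box or is shared across tenants.
        self._off_box = off_box
        self._rows = {}  # (tenant_id, key) -> cleaned text

    def write(self, tenant_id, key, text, private=False):
        # The privacy flag covers the memory-write hop, not just inference.
        if private and self._off_box:
            raise PrivateEgressRefused("private content may not leave the box")
        # Re-scan on write, even if the request path already scanned it.
        self._rows[(tenant_id, key)] = self._redact(text)

    def read(self, tenant_id, key):
        # The tenant is part of the lookup key, so a cross-tenant read misses.
        return self._rows.get((tenant_id, key))
```

Putting the tenant in the key, instead of filtering after the lookup, means a forgotten filter cannot surface another tenant's rows.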
Visibility — be able to prove what happened
Principle. Every AI interaction — model, tokens, cost, findings, tools, actor — leaves a tamper-evident trail, and unsanctioned AI use is discoverable.
Failure mode. No record of what data left the building; "shadow AI" tools no one approved; metrics you can't trust because the log can be rewritten.
Audit every exit path — success, fail-closed rejection, and upstream error alike. Make the trail tamper-evident (keyed hash chain + external anchor), not merely append-only. Surface shadow-AI usage by tool, endpoint, and actor. Keep raw sensitive values out of audit rows; gate any plaintext reveal behind strong auth and tenant scope.
Reference control: structured audit with an HMAC-keyed, sequence-bound chain plus shadow-AI detection (Membrain: audit service, MCP audit, shadow endpoints).
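The difference between append-only and tamper-evident is easier to see in code. The sketch below builds a keyed, sequence-bound hash chain with Python's standard `hmac` module and checks it against an externally stored anchor. The row layout, the genesis value, and the function names are assumptions for the example; it is not a description of Membrain's audit service.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder "previous MAC" for the first row


def _mac(key, seq, prev_mac, event):
    # Bind the sequence number and the previous MAC into every row.
    body = json.dumps(
        {"seq": seq, "prev": prev_mac, "event": event},
        sort_keys=True,
        separators=(",", ":"),
    ).encode("utf-8")
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def append(chain, key, event):
    """Append a JSON-serializable event; it should hold no raw sensitive values."""
    seq = len(chain)
    prev_mac = chain[-1]["mac"] if chain else GENESIS
    row = {"seq": seq, "prev": prev_mac, "event": event,
           "mac": _mac(key, seq, prev_mac, event)}
    chain.append(row)
    return row


def verify(chain, key, anchor=None):
    """Recompute every MAC; optionally compare the head to an external anchor."""
    prev_mac = GENESIS
    for index, row in enumerate(chain):
        if row["seq"] != index or row["prev"] != prev_mac:
            return False
        expected = _mac(key, index, prev_mac, row["event"])
        if not hmac.compare_digest(expected, row["mac"]):
            return False
        prev_mac = row["mac"]
    if anchor is not None:
        # Without an anchor, dropping the newest rows goes unnoticed.
        return hmac.compare_digest(prev_mac, anchor)
    return True
```

Two details carry the guarantee. The key must live somewhere a database admin cannot read, otherwise the chain can be recomputed after an edit. The anchor (the latest MAC, stored elsewhere) is what catches truncation of the tail.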
Routing — control where requests go
Principle. The destination of a request is a security decision. Privacy, residency, and cost constraints must bind the actual egress, including fallbacks.
Failure mode. A "local-only" request that reaches a cloud provider through a fallback chain, a default sentinel, or a side channel (embedder, cache, tagger).
Make privacy/residency a property of the pipeline context, not a local variable, so every egress-capable component honors it. Re-apply constraints to fallback targets, not just the primary. Fail closed (e.g., a hard 502) when no compliant route exists — never silently downgrade. "Local model" does not cover embedders, caches, or background jobs; each is its own egress.
Reference control: a router with explicit local-provider sets, fallback re-validation, and a propagated egress flag (Membrain: Router, private-egress guard).
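The routing rules above can be sketched as a loop that re-checks the privacy constraint on every candidate, fallbacks included, and fails closed when nothing compliant answers. The provider names, `RequestContext`, and the `call` callable are hypothetical. The 502 mirrors the example status in the text, and only `ConnectionError` is treated as a reason to fall back.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestContext:
    """Travels with the request so every egress-capable component can read it."""
    tenant_id: str
    private: bool = False


class NoCompliantRoute(Exception):
    status_code = 502  # surfaced to the client as a hard failure


# Hypothetical explicit allow-list of providers that stay on the box.
LOCAL_PROVIDERS = frozenset({"local-llm"})


def allowed(provider, ctx):
    return (not ctx.private) or provider in LOCAL_PROVIDERS


def route(ctx, candidates, call):
    """Try the primary, then fallbacks, in order.

    `call(provider)` performs the upstream request and may raise
    ConnectionError. Returns (provider, response).
    """
    last_error = None
    for provider in candidates:
        # Re-validate each fallback target, not just the primary.
        if not allowed(provider, ctx):
            continue
        try:
            return provider, call(provider)
        except ConnectionError as exc:
            last_error = exc
    # Never silently downgrade to a non-compliant provider.
    raise NoCompliantRoute("no compliant provider available") from last_error
```

The same `ctx` object would need to reach embedders, caches, and background jobs for the guarantee to hold; this sketch covers only the inference hop.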
Coverage — no path is exempt
Principle. A control that protects one route and not another is a false sense of security. Every surface gets the same governance, or the weakest path defines your posture.
Failure mode. The application route redacts; the transparent proxy doesn't. The OpenAI path is governed; the Anthropic path drifts. Attackers find the gap.
Route every surface through one shared enforcement entry point — avoid parallel, divergent implementations. Test the contract per-surface against the database and runtime you actually deploy, not a lighter test double. Treat "skip" and "no-op" branches as security-relevant; that is where coverage silently ends.
Reference control: a single shared pipeline consumed by all entry points, per-surface contract tests, and CI that runs against the real datastore.
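One way to picture a per-surface contract test: push the same canary through every entry point and list the surfaces where it survives. Everything in this sketch is a stand-in (the `govern` function, the three surface handlers, the canary string), and it runs in memory. A real contract test would exercise the deployed datastore and runtime, as the text above requires.

```python
CANARY = "canary-secret-12345"


def govern(payload):
    """Stand-in for the single shared enforcement entry point."""
    return payload.replace(CANARY, "[REDACTED]")


def app_route(payload):
    return govern(payload)


def transparent_proxy(payload):
    return govern(payload)


def drifted_route(payload):
    # A parallel implementation that skipped the shared pipeline.
    return payload


def ungoverned_surfaces(surfaces):
    """Return the names of surfaces that let the canary reach the wire.

    `surfaces` maps a surface name to a callable that takes a payload
    and returns what would be sent upstream.
    """
    probe = f"please summarise {CANARY}"
    return sorted(
        name for name, handler in surfaces.items()
        if CANARY in handler(probe)
    )
```

Running the check in CI over a registry of all entry points turns "no path is exempt" into a failing build when a new surface forgets the pipeline.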
Trust — identity, isolation, and integrity
Principle. Multi-tenant boundaries, operator identity, and the integrity of the controls themselves — certs, keys, audit chain, supply chain — are the foundation the other six stand on.
Failure mode. A tenant admin acting cross-tenant; an unconstrained interception CA with its key on disk; a security fix that never reaches installed clients; a forgeable audit chain. Break trust and the other six pillars are decorative.
Separate global authority from tenant authority; clamp every read and write to the caller's scope unless explicitly global. Name-constrain any interception CA and destroy its private key after use. Keep the keys behind integrity material (the audit HMAC key, signing keys) outside the database the data lives in. Make the update path itself a control: a fix that doesn't ship isn't a fix.
Reference control: a role hierarchy with global-vs-tenant scoping, a name-constrained proxy CA, an externally-keyed audit chain, and a guarded publish pipeline.
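Scope clamping, the first control above, fits in a few lines. The sketch assumes a hypothetical `Caller` record with an explicit `is_global` flag. It shows only the global-versus-tenant split; the CA, key-custody, and publish-pipeline controls are out of its reach.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Caller:
    user_id: str
    tenant_id: str
    is_global: bool = False  # global authority is explicit, never inferred


class ScopeViolation(Exception):
    """A tenant-scoped caller asked for another tenant's data."""


def effective_tenant(caller, requested_tenant=None):
    """Resolve which tenant a query may touch.

    Returns None only for a global caller that did not narrow the request,
    meaning "all tenants".
    """
    if caller.is_global:
        return requested_tenant
    if requested_tenant is not None and requested_tenant != caller.tenant_id:
        raise ScopeViolation(f"{caller.user_id} is limited to {caller.tenant_id}")
    # Clamp: an unspecified tenant becomes the caller's own, never "all".
    return caller.tenant_id


def list_rows(rows, caller, requested_tenant=None):
    tenant = effective_tenant(caller, requested_tenant)
    return [row for row in rows if tenant is None or row["tenant_id"] == tenant]
```

The design choice worth copying is the default: when a tenant admin omits the tenant, the query narrows to their own scope instead of widening to everyone's.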
How to use the framework
Two of the seven pillars are force-multipliers. Coverage (pillar 6) and Trust (pillar 7) determine whether the other five are real: a perfect redactor on one of three ingress paths is a third of a control, and an audit chain an admin can forge is none. Whatever order you audit in, start with those two.
- As an operator: treat each pillar as a checklist and find your weakest one — that, not your strongest, is your actual posture.
- As an assessor: for each control, ask "on which paths?", "what happens on error?", and "who can forge or bypass it?" A control that fails any of the three is incomplete.
- As a builder: wire detection, enforcement, memory, routing, and audit as one pipeline that shares state — the cross-cutting guarantees (PII-clean audit trails, privacy-bound routing) are properties of that architecture, not features you bolt on later.
These are guidelines, not a standard. Authority over how to run AI safely is earned by adoption, not declared — so treat this as v0.1, and tell us where it's wrong.
See the framework as running code
Membrain is the open-source reference implementation — each pillar maps to a control you can read and run yourself. Self-hosted, Apache-2.0.