Give a gateway long-term memory — a knowledge base it can search and inject into future prompts — and you quietly change the threat model. A normal request is transient: it is scanned, it is answered, it is gone. A remembered fact is not. Once something is stored and made searchable, it is recalled into many later prompts, for many later users, and the model treats it as trusted context. Memory amplifies: one write becomes many reads. So the question "what is allowed to enter the memory?" deserves at least as much governance as "what is allowed to leave the building?"
This is the Memory pillar from our framework for operating LLMs safely, taken seriously. There we stated the principle: stored context is an attack surface and a data-residency obligation, not a free convenience. This post is the architecture underneath it: the concrete choices that decide whether a memory layer is trustworthy.
Auto-extraction sharpens the asymmetry. If the gateway mines conversations for durable facts and stores them on its own, then any text that flows through — a user's offhand claim, a poisoned tool description, a planted instruction — can become a standing entry that is silently injected into everyone's future prompts. That is a memory-poisoning surface, and it is created by convenience, not malice. The fix is not to abandon auto-extraction. It is to put a human between capture and recall, and to make the surrounding machinery behave.
A memory layer is only as trustworthy as its weakest of four guarantees: what enters it is reviewed, what you reject stays rejected, what is read is scoped to the reader, and what is injected is idempotent. Miss any one and the other three leak.
Four design choices
Pending by default — review before recall
Principle. Auto-extracted memory enters a pending tier and is invisible to search and injection until it is approved — by a reviewer, or by an explicit policy that says this tenant trusts auto-capture.
Failure mode. Writing straight into the searchable set means the first time anyone learns a bad fact was stored is when the model repeats it back. By then it has been recalled into who-knows-how-many prompts. Review-after-the-fact is cleanup; review-before-recall is a control.
Make the mode a per-tenant choice, not a global default. Surface mode auto-approves but keeps every entry visible and reversible — low friction, for teams that trust their own traffic. Strict mode holds extractions in the queue until a human promotes them — for tenants where a wrong remembered fact is expensive. The capture moment is the cheapest place to put a person; spend it there.
Reference control: the existing KnowledgeEntry.status pending tier, a dashboard review queue (promote / reject), and a knowledge_approval_mode toggle. Search and injection read only approved entries (Membrain: KnowledgeStore).
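To make the gate concrete, here is a minimal sketch of a pending tier with a per-tenant approval mode. It is an in-memory stand-in, not Membrain's KnowledgeStore: the names MemoryStore, Entry, add_extracted, promote, and search are hypothetical, and the substring match stands in for whatever real search the store uses.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"


class ApprovalMode(Enum):
    SURFACE = "surface"  # auto-approve, but keep entries visible and reversible
    STRICT = "strict"    # hold extractions until a human promotes them


@dataclass
class Entry:
    content: str
    project: str
    status: Status


@dataclass
class MemoryStore:
    """Hypothetical in-memory store, for illustration only."""
    mode_by_project: dict[str, ApprovalMode] = field(default_factory=dict)
    entries: list[Entry] = field(default_factory=list)

    def add_extracted(self, content: str, project: str) -> Entry:
        # The mode is looked up per tenant; tenants without a setting get surface mode.
        mode = self.mode_by_project.get(project, ApprovalMode.SURFACE)
        status = Status.APPROVED if mode is ApprovalMode.SURFACE else Status.PENDING
        entry = Entry(content, project, status)
        self.entries.append(entry)
        return entry

    def promote(self, entry: Entry) -> None:
        # A reviewer's decision moves the entry into the searchable set.
        entry.status = Status.APPROVED

    def search(self, query: str, project: str) -> list[Entry]:
        # Recall reads approved entries only; pending ones never reach a prompt.
        return [
            e for e in self.entries
            if e.project == project
            and e.status is Status.APPROVED
            and query.lower() in e.content.lower()
        ]
```

In strict mode an extracted fact exists in the store but cannot be recalled until promote is called, which is the whole point: the review happens before the first read, not after.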
Rejections must stick — durable tombstones
Principle. Rejecting a fact records a durable decision, not just a deletion. The same content re-extracted later is suppressed before it ever reaches the queue again.
Failure mode. If "reject" only deletes the row, the next conversation that mentions the same thing re-creates it, and the reviewer is on a treadmill — declining the same entry forever. A reviewer who is ignored stops reviewing.
Key the decision to the content, not the row: a hash of the normalized text, scoped to the tenant. Honor real IS NULL semantics: in SQL, NULL = NULL is not true, so a global (tenant-less) rejection has to be matched with a null-safe comparison, or it will never be found and can be confused with a tenant-scoped one. Re-extraction checks the tombstone and skips silently.
Reference control: a knowledge_tombstones table keyed by (content_hash, project), checked in the extraction path (Membrain: KnowledgeStore.reject / is_tombstoned).
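A rough sketch of the tombstone check, using SQLite from the Python standard library so it runs anywhere. The table and column names follow the reference control above; everything else is an assumption: the normalization (case-folding plus whitespace collapsing), the SHA-256 hash, the helper names, and the choice to match each scope exactly rather than letting a global tombstone suppress every tenant. SQLite's IS operator is null-safe; on Postgres the equivalent comparison is IS NOT DISTINCT FROM.

```python
import hashlib
import sqlite3
from typing import Optional


def content_hash(text: str) -> str:
    # Assumed normalization: case-fold and collapse whitespace before hashing.
    normalized = " ".join(text.casefold().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # project is nullable: NULL marks a global (tenant-less) rejection.
    conn.execute(
        "CREATE TABLE knowledge_tombstones ("
        "content_hash TEXT NOT NULL, project TEXT)"
    )
    return conn


def is_tombstoned(conn: sqlite3.Connection, text: str, project: Optional[str]) -> bool:
    # "project IS ?" is true for NULL IS NULL, unlike "project = ?".
    row = conn.execute(
        "SELECT 1 FROM knowledge_tombstones "
        "WHERE content_hash = ? AND project IS ?",
        (content_hash(text), project),
    ).fetchone()
    return row is not None


def reject(conn: sqlite3.Connection, text: str, project: Optional[str]) -> None:
    # Record the decision once per (content, scope); repeat rejections are no-ops.
    if not is_tombstoned(conn, text, project):
        conn.execute(
            "INSERT INTO knowledge_tombstones VALUES (?, ?)",
            (content_hash(text), project),
        )
```

The extraction path would call is_tombstoned before queuing a candidate and drop it on a match, so the reviewer never sees the same rejected text twice.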
Reads are scoped — and review actions are clamped
Principle. Search and injection return only approved entries belonging to the caller's tenant. And the review actions themselves — promote, reject — cannot reach across tenant boundaries.
Failure mode. The obvious leak is one tenant's memory surfacing in another's prompt. The subtler one is a reviewer in tenant A approving a pending entry that belongs to tenant B — a write-side cross-tenant action that a read-side filter never sees. Both have to be closed.
Scope every read by tenant and status together. Clamp every mutation to the entry's own tenant: a promote or reject that targets an entry outside the caller's scope is refused, not silently applied. This is the Trust pillar applied to memory — global authority and tenant authority are different things.
Reference control: approved-and-project-scoped reads and project-clamped promote / reject (cross-tenant → refused), verified against a real Postgres in CI, not a mock.
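Both halves of this control fit in a few lines. The sketch below is illustrative only: ScopedStore, CrossTenantError, and the method signatures are hypothetical, and a real implementation would push the same predicates into SQL. It shows reads filtered by tenant and status together, and mutations that raise instead of silently applying when the target entry belongs to another tenant.

```python
from dataclasses import dataclass


class CrossTenantError(PermissionError):
    """Raised when a review action targets an entry outside the caller's project."""


@dataclass
class Entry:
    id: int
    project: str
    content: str
    status: str = "pending"


class ScopedStore:
    """Hypothetical store, for illustration only."""

    def __init__(self, entries):
        self._entries = {e.id: e for e in entries}

    def search(self, caller_project: str, query: str) -> list:
        # Tenant and status are checked together on every read.
        return [
            e for e in self._entries.values()
            if e.project == caller_project
            and e.status == "approved"
            and query.lower() in e.content.lower()
        ]

    def _clamped(self, caller_project: str, entry_id: int) -> Entry:
        entry = self._entries.get(entry_id)
        # Missing and foreign entries are refused the same way, so the caller
        # learns nothing about what exists in other tenants.
        if entry is None or entry.project != caller_project:
            raise CrossTenantError(f"entry {entry_id} is outside project scope")
        return entry

    def promote(self, caller_project: str, entry_id: int) -> None:
        self._clamped(caller_project, entry_id).status = "approved"

    def reject(self, caller_project: str, entry_id: int) -> None:
        del self._entries[self._clamped(caller_project, entry_id).id]
```

Routing every mutation through one clamp helper means a new review action cannot forget the check; the refusal is loud, which makes the cross-tenant case easy to assert in tests.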
Injection is idempotent — one managed block
Principle. Approved context is injected as a single managed block that is replaced, not appended, on every turn. The conversation carries one block of organizational context regardless of how long it runs.
Failure mode. Naive injection appends the relevant facts to each message. Over a long conversation the same context stacks up turn after turn: tokens balloon, cost climbs, and the model re-reads duplicate context that dilutes its attention and can crowd out the actual task. Worse, stale copies linger after the underlying fact changes.
Wrap injected context in a single, recognizable managed block. Before injecting, strip any prior managed block, then write the current one. The transform is idempotent: injecting twice yields the same result as once. The model sees current context exactly once.
Reference control: a single managed <organizational_context> block, strip-then-replace on each injection (Membrain: knowledge middleware).
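Strip-then-replace is short enough to show in full. This is a sketch under assumptions, not the Membrain middleware: the function name inject is hypothetical, the block is assumed to live at the top of the system prompt, and the facts are rendered as a plain bullet list. Only the <organizational_context> tag comes from the reference control above.

```python
import re

# Matches a previously injected block plus the blank lines that follow it.
_MANAGED_BLOCK = re.compile(
    r"<organizational_context>.*?</organizational_context>\n*",
    re.DOTALL,
)


def inject(system_prompt: str, facts: list) -> str:
    # Step 1: remove any block left over from an earlier turn.
    stripped = _MANAGED_BLOCK.sub("", system_prompt)
    if not facts:
        return stripped
    # Step 2: write exactly one block containing the current facts.
    body = "\n".join(f"- {fact}" for fact in facts)
    return (
        f"<organizational_context>\n{body}\n</organizational_context>\n\n"
        f"{stripped}"
    )
```

Calling inject on its own output with the same facts returns the same string, and calling it with updated facts leaves no stale copy behind. A production version would also need to decide what happens when user-supplied text contains the tag itself; this sketch does not handle that.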
The trusted-write exception
Not every write is untrusted. When a developer calls POST /v1/knowledge or /ingest, that is a deliberate, authenticated act — the human is already in the loop, by definition. Those writes are stored approved directly. The distinction is provenance: facts the system mined for itself are pending by default; facts a person deliberately submitted are trusted.
Trusted does not mean unscanned. Explicit writes still run through PII detection before they are stored, because the goal of the Memory pillar is twofold — keep bad facts out of recall, and keep sensitive values out of a store that will later be searched and injected. A deliberate write can still carry a secret the author didn't notice. Provenance decides the review path; detection runs either way.
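The split between provenance and detection can be sketched as one write path with two independent decisions. Everything here is hypothetical: the Provenance enum, the prepare_write helper, and especially the detector, which is a toy email regex standing in for real PII detection. Redacting the match is also an assumption; the store could just as well refuse the write. The sketch ignores the per-tenant approval mode for simplicity.

```python
import re
from enum import Enum


class Provenance(Enum):
    EXTRACTED = "extracted"  # mined by the system from conversations
    EXPLICIT = "explicit"    # deliberately submitted by an authenticated caller


# Toy detector: a single email pattern, nowhere near real PII coverage.
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact_pii(text: str) -> str:
    return _EMAIL.sub("[REDACTED_EMAIL]", text)


def prepare_write(text: str, provenance: Provenance) -> dict:
    # Detection runs on every write, whatever its source.
    cleaned = redact_pii(text)
    # Provenance only decides the review path.
    status = "approved" if provenance is Provenance.EXPLICIT else "pending"
    return {"content": cleaned, "status": status}
```

Keeping the two decisions on separate lines makes it hard to accidentally couple them, for example by skipping the scan on the trusted path.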
What this doesn't solve yet
This is early, and honesty serves the reader better than a clean story:
- Reviewer fatigue is real. Surface mode is the default precisely because strict mode adds friction many teams won't sustain. Strict mode is the right answer for high-sensitivity tenants; it is not free, and pretending otherwise would just get it switched off.
- Tombstones suppress identical content, by hash. A semantically equivalent re-phrasing of a rejected fact can still re-enter the queue and get re-reviewed. Catching paraphrase-level duplicates is a harder, later problem.
- Review is per-entry. Beyond the surface/strict toggle there is no policy language yet — no "auto-approve facts from this source, always hold facts matching this shape." That belongs in the design; it isn't built.
None of these change the shape of the argument. A memory layer is a control surface, not a convenience feature. Treat what your AI is allowed to remember with the same seriousness you treat what it is allowed to send — review what enters, make rejections stick, scope what is read, and inject it once.
See the memory layer as running code
Membrain is the open-source reference implementation — the review queue, tombstones, scoped reads, and idempotent injection are controls you can read and run yourself. Self-hosted, Apache-2.0.