Markdown Is Sovereign, Postgres Is a Cache: Engineering a Rebuildable Agent-Memory Substrate
The Governing Principle
For a long-lived autonomous agent, memory is the substrate everything else runs on, and silent epistemic drift — a true fact quietly superseded by a false one — is the failure mode that matters most. Our design starts from one principle that resolves most of the hard questions by itself: Markdown is the source of truth; Postgres is a derived, rebuildable cache.
Durability lives in Git. The canonical store is a few hundred Markdown topic files plus a small set of always-loaded core blocks — a constitution, an operator profile, a capped durable-facts file, and a volatile follow-ups file — all version-controlled. Everything in the database is reconstructable from that vault. The index is therefore disposable *for durability* but availability-critical *at runtime*: a cold rebuild on CPU is a multi-tens-of-minutes blackout, so recovery is snapshot-restore, not rebuild. That single distinction — durable in Git, available via Postgres — drives the rest of the architecture.
The Read Path
Retrieval is hybrid. A BM25 lexical query over an inverted index runs alongside dense cosine similarity over an HNSW vector index, and the two result sets are fused by Reciprocal Rank Fusion. Fusion is deliberately chosen over score-blending because BM25 and cosine scores are not on comparable scales; rank fusion sidesteps that entirely. After fusion, a light provenance rerank blends three signals: semantic similarity, a machine-derived confidence, and a recency decay with a ninety-day half-life. The net effect is that a recent, high-provenance fact outranks a stale or low-trust one even when raw similarity is close. Retrieval latency sits in the low tens of milliseconds at the median.
The Trust Model Is the Spine
The interesting engineering is not the retriever; it is the discipline around *what is allowed to become a memory*.
Metadata is machine-written, never model-assigned. Confidence (0–100) is computed mechanically from the source class of a fact — where it came from determines how much it is trusted — and is never set by a language model. Staleness uses a bitemporal schema: a superseding fact stamps the old one's validity end-date rather than deleting it, so history is preserved and supersession is auditable. The motivation was empirical and damning: of several hundred curated topic files, exactly *one* carried a machine-readable time field, while dozens carried hand-written "corrected/superseded" prose. Hand-maintained metadata decays to noise; it has to be derived.
The write path is an append-only ledger. Structured writes land in an event log drained every thirty seconds by a *single* materializer using atomic temp-and-rename writes, lock-coordinated claiming, one synthetic commit per batch, and path-traversal guards. One writer, one commit path — no races.
The public surface is untrusted by default. Writes carry a trust class. Operator and terminal writes are trusted; anything ingested from an external or public interface is not, lands in a quarantine namespace, and can never auto-promote without operator review. Truth maintenance is propose-only: the reconciliation pass proposes supersessions into a review queue rather than committing inline, and the language model inside that pass is treated as an injection sink — every model-facing read is sanitised and framed as untrusted data. The result is a hard guarantee: a poisoned contradiction can never silently retire a true fact.
We were deliberate about what we *refused* to build, too. We borrowed bitemporal supersede-don't-delete, sleep-time consolidation, and the episodic/semantic split from the current agent-memory literature. We declined full graph-RAG (it costs many times the tokens for our query mix), reinforcement-learned memory policies, and any inline LLM write gate. The sharpest refusal was earned: a separate epistemic-fact database — the frontier's headline recommendation, complete with a refresh queue and dynamic access control — had been built and then deleted as unwired dead weight, precisely because it lived in a database the Markdown could not see. *Don't track facts where the source of truth can't see them.*
"Your Eval Was Measuring the Wrong Thing"
The headline result is a cautionary tale about evaluation. The adversarial retrieval harness was reporting recall@5 of 0.073 — a roughly 7% success rate that looked like a broken retriever. It was not. The eval seeder had sampled its question/answer ground truth only from a few directories and *never from the topic files where the bulk of the curated facts actually live*. So the harness was scoring retrieval as a failure while the retriever was returning the correct topic file as result number one — the ground-truth pointer simply named a different file. The eval was measuring a mismatch, not retrieval quality.
Rebuilding the golden set from the right source — deactivating the misaligned pairs and seeding fresh, topic-aligned, category-balanced ones — moved the numbers on the identical harness from recall@5 0.073 → 0.629 (8.6×), recall@10 0.091 → 0.743, and nDCG@5 0.018 → 0.394. Nothing about the retriever changed. We fixed the question, not the answer, and the regression gate went from theatre to a real instrument.
The lesson generalises past memory systems: a retrieval metric is only as honest as the ground truth it samples from, and a confidently low score is just as likely to indict your eval as your system. Before you tune the model, make sure the test is pointed at the right corpus.