Context: Why “Just Embed Everything” Stops Working
Large language models are stateless. Every conversation starts from a blank context window, and any signal the user gives the model disappears the moment the session ends. The industry answer to this is retrieval-augmented generation: embed the conversation, store the vectors, retrieve the closest ones when a new prompt arrives. For straight question-answering, this works well.
The failure mode shows up when the user’s request depends on relational context.
“What did we decide about checkout last week, and which service was on call when the latency spike happened?”
That is not a similarity query. It is a graph traversal. Pure vector RAG surfaces fragments that are semantically close but logically disconnected, and the model stitches them together confidently and incorrectly. The model “remembers” the right keywords but not the right structure.
The other extreme is a full graph database — Neo4j, Memgraph, or FalkorDB. For a personal assistant that owns one user’s history, this is over-engineering. The deployment, the running cost, the schema migrations, and the operational surface all outweigh the actual retrieval needs.
The middle ground is a small, local-first design that fits on one machine and gives the model enough structure to do real reasoning without standing up a distributed system.
Engineering Decision: SQLite + In-Memory Graph
The architecture has two layers that share a single source of truth:
- SQLite in WAL mode holds every persisted artifact: raw messages, extracted entities, and the relations between them as subject-predicate-object triples. WAL mode (write-ahead logging) keeps reads non-blocking while a write transaction is in progress, so the retrieval path never waits on the extraction path.
- An in-memory graph built from those triples sits in front of the store. The graph is rebuilt on startup and incrementally extended as new triples land. Retrieval is a small BFS from the seed entities, measured in milliseconds.
The boundary is sharp: SQLite is the source of truth, the graph is a derived cache. If the process crashes, the graph rebuilds from disk. If the graph is wrong, the disk is still right.
This design is intentionally not horizontally scalable. It is built for a single agent serving a single user, where the “scale” problem is the user’s history, not the request rate. Once that framing lands, the design space collapses dramatically.
Implementation Notes
The data model has three tables:
messages— the raw conversation, with timestamps and a session id. The body is stored as the original text plus any structured fields the model emitted.entities— the named concepts the LLM has lifted out of conversation: services, incidents, decisions, people, dates.relations— the subject-predicate-object edges between entities. The(subject_id, predicate, object_id)triple is unique, so re-extracting the same fact is idempotent.
The extraction pipeline runs after each assistant turn:
- The LLM is prompted to return a JSON list of entities and triples for the latest exchange.
- The triples are normalized: entity names are lowercased and stripped, and the model is asked to re-use existing entity ids when a referent already exists.
- Everything is written in a single SQLite transaction. If the LLM response is malformed, the transaction rolls back and the raw message is still saved, so the next extraction attempt can re-process it.
Retrieval is two-stage. A vector index — also kept in SQLite, via sqlite-vss or a flat cosine over a small embedding table — surfaces the k most similar messages. Those messages’ entities become the seed nodes for an in-memory BFS that walks the graph up to depth 3, accumulating the connected relations. The result is a small, structured context payload that is then passed to the model alongside the user’s prompt.
The whole loop, end to end, runs in a single Python process with the standard library plus sqlite3 and one embedding call.
Tradeoffs
- Memory growth is unbounded. Every conversation adds rows. For a personal assistant this is fine for years; for a fleet of agents it is not. The next step is summarization, not sharding.
- Extraction quality is the bottleneck. A wrong or missing triple silently degrades retrieval. There is no fallback to similarity search for relations that the model failed to extract. This is the single largest source of “the assistant forgot” reports in practice.
- The graph is single-process. Two agents cannot share memory without going through SQLite, which means no live graph merging. This is acceptable for the target use case and a hard limit for anything collaborative.
- Schema drift is on the LLM. The model chooses what counts as an entity and what predicates to use. A small validation pass keeps the predicates in a known set, but the entity taxonomy is whatever the model invents on any given day.
What I Would Do Next
A reference implementation in pure Python, leaning only on the standard library plus sqlite3 (which ships with Python). No ORMs, no graph libraries, no vector stores. The point is to show that the architecture fits in a few hundred lines and runs on a laptop.
After that, the obvious extensions are:
- A summarization pass that compresses old triples into higher-order facts once a session is closed, so the in-memory graph stays bounded.
- A reflection prompt that asks the model to re-read its own extracted triples periodically and flag contradictions.
- A schema-driven predicate vocabulary, so two agents using the same memory layer can interoperate.
Source code and benchmarks will follow in a follow-up post. If you have hit similar problems with agent memory, the comparison points I would be most interested in are: extraction latency, graph size at N months of usage, and retrieval precision on multi-hop questions.
