Skip to content

Building a Local-First Memory Layer for AI Agents

Updated: at 08:00 AM

Context: Why “Just Embed Everything” Stops Working

Large language models are stateless. Every conversation starts from a blank context window, and any signal the user gives the model disappears the moment the session ends. The industry answer to this is retrieval-augmented generation: embed the conversation, store the vectors, retrieve the closest ones when a new prompt arrives. For straight question-answering, this works well.

The failure mode shows up when the user’s request depends on relational context.

“What did we decide about checkout last week, and which service was on call when the latency spike happened?”

That is not a similarity query. It is a graph traversal. Pure vector RAG surfaces fragments that are semantically close but logically disconnected, and the model stitches them together confidently and incorrectly. The model “remembers” the right keywords but not the right structure.

The other extreme is a full graph database — Neo4j, Memgraph, or FalkorDB. For a personal assistant that owns one user’s history, this is over-engineering. The deployment, the running cost, the schema migrations, and the operational surface all outweigh the actual retrieval needs.

The middle ground is a small, local-first design that fits on one machine and gives the model enough structure to do real reasoning without standing up a distributed system.

Engineering Decision: SQLite + In-Memory Graph

The architecture has two layers that share a single source of truth:

  1. SQLite in WAL mode holds every persisted artifact: raw messages, extracted entities, and the relations between them as subject-predicate-object triples. WAL mode (write-ahead logging) keeps reads non-blocking while a write transaction is in progress, so the retrieval path never waits on the extraction path.
  2. An in-memory graph built from those triples sits in front of the store. The graph is rebuilt on startup and incrementally extended as new triples land. Retrieval is a small BFS from the seed entities, measured in milliseconds.

The boundary is sharp: SQLite is the source of truth, the graph is a derived cache. If the process crashes, the graph rebuilds from disk. If the graph is wrong, the disk is still right.

This design is intentionally not horizontally scalable. It is built for a single agent serving a single user, where the “scale” problem is the user’s history, not the request rate. Once that framing lands, the design space collapses dramatically.

Implementation Notes

The data model has three tables:

The extraction pipeline runs after each assistant turn:

  1. The LLM is prompted to return a JSON list of entities and triples for the latest exchange.
  2. The triples are normalized: entity names are lowercased and stripped, and the model is asked to re-use existing entity ids when a referent already exists.
  3. Everything is written in a single SQLite transaction. If the LLM response is malformed, the transaction rolls back and the raw message is still saved, so the next extraction attempt can re-process it.

Retrieval is two-stage. A vector index — also kept in SQLite, via sqlite-vss or a flat cosine over a small embedding table — surfaces the k most similar messages. Those messages’ entities become the seed nodes for an in-memory BFS that walks the graph up to depth 3, accumulating the connected relations. The result is a small, structured context payload that is then passed to the model alongside the user’s prompt.

The whole loop, end to end, runs in a single Python process with the standard library plus sqlite3 and one embedding call.

Tradeoffs

What I Would Do Next

A reference implementation in pure Python, leaning only on the standard library plus sqlite3 (which ships with Python). No ORMs, no graph libraries, no vector stores. The point is to show that the architecture fits in a few hundred lines and runs on a laptop.

After that, the obvious extensions are:

Source code and benchmarks will follow in a follow-up post. If you have hit similar problems with agent memory, the comparison points I would be most interested in are: extraction latency, graph size at N months of usage, and retrieval precision on multi-hop questions.