Measuring AI Memory: Fusing BM25 with Graph Traversal

In the last post, we covered how SQLite’s WAL mode and a Python RLock let multiple agents write to the knowledge graph concurrently and safely. The data is in now, and the next question is: when an agent tries to recall a detail, how do we find the right one?

You might think we already solved this. We stored the memory as a knowledge graph, so just follow the edges with BFS and we’re done, right?

That was the biggest hole we fell into early in membox.

Why pure graph traversal isn’t enough (the long-text blind spot)

Pure graph traversal assumes the extraction model perfectly translates every sentence in a document into subject-predicate-object triples. It doesn’t. For long technical passages, large code samples, and continuous context, the LLM either extracts them crudely or collapses them into one vague sentence like (Agent, discussed, SQL Schema). When the user then asks “what are the exact fields in the users table in that SQL schema,” following that edge gets you a single sentence. The original text (the evidence chunk) buried somewhere in the long document gets truncated because its weight on the graph is too low.

Why not just use vector retrieval (RAG)? (losing structure)

If long-document matching is hard, why not drop the graph and fall back to traditional vector RAG (chunking + embedding + cosine similarity)?

Because pure vector retrieval loses the logical chain. If the user asks “what did company A’s CEO say about project B yesterday,” we first need the graph edge A_company -> CEO, then follow CEO -> discussed -> project B, then read the text attached to that edge. Vector retrieval is good at matching literal meaning, but it gets pulled off course by documents that share surface words (“CEO,” “project”) with no logical connection.

We needed a fusion: keep the graph’s logical depth, but use term-frequency-based hard matching to recover the text that gets buried.

The combined score: graph + vector + full-text

In membox’s architecture (Phase 7.5), we brought in SQLite’s built-in full-text search extension, FTS5, and fused its BM25 ranking with the graph BFS. The combined scoring formula we settled on:

score(t) = decay^hops(t) * ( α·sim(t) + (1-α)·bm25(t) )

hops(t): BFS hops. More hops means a heavier decay penalty.
sim(t): vector cosine similarity, normalized to 0-1. When the embedder isn’t available offline, this degrades to 0.
bm25(t): BM25 score from SQLite FTS5 over the original long text (evidence) attached to the triple, negated and normalized.

In code, the scoring stays compact. The FTS5 query in core/retrieval.py:

# Get normalized BM25 scores
def _bm25_scores_for_relations(self, relation_ids: list[int], query: str) -> dict[int, float]:
    conn = self._cm.connection()
    table_name, match_expr = _fts_table_and_query(conn, query)
    placeholders = ",".join("?" * len(relation_ids))

    # SQLite bm25() returns negative numbers: a smaller value means more relevant, so negate it in SQL
    rows = conn.execute(
        f"SELECT re.relation_id, -bm25({table_name}) AS score "
        f"FROM {table_name} "
        f"JOIN relation_evidence re ON re.doc_id = {table_name}.rowid "
        f"WHERE {table_name} MATCH ? "
        f"  AND re.relation_id IN ({placeholders})",
        [match_expr, *relation_ids],
    ).fetchall()

    # Then min-max normalize the raw scores to [0, 1] in Python ...

Dynamic token budget and knapsack truncation

However precise the scoring, the model’s context window is finite. For small local models it can be 2048 to 8192 tokens, so we can’t just shovel every high-scoring relation and long text at the model.

We need truncation at retrieval time. Plain top-K truncation is unreliable because chunks vary in length. We use estimated dynamic token budgeting plus a greedy knapsack.

First, we wrote a deterministic, fast token estimator for offline use, so we don’t have to call tiktoken:

# est_tokens(s) = CJK_char_count(s) + ceil(non_CJK_char_count(s) / 4)

Then we fill in multiple greedy passes. We split a 2000-token budget across two pools: a graph-triple pool (Pass 1) and an FTS raw-text pool (Pass 2), plus a backfill pass (Pass 3):

def compact_output(self, scored: list[dict], budget: int) -> str:
    # Greedy knapsack: by descending score, admit an item while the remaining budget allows
    remaining = budget
    admitted_rids = []

    for item in scored:
        line = f"{item['subject']}: {item['predicate']} {item['object']}"
        cost = est_tokens(line)
        if cost <= remaining:
            admitted_rids.append(item['relation_id'])
            remaining -= cost

    # Then allocate the remaining budget to the long evidence snippets bound to high-scoring triples
    # Finally, force-append a real-usage footnote (e.g. "returned 45/120 triples, ~1980/2000 tokens")

Measured recall: 92.3%

On a sampled eval set pulled from membox’s real session logs (the Handoffs corpus): 26 hard questions covering single-hop, multi-hop, and time-sensitive updates. The offline numbers from our CI monitoring:

Pure graph-extraction baseline: ~53.8% hit rate. Long text and CJK keywords got dropped because the entities were extracted too “cleanly.”
After adding BM25 fusion and the token budget: hit rate rises to 88.5%.
After adding a trigram tokenizer side-table tuned for CJK: hit rate settles at 92.3% (24/26). All multi-hop queries recovered their logical chains.

Tradeoffs and wrap-up

Fusing the graph structure with SQLite FTS5 brought retrieval recall up to production levels, with no external vector database and no microservice dependencies. The whole thing runs in-process and finishes in seconds.

The knowledge graph now stores concurrent extractions cleanly and recalls them accurately when an agent needs them. But over an agent’s long lifetime, memory can’t just pile up forever. Human brains forget, compress, and turn short-term experience into long-term conviction.

The next post covers the core of membox: the AI memory lifecycle (the AI Memory Funnel), how a memory goes from trace to unit, then crystallizes and updates itself.