Building a Deduplication Pipeline for Local Knowledge Graphs

The “Apple” vs “Apple Inc.” Problem

When building a local-first memory layer for AI agents, extracting knowledge triples from conversation is only half the battle. If you simply parse the LLM’s output and insert it directly into your graph, the graph will quickly rot.

In Monday’s chat, the agent might extract the entity Apple. On Tuesday, the user says “Apple Inc.”, and the agent extracts Apple Inc.. By Friday, you have apple, Apple, Apple Inc., and Apple Corporation floating in your database as four distinct, disconnected nodes. When the user asks “What do we know about Apple?”, a graph traversal starting from Apple will miss 75% of the relevant history.

Without aggressive entity disambiguation, a knowledge graph is just a bloated, fragmented log file.

To solve this in membox, we use a three-tier cascade: exact alias lookup, embedding similarity at a 0.85 threshold, then a string fallback. The whole thing runs locally on SQLite.

The Three-Tier Disambiguation Cascade

The core of the deduplication pipeline lives in the find_or_create_entity method. Before any relation is written, the subject and object entities must pass through this funnel to resolve down to a canonical entity_id.

Layer 1: The Exact Alias Registry (The Fast Path)

Every time an entity is created, its normalized string (lowercased, collapsed whitespace) is registered in an entity_aliases table. When a new extraction comes in, we first check this registry.

def normalize_name(name: str) -> str:
    """Lowercase and collapse whitespace for deterministic canonicalization."""
    return " ".join(name.strip().lower().split())

In store/entities.py, this is the first check:

    alias = normalize_name(name)
    # Layer 1: exact alias match (fast path, no lock needed)
    eid = self.find_entity_by_alias(alias)
    if eid is not None:
        self.update_entity_description(eid, description)
        return eid

Because it’s a simple string match against an indexed SQLite column, this takes microseconds and requires no locks. It immediately collapses apple and Apple into the same node.

Layer 2: Cosine Similarity (The 0.85 Threshold)

What if the string is genuinely different, like Apple Inc. vs Apple? This is where embeddings step in. If the alias lookup misses, we embed the surface name and do a linear scan against existing entities of the same type.

        # Layer 2: embedding similarity
        if embedder is not None:
            embedding = embedder.embed(name)
            eid = self.find_similar_entity(embedding, type_, threshold=0.85)
            if eid is not None:
                self.add_alias(alias, eid)
                winner = self.find_entity_by_alias(alias)
                if winner is not None:
                    eid = winner
                self.update_entity_description(eid, description)
                return eid

We found that a cosine similarity threshold of 0.85 hits the sweet spot for entity names. It’s high enough to prevent Microsoft and Apple from merging (they are semantically similar “tech companies” but mathematically distinct vectors), but low enough to catch Apple Inc. and Apple.

The key detail is self.add_alias(alias, eid). When the vector similarity layer successfully bridges Apple Inc. to the canonical Apple node, we register “apple inc” as a permanent alias for that node. The next time the LLM hallucinates “Apple Inc.”, it hits the fast-path Layer 1 alias check. The expensive embedding lookup only happens once per variation.

Layer 3: Fallback and Creation

If both the alias registry and the vector scan fail to find a match, the entity is genuinely new. We insert it into the database, generate its initial embedding, and register its first alias.

        # Layer 3: create new entity
        try:
            return self.create_entity(name, type_, description, embedding)
        except sqlite3.IntegrityError:
            # Another process created the entity (or its alias) between our
            # re-check and the INSERT. Resolve to the winner's row.
            eid = self._resolve_race_winner(name, alias)
            self.update_entity_description(eid, description)
            return eid

Notice the sqlite3.IntegrityError catch. Because multiple agents or threads might be processing documents concurrently, we rely on SQLite’s UNIQUE constraints to enforce uniqueness at the database level. If two threads race to create “Apple”, one wins, and the other safely catches the exception and resolves to the winner’s newly minted entity_id.

Bonus: Predicate Normalization

Entities are nodes, but what about edges? A graph with (Alice, develop, membox) and (Bob, developed, membox) is equally broken.

For predicates, embeddings are actually too fuzzy. Instead, we use a strict, hardcoded synonym dictionary in core/normalize.py:

_PREDICATE_CANONICAL: dict[str, str] = {
    "develop": "develops",
    "developed": "develops",
    "developing": "develops",
    "creates": "develops",
    "开发": "develops",  # Cross-lingual support!
    # ...
}

def normalize_predicate(predicate: str) -> str:
    p = " ".join(predicate.strip().lower().split())
    return _PREDICATE_CANONICAL.get(p, p)

By mapping verbs to a small canonical vocabulary, graph traversals can follow relationships regardless of what tense the LLM happened to use.

Tradeoffs and Results

This pipeline ensures the graph remains dense and highly connected. By continuously folding new surface forms (aliases) into canonical IDs, the retrieval system doesn’t have to work as hard to find related context.

The tradeoff: it needs an embedding call on cache misses. But because each hit gets cached in the alias table, the system gets faster over time as it picks up the user’s naming habits.

With deduplication in place, the next problem is getting data out of the graph. The next post walks through a pure-Python BFS over SQLite relations, skipping the graph database entirely.