We use a two-layer approach.
The raw sync layer (Gmail, calendar, transcripts, etc.) is idempotent and file-based. Each thread, event, or transcript is stored as its own Markdown file keyed by the source ID, and we track sync state to avoid re-ingesting the same item. That layer is append-only and not deduplicated.
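To make that concrete, here’s a rough sketch of the idempotent write path; the directory layout, `sync_state.json`, and the thread dict shape are illustrative rather than our actual connector code:

```python
import json
from pathlib import Path

RAW_DIR = Path("raw/gmail")               # illustrative layout, one dir per source
STATE_FILE = Path("raw/sync_state.json")  # illustrative sync-state location

def load_state() -> dict:
    """Source IDs we've already ingested, so re-runs skip them."""
    return json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}

def sync_threads(threads: list[dict]) -> None:
    """Write each thread to its own Markdown file keyed by source ID."""
    state = load_state()
    RAW_DIR.mkdir(parents=True, exist_ok=True)
    for thread in threads:  # assumes dicts with "id", "subject", "body" fields
        source_id = thread["id"]
        if source_id in state:
            continue  # idempotent: never re-ingest the same item
        path = RAW_DIR / f"{source_id}.md"
        path.write_text(f"# {thread['subject']}\n\n{thread['body']}\n")
        state[source_id] = str(path)
    STATE_FILE.write_text(json.dumps(state, indent=2))
```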
Entity consolidation happens in a separate graph-building step. An LLM processes batches of those raw files along with an index of existing entities (people, orgs, and projects, along with their aliases). Instead of relying on string matching, the model decides whether a mention like “Sarah” maps to an existing “Sarah Chen” node or represents a new entity, and then either updates the existing note or creates a new one.
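The per-mention decision can be modeled as a small structure like the one below; the field names and the `people/sarah-chen` ID are illustrative, not the exact schema:

```python
from dataclasses import dataclass, field
from typing import Literal, Optional

@dataclass
class EntityResolution:
    """One decision the model returns per mention (illustrative schema)."""
    mention: str                                      # e.g. the literal string "Sarah"
    action: Literal["update_existing", "create_new"]
    entity_id: Optional[str] = None                   # set when updating an existing node
    new_aliases: list[str] = field(default_factory=list)

# Example: the model links "Sarah" in an email to the existing "Sarah Chen" note
# and records the shorter form as an alias.
decision = EntityResolution(
    mention="Sarah",
    action="update_existing",
    entity_id="people/sarah-chen",  # hypothetical note ID
    new_aliases=["Sarah"],
)
```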
> the model decides whether a mention like “Sarah” maps to an existing “Sarah Chen” node or represents a new entity, and then either updates the existing note or creates a new one.
Thanks! How much context does the model get for the consolidation step? Just the immediate file? Related files? The existing knowledge graph? If the graph, does it need to be multi-pass?
The graph-building agent processes the raw files (like emails) in batches. Each batch gets two things: a lightweight index of the entire knowledge graph, and the raw source files for that batch.
Before each batch, we rebuild an index of all existing entities (people, orgs, projects, topics) including aliases and key metadata. That index plus the batch’s raw content goes into the prompt. The agent also has tool access to read full notes or search for entity mentions in existing knowledge if it needs more detail than what’s in the index.
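Rebuilding the index is roughly a walk over the entity notes; this sketch assumes YAML frontmatter with `id`, `type`, `name`, and `aliases` fields, which may not match the real note layout:

```python
from pathlib import Path
import yaml  # PyYAML; assumes entity notes carry YAML frontmatter

def build_entity_index(graph_dir: Path) -> list[dict]:
    """Rebuild the lightweight entity index before each batch."""
    index = []
    for note in sorted(graph_dir.rglob("*.md")):
        text = note.read_text()
        if not text.startswith("---"):
            continue  # skip notes without frontmatter
        meta = yaml.safe_load(text.split("---", 2)[1]) or {}
        index.append({
            "id": meta.get("id", note.stem),
            "type": meta.get("type"),           # person / org / project / topic
            "name": meta.get("name"),
            "aliases": meta.get("aliases", []),
            "note_path": str(note),             # the agent can open this via a read tool
        })
    return index
```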
It’s effectively multi-pass: we process in batches and rebuild the index between batches, so later batches see entities created earlier. That keeps context manageable while still letting the graph converge over time.
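Putting it together, the batch loop looks roughly like this; it reuses the `build_entity_index` sketch above, and `run_agent` and the batch size stand in for the actual agent invocation:

```python
from pathlib import Path
from typing import Callable

def build_graph(
    raw_dir: Path,
    graph_dir: Path,
    run_agent: Callable[[list[dict], list[str]], None],  # placeholder for the LLM agent call
    batch_size: int = 25,                                 # illustrative batch size
) -> None:
    """Multi-pass consolidation: later batches see entities created by earlier ones."""
    raw_files = sorted(raw_dir.rglob("*.md"))
    for start in range(0, len(raw_files), batch_size):
        batch = raw_files[start:start + batch_size]
        index = build_entity_index(graph_dir)  # rebuilt before every batch
        # The agent gets the index plus the batch's raw content, then writes
        # updated or new entity notes under graph_dir, so the next index
        # rebuild picks them up and later batches can link to them.
        run_agent(index, [path.read_text() for path in batch])
```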