One thing I've been wrestling with while building persistent agents is memory quality. Most frameworks treat memory as a vector store — everything goes in, nothing gets resolved. Over time the agent ends up recalling contradictory facts with equal confidence.
The architecture we landed on: ingest goes through a certainty scoring layer before storage. Contradictions get flagged rather than silently stacked. Memories that get recalled frequently get promoted; stale ones fade.
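To make the flow concrete, here's a minimal sketch of that ingest path — score certainty at write time, flag contradictions instead of silently stacking them, and count recalls so frequently-used memories can be promoted. All names and the contradiction check are illustrative assumptions, not our actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    text: str
    certainty: float                 # 0..1 score assigned at ingest time
    recall_count: int = 0
    conflicts_with: list = field(default_factory=list)

class MemoryStore:
    def __init__(self, contradiction_check):
        self.items: list[Memory] = []
        self.contradicts = contradiction_check  # callable(a, b) -> bool

    def ingest(self, text: str, certainty: float) -> Memory:
        mem = Memory(text, certainty)
        # Flag contradictions and cross-link, rather than overwriting.
        for existing in self.items:
            if self.contradicts(mem.text, existing.text):
                mem.conflicts_with.append(existing)
                existing.conflicts_with.append(mem)
        self.items.append(mem)
        return mem

    def recall(self, query_filter) -> list[Memory]:
        hits = [m for m in self.items if query_filter(m)]
        for m in hits:
            m.recall_count += 1  # frequently recalled memories get promoted
        return hits
```

The contradiction check here is just a callable so you can plug in anything from a keyword heuristic to an NLI model.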
It's early but the difference in agent coherence over long sessions is noticeable. Happy to share more if anyone's going down this path.
Interesting. I’ve been playing with something similar, at the coding-agent harness message-sequence level (memory, I guess). I’m looking at human-driven UX for compaction and for resolving/pruning dead ends.
Human-driven compaction is interesting — you sidestep the "what's worth keeping" problem by putting a person in the loop. The tradeoff I've hit is that agents running autonomously need it to happen automatically or coherence degrades fast between sessions.
For pruning we landed on a last-touched timestamp + recall frequency counter per memory. Things not accessed in N sessions that were weakly formed to begin with get soft-deleted. Human review before hard delete is probably better UX if your setup allows it.
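As a rough illustration of that rule (thresholds and field names are assumptions, not what we actually ship):

```python
from dataclasses import dataclass

@dataclass
class MemoryRecord:
    last_touched_session: int   # session index when last recalled
    recall_count: int
    initial_certainty: float    # how strongly the memory was formed
    soft_deleted: bool = False

def prune(records, current_session, max_idle_sessions=5, weak_threshold=0.4):
    """Soft-delete memories idle for N sessions that were weakly formed."""
    for r in records:
        idle = current_session - r.last_touched_session
        weakly_formed = r.initial_certainty < weak_threshold and r.recall_count == 0
        if idle >= max_idle_sessions and weakly_formed:
            r.soft_deleted = True   # kept around for human review before hard delete
    return [r for r in records if not r.soft_deleted]
```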
Curious what "dead ends" look like in yours: conversational chains that didn't resolve, or factual ones?
> The tradeoff I've hit is that agents running autonomously need it to happen automatically or coherence degrades fast between sessions.
Yeah, that makes total sense. I wonder (and am sure the labs are doing so) whether the HITL output would be good for fine-tuning the models used to do it autonomously?
I’m sticking with humans for the moment because I’m not sure where the boundaries lie: what actually makes it better and what makes it worse. It’s non-obvious so far.
Pruning “loops” has been pretty effective though — where a model gets stuck over N turns checking the same thing over and over and not breaking out of it until way later. That has been good because it gives strong context-size benefits, but it's also the most automatable, I think.
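A toy version of that loop pruning might look like this — collapse any run of N+ turns with the same signature down to one turn plus a note. How you compute the signature (exact match, tool-call hash, embedding) is the real design question; this just uses identity:

```python
def prune_loops(turns, key=lambda t: t, min_run=3):
    """Collapse runs of >= min_run turns with identical signatures."""
    pruned, i = [], 0
    while i < len(turns):
        j = i
        # Extend j to the end of the run of identical signatures.
        while j < len(turns) and key(turns[j]) == key(turns[i]):
            j += 1
        run = j - i
        if run >= min_run:
            pruned.append(turns[i])
            pruned.append(f"[pruned {run - 1} repeated turns]")
        else:
            pruned.extend(turns[i:j])
        i = j
    return pruned
```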
Pruning factually incorrect turns is something I’m trying, and pruning “correct” but “not correct based on my style” as well. Building a dataset of it all is fun :)
Sounds interesting, would like to learn more about this.
How do you implement the scoring layer, and when and how is it invoked?
The scoring layer sits between ingestion and storage. Incoming items get evaluated on a few axes: source reliability (did the agent observe this directly or was it told?), semantic distance from existing memories, and recency weighting for time-sensitive facts.
Contradiction detection runs as a separate step: we embed the incoming memory, similarity-search against existing ones, and score the pair for logical consistency. If it trips a threshold, it gets stored with a conflict flag and a link to the contradicting memory rather than silently overwriting.
The agent sees both during retrieval and reasons about which to trust in context. Sounds like overhead but it's fast — the scoring is a simple feedforward pass, not another LLM call.
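A hand-rolled toy of those two steps, to make the shape concrete — the weights, decay constant, and consistency function here are placeholder assumptions standing in for the learned feedforward scorer:

```python
import math

def certainty_score(observed_directly: bool, semantic_distance: float,
                    age_days: float, half_life_days: float = 30.0) -> float:
    """Combine the three axes (source, recency, familiarity) into one 0..1 score."""
    source = 1.0 if observed_directly else 0.6                    # direct observation trusted more
    recency = math.exp(-age_days * math.log(2) / half_life_days)  # decay for time-sensitive facts
    familiarity = 1.0 - min(max(semantic_distance, 0.0), 1.0)     # distance from existing memories
    return round(0.5 * source + 0.3 * recency + 0.2 * familiarity, 3)

def contradiction_step(new_mem, neighbors, consistency, threshold=0.5):
    """Flag-and-link instead of overwrite: returns (conflict_flag, linked_memories)."""
    links = [n for n in neighbors if consistency(new_mem, n) < threshold]
    return (len(links) > 0, links)
```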
Thanks for that. I'm new to the applied AI / ML world.
What's your stack and infra setup? Mainly Python, AWS, Databricks?
Certainty scoring sounds useful, but fwiw the harder problem is temporal: a fact that was true yesterday might be wrong today, and your agent has no way to know which version to trust without some kind of causal ordering on the writes.
You're right, and it's the part that keeps me up. We handle it with versioned writes — each memory has a createdAt, observedAt, and a validUntil that can be set explicitly or inferred from context. Temporal scope gets embedded as metadata: "as of last session" vs "persistent fact."
Causal ordering is harder. Right now we surface both conflicting versions during retrieval with timestamps and let the agent reason about which is authoritative. It's not a complete solution — the agent can still pick wrong without the right reasoning context.
What you're describing is architecturally the right answer. We haven't built proper write-ordering yet. That's probably where the next cycle goes.
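For anyone following along, the versioned-write scheme above might look roughly like this — the field names come straight from the comment, but the retrieval logic and everything else is an assumed sketch:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VersionedMemory:
    key: str                            # what the fact is about
    value: str
    createdAt: float                    # write time (unix seconds)
    observedAt: float                   # when the fact was actually observed
    validUntil: Optional[float] = None  # explicit or inferred expiry

def retrieve(memories, key, now):
    """Return all non-expired versions for a key, newest observation first,
    so the agent can see conflicting versions with their timestamps."""
    live = [m for m in memories if m.key == key
            and (m.validUntil is None or m.validUntil > now)]
    return sorted(live, key=lambda m: m.observedAt, reverse=True)
```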