I tried making something similar a while ago, and the main problem was that long-term memory made it easy to push the AI into a bad state where it either overfixated on something (context poisoning) or refused to talk to me at all. In the end I added a command that wipes all memory, and I ended up using it constantly.

Maybe I was doing it wrong. The question is: how do you prevent the AI from falling into a corrupted state it can't recover from?

I use a two-step generation process that avoids both memory explosion in the context window and the one-turn-behind problem.

When a user sends a message I: generate an embedding of the user message -> pull in semantically similar memories -> filter and rank them -> send an API call with the memories that were 'pinned' on the last turn plus the top 10 memories just surfaced. That first API call's job is to intelligently pick out the memories that are actually worthwhile and 'pin' them until the next turn -> then do the main LLM call with an up-to-date, thinned-down list of memories.
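Roughly, the turn loop looks like the sketch below. This is a minimal illustration of the flow described above, not the actual mira-OSS code; `embed`, `vector_search`, `call_llm`, and the prompt builders are hypothetical placeholders for whatever embedding model, vector store, and chat API you use.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    id: str
    text: str
    score: float = 0.0

@dataclass
class Session:
    pinned: list[Memory] = field(default_factory=list)  # carried over from the last turn

# --- placeholders: swap in your real embedding model, vector DB, and LLM API ---
def embed(text: str) -> list[float]: return [0.0]
def vector_search(query_vec: list[float], k: int) -> list[Memory]: return []
def call_llm(model: str, prompt: str) -> str: return ""
def build_selection_prompt(msg: str, memories: list[Memory]) -> str: return msg
def build_main_prompt(msg: str, memories: list[Memory]) -> str: return msg
# -------------------------------------------------------------------------------

def handle_turn(session: Session, user_msg: str) -> str:
    # 1. Embed the incoming user message.
    query_vec = embed(user_msg)

    # 2. Pull semantically similar memories from the store.
    candidates = vector_search(query_vec, k=50)

    # 3. Filter/rank, keeping only the top 10 surfaced memories.
    surfaced = sorted(candidates, key=lambda m: m.score, reverse=True)[:10]

    # 4. First call: a cheap "analysis" model picks the genuinely worthwhile
    #    memories from last turn's pins plus the newly surfaced ones, and
    #    those stay pinned until the next turn.
    pool = session.pinned + surfaced
    kept_ids = set(call_llm("analysis-model",
                            build_selection_prompt(user_msg, pool)).split())
    session.pinned = [m for m in pool if m.id in kept_ids]

    # 5. Main call: generate the reply with the thinned, up-to-date memory list.
    return call_llm("main-model", build_main_prompt(user_msg, session.pinned))
```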

Reading the actual prompt the analysis model uses is probably easier than parsing my abstract description: https://github.com/taylorsatula/mira-OSS/blob/main/config/pr...

I can't say with confidence that this is ~why~ I don't run into the model getting super flustered and crashing out, though I'm familiar with what you're describing.