If it's not coding, even with 200k context it starts to write gibberish, even when the correct information is in the context.
I tried asking questions about Path of Exile 2, and even with web search on it gave completely wrong information... not just outdated. Wrong.
I think context decay is a bigger problem than we realize.
FWIW, put a copy of the game folder in a directory and tell Claude to extract the game files and disassemble the game in preparation for questions about it.
As an example of doing this in a session with jagged alliance 3 (an rpg) https://pastes.io/jagged-all-69136
Claude extracting game archives and disassembling them leads to far more reliable results than random internet posts.
You’re having Claude design builds for you by disassembling the game? Am I understanding that right? I guess I’m thinking too small.
Yes, exactly. Claude can just go in, extract the compressed game archives, then decompile and read the game logic directly to see how everything works. E.g. you might be curious how certain stats translate into damage. Just do the above and ask Claude: "in detail explain from the decompiled code in this folder for game X how certain stats affect damage and suggest builds to maximise damage taking into account character level <10."
I've found doing this for games to be far more reliable than trying to find internet posts explaining it. I haven't played POE but if it's anything like any other RPG system Claude will do a great job at this.
This will not work for an online game like PoE 2
Or even one with DRM?
Right?
Or?
DRM just stops you launching/connecting to servers if you modified the binary. It does nothing to stop the binary being pulled apart by a bot with no intention of running it.
Where it may fail is obfuscation and server-side logic. But client-side logic, especially in a game backed by a scripting language, is generally super easy for Claude to pick apart.
Context decay is noticeable within 3 messages, nearly every time. Maybe not substantial, but definitely noticeable.
It’s led to me starting new chats with bigger and bigger ‘summary’ prompts to catch the model up while refreshing it. Surely there’s a way to automate that technique.
Yeah absolutely, at this point I also start new chats after 3-4 prompts. Especially with thinking models that produce so many tokens.
Usually things go smoothly, but sometimes I have situations like: “please add feature X, it needs to have ABCD” -> does ABC correctly but D wrong -> “here is how to fix D” -> fixes D but breaks AB -> “remember I also want AB this way, you broke it” -> fixes AB but removes C, and so on.
I've found the same thing. I build with Claude Code daily and the context decay is real: by the end of a long session it starts forgetting decisions we made earlier. The 1M context window should help, but I'm curious how coherence holds up at that scale.
What's been working for me is keeping a CLAUDE.md file in my project root with key decisions and context. The model reads it at the start of every session so I don't have to re-explain everything. Not as elegant as automated compaction but it works.
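A minimal sketch of what such a file might look like (the project details and decisions here are made up for illustration):

```markdown
# CLAUDE.md

## Key decisions
- Use SQLite for local persistence; no ORM.
- All dates are stored as UTC ISO-8601 strings.

## Conventions
- Tests live next to the code they cover (`foo.py` / `foo_test.py`).
- Never edit generated files under `build/`.
```

Short, declarative bullets like these tend to survive summarization better than long prose explanations.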
> I build with Claude Code daily and the context decay is real: by the end of a long session it starts forgetting decisions we made earlier
I generate task.md files before working on anything; some are short, others are super long with many steps. The models don't deviate anymore. One trick is to make a post-tool-use hook that shows the first open gate ("- [ ]") line from task.md on each tool call. This keeps the agent straight for hundreds of gates.
After each gate is executed we don't just check it off, we also append a few words of feedback. This makes task.md become a workbook, covering intent, plan, execution and even judgements. I see it as a programming language now. I can gate any task and the agent will do it, however many steps. It can even generate new gates, or replan itself midway.
You can enforce strict testing policies just by leaning into the programmability of gates: after each work gate have a test gate, and have judges review testing quality and propose more tests.
The task.md file is like a script or pipeline. It is also like a first-class function: it can even ingest other task.md files for regular reflection. A gate can create or modify gates or tasks. A task can create or modify gates or tasks.
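The post-tool-use hook trick above could be sketched as a small script like this. Assumptions: a `task.md` in the project root using "- [ ]" checkboxes for gates, and Claude Code configured to run this as a PostToolUse hook so its stdout is surfaced back to the agent (the exact wiring lives in your hook settings and is not shown here):

```python
#!/usr/bin/env python3
"""Hypothetical PostToolUse hook: surface the first open gate from task.md."""
import pathlib


def first_open_gate(path="task.md"):
    """Return the first unchecked '- [ ]' line, or None if every gate is closed
    or the file doesn't exist."""
    try:
        lines = pathlib.Path(path).read_text().splitlines()
    except FileNotFoundError:
        return None
    for line in lines:
        # A closed gate is '- [x]'; only the literal '- [ ]' prefix is open.
        if line.lstrip().startswith("- [ ]"):
            return line.strip()
    return None


if __name__ == "__main__":
    gate = first_open_gate()
    if gate:
        # Echo the current gate so it stays in the agent's recent context.
        print(f"NEXT GATE: {gate}")
```

Because the hook re-prints the current gate on every tool call, the active step never drifts out of the recent context window, which is presumably why it holds up over hundreds of gates.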
It could also be a skill problem. It would be more helpful if, when people made "the LLM sucks" claims, they shared their prompt.
The people I work with who complain about this type of thing communicate their ask to the LLM horribly and expect it to read their minds.
I don't really understand what you mean by this. The claim is that the same prompt with the same question produces worse results when it's run against a model that has more than 200k tokens in its context. That doesn't have much to do with the "skillfulness" of using a model.
Prompt quality does matter, but at some point context size matters too.
I’ve had things like a system that has a collection of procedural systems. I would say “replace the following set of defaults that are passed all around for system X (list of files) and in the manager (file) with a config”, and it would do that, but then I’d suddenly see it go “wait, mu and projection distance are also present in systems Y and Z. Let me replace those with a config too, with the same values”, when systems Y and Z use a different set of optimized values, and that was clearly outside the scope.
Never had that kind of mistake happen when dealing with small contexts, but with larger contexts (multiple files, long “thinking” sequences) it does happen sometimes.
There were definitely times when I thought “oh well, my bad, I should have clarified NOT to also change that other part”, all the while thinking that no human would have thought to change both.
None of what has been described is a "skill issue". The problem is when an identical prompt produces poor results once the context window exceeds 200k tokens or so.
Totally agree that "the LLM sucks" posts should be accompanied by the prompt.
I agree, but at the same time it feels like victim blaming.
Nah, it's a variant of the XY Problem: https://xyproblem.info
I don't know. Is pointing out that someone holding a drill by the chuck won't get the results they expect that bad?
Adding web search doesn't necessarily lead to better information at any context.
In my experience the model will assume the web results are the answer even if the search engine returns irrelevant garbage.
For example, you ask it a question about New Jersey law and the web results are about New York or about "many states": it'll assume the New York or "many states" info applies to New Jersey.
I think ChatGPT has a huge advantage here. They have been collecting realistic multi-turn conversational data at a much larger scale. And generally their models appear to be more coherent with larger contexts for general purpose stuff.
The question that comes to mind for me after reading your comment is how can a question about a game require that much context?
Path of Exile is complex, just check the skill tree, skills, and gems :)
It could almost be used as a benchmark for how good models are at math, memory, updated information, etc.
I feel like a few weeks ago I suddenly had a week where even after 3 messages it forgot what we did. Seems fixed now.
We need an MCP for Path of Building.
Agreed, there's no getting around the "break it into smaller contexts" problem that lies between us and generally useful AI.
It'll remain a human job for quite a while too. Separability is not a property of vector spaces, so modern AIs are not going to be good at it. Maybe we can manage something similar with simplicial complexes instead. Ideally you'd consult the large model once and say:
> show me the small contexts to use here, give me prompts re: their interfaces with their neighbors, and show me which distillations are best suited to those tasks
...and then a network of local models could handle it from there. But the providers have no incentive to go in that direction, so progress will likely be slow.
That’s not context decay, that’s training data ambiguity. So much misinformation, so many nerfs, buffs, and changes that an LLM cannot keep up given the training time required. Do it for a game that has been stable and it knows its stuff.
It didn't only give outdated info; in some cases it did, and after two tries telling it to search for updated information it got it right (shouldn't need to do that, though). But it also gave wrong information about sockets (support skills) that never existed or were never able to be socketed together in the first place. (OK, maybe in 0.1, but that's what web search is for...) If it can't even handle easily versioned information from a game, how should it handle anything related to time, dates, news, science, etc.?
Like any human would, 75% certain with 99% confidence. That’s what you fail to realize. They aren’t “god mode machine”. They are “human-mode” machines and humans make mistakes in thinking just like you do. Some might say asking a powerful LLM for gaming tips is a waste of compute power. Others might say it gives you the knowledge of a new meta emerging. Either way, you both are going to get trained.
Please don’t pop the AI bubble, bro. Stop asking questions, bro. Believe the hype, bro.
What were you asking about PoE 2? So far my _general_ experience with asking LLMs about ARPGs has been meh. Except for Diablo 2 but I think that’s just because Diablo 2 has been heavily discussed for ~25 years.
The number one thing you always need to set up is feedback loops for Claude, so it's able to shotgun-program its way to a solution.