I'm surprised to see this getting so much positive reception. In my experience AI is still really bad at documenting the exact steps it took, even more so when those steps depend on its environment, and once there's a human in the loop at any point you can throw the whole idea out the window. The AI will just hallucinate intermediate steps you may or may not have taken unless you spell out every step you took in exact detail.

People in general seem super obsessed with AI context, bordering on psychosis. Even setting aside obvious examples like Gas Town or OpenClaw or that tweet I saw the other day of someone putting their agents in scrum meetings (lol?), this is exactly the kind of vague LLM "half-truth" documentation that will cascade into errors down the line. In my experience, AI works best when the ONLY thing it has access to is GROUND TRUTH HUMAN VERIFIED documentation (and a bunch of shell tools obviously).

Nevertheless it'll be interesting to see how this turns out, prompt injection vectors and all. Hope this doesn't have an admin API key in the frontend like Moltbook.

That can happen if the history got compacted away in a long session. But AI agents usually also have a way to re-read the entire log from disk. E.g. Claude Code stores all user messages, LLM messages, thinking traces, tool calls etc. in JSON files that the agent can query, and in my experience it does that very well. The AI might not reach for those logs unless asked directly, though. I can see that it could be more proactive, but this is certainly not some fundamental AI limitation.
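
For illustration, here's a rough sketch of what re-reading one of those transcripts could look like. The ~/.claude/projects path, the JSONL layout and the "type"/"message" field names are from memory and may differ between versions, so treat them as assumptions rather than a spec:

    # Sketch: dump the entries of a Claude Code session transcript (JSONL,
    # one JSON object per line). Field names are assumptions.
    import json
    from pathlib import Path

    def dump_session(transcript: Path) -> None:
        for line in transcript.read_text().splitlines():
            entry = json.loads(line)
            kind = entry.get("type", "?")        # e.g. "user", "assistant"
            msg = entry.get("message", {})
            content = msg.get("content", "")
            # Tool calls and thinking traces tend to appear as structured blocks
            if isinstance(content, list):
                content = " ".join(
                    block.get("text", block.get("name", ""))
                    for block in content
                    if isinstance(block, dict)
                )
            print(f"[{kind}] {str(content)[:120]}")

    # Usage (hypothetical session file):
    # dump_session(Path.home() / ".claude/projects/my-repo/<session-id>.jsonl")

The point is just that the full history is sitting on disk as plain data, so an agent (or you) can always reconstruct what actually happened instead of relying on what survives compaction.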

[deleted]

I have a completely different experience. Which models are you talking about? I have no trouble at all with AI documenting the steps it took. I use Codex (GPT-5.4) and Claude Code (Opus 4.6) daily. When needed, they have no issue describing what steps they took and what problems came up during the run, documenting all of that as a SKILL, then reusing it and fixing the instructions on further feedback.

I use mainly Opus 4.6.

I did the same thing and created a skill for summarizing a troubleshooting conversation. It works decently, as long as my own input in the troubleshooting is minimal, i.e. running with --dangerously-skip-permissions. As soon as I need to take manual steps, or especially if the conversation is in Desktop/Web, it degrades very quickly and just assumes steps I've taken (e.g. if it gave me two options to fix something and I come back saying it's fixed, the summary will just kind of arbitrarily decide which solution I used). It also generally doesn't consider the previous state of the system (e.g. what was already installed/configured/set up) when writing such a summary, which maybe makes it somewhat reusable for me, but certainly not for others.

Now you could say, "these are all things you can prompt away", and, I mean, to an extent, probably. But once you're talking about taking something like this online, you're not working with the top 1% of proompters. The average Claude session is not the diligent little worker bee you'd want it to be. These models are still, at their core, chaos goblins. I think Moltbook showed that quite clearly.

I think having your model consider someone else's "fix" to your problem as a primary source is bad. Period. Maybe it won't be bad in 3 generations when models can distinguish noise and nonsense from useful information, but they really can't right now.

Isn't what you've just described, the part about the web, just the context bloat problem?

I'm not sure I get quite the same experience as you with the "assumes steps it never took" part. Do you think it's because of the skills you've used?

I also disagree that having at least some solution to a similar problem is inherently bad. If we're talking about skills, it usually directs the LLM toward a path that has been verified.

The steps they say they took and the steps they actually took are not the same thing.