I agree with the addition at the end -- I think this is a model limitation, not a harness bug. I've seen recent Claudes act confused about who they are when deep in context, like accidentally slipping into the voice of the authors of a paper they're summarizing, without any quotes or indication that it's a paraphrase ("We find..."), or amusingly referring to "my laptop" (as in, Claude's laptop).

I've also seen it with older or more...chaotic? models. Older Claude got confused about who suggested an idea later in the chat. Gemini once put a question 'from me' in the middle of its response and went on to answer it, and another time decided to answer a factual social-science question in the form of an imaginary news story, dateline and everything. It's a tiny bit like it forgets its grounding and goes base-model-y.

Something that might add to the challenge: models are now also expected to produce user-like messages to subagents. They've always been expected to switch personas to some extent, but now even within a coding session, "always write like an assistant, never a user" is no longer a heuristic that always holds.

None of this is specific to role-switching (as opposed to other mistakes), but I also sometimes notice them 1) catching mistakes with "-- wait, that won't work", even mid-tool-call, and 2) torquing a sentence around to maintain continuity after saying something wrong (amusingly blaming "the OOM killer's cousin" for a process dying, probably after outputting "the OOM killer" and then recognizing it had been ruled out).

Especially when thinking's off, they can sometimes start with a wrong answer and then talk their way around to the right one, without ever quite acknowledging the initial answer as wrong -- instead finessing the correction as a 'well, technically' or a refinement.

Anyhow, there are subtleties, but I wonder about giving these things a "restart sentence/line" mechanism. It'd make the '-- wait,' and doomed-tool-call situations more graceful, and provide a 'face-saving' out after a reply starts off wrong. (It also potentially creates a sort of backdoor thinking mechanism in the middle of non-thinking replies, but maybe that's a feature.) Of course, we'd also need to get it to recognize "wait, I'm the assistant, not the user" for it to help here!
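For concreteness, here's a toy sketch of what harness-side handling of such a mechanism could look like -- the `<|restart|>` marker and everything else here is made up, not any real model or API feature. The idea: when the marker shows up in the raw stream, the harness trims the visible text back to the last sentence boundary and lets decoding continue, keeping the discarded fragment around (which is where the "backdoor thinking" angle comes in).

```python
import re

# Hypothetical restart marker -- purely illustrative.
RESTART = "<|restart|>"
# Sentence-ish boundaries: ./!/? followed by whitespace, or a newline.
BOUNDARY = re.compile(r"[.!?]\s+|\n")

def apply_restarts(raw: str) -> tuple[str, list[str]]:
    """Collapse a raw stream containing restart markers.

    Each marker discards the in-progress sentence (everything after the
    last sentence boundary), then decoding resumes. Returns the visible
    text plus the discarded fragments, which a harness could keep as a
    kind of hidden scratch space.
    """
    chunks = raw.split(RESTART)
    visible = chunks[0]
    discarded = []
    for chunk in chunks[1:]:
        boundaries = list(BOUNDARY.finditer(visible))
        cut = boundaries[-1].end() if boundaries else 0
        discarded.append(visible[cut:])
        visible = visible[:cut] + chunk
    return visible, discarded

# Toy example of a model walking back the OOM-killer claim mid-sentence:
text, dropped = apply_restarts(
    "First sentence. The culprit is the OOM kil"
    + RESTART
    + "The culprit is something else."
)
assert text == "First sentence. The culprit is something else."
assert dropped == ["The culprit is the OOM kil"]
```

The real version would of course operate on a token stream rather than a finished string, and deciding where a "sentence" starts is fuzzier than a regex -- but the trim-and-resume shape is the whole idea.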