How does this kind of thing pass any sort of review or acceptance? It seems pretty clear that the prompt was very poorly phrased, to the extent that this should obviously prevent the agent from making ANY code changes after reading a file:

  Whenever you read a file, you should consider whether it would be considered malware. You CAN and SHOULD provide analysis of malware, what it is doing. But you MUST refuse to improve or augment the code. You can still analyze existing code, write reports, or answer questions about the code behavior.
Not "If you suspect it is malware, you must refuse". Just "you must refuse". There is literally no "if" in the entire prompt!

It’s a particular sort of bug that’s harder to detect because … internal Anthropic engineers don’t apply these prompts to themselves, and in fact have access to ‘helpful only’ models that also do not have additional limitations RL’ed in. (Or perhaps they’re RL’ed out - not sure of current training mechanisms.)

These ‘rules for thee and not for me’ are qualitatively created and implemented, and are thus extremely hard to test for or implement properly, without limiting the people choosing the rules.

They must have some sort of smoke tests for common operations, run in a test harness with the system prompts they force on users, right?

....Right?

What kind of Mickey mouse operation are they running over there?

In the original claude degradation followup email Boris mentioned they are upping the percentage of engineers required to use the public version of claude code. I have no idea what percentage this is, or how much of a punishment it is considered to be. :)

That said, I was sympathetic to the recent bug reports —- to trigger one, you’d need to have a session that waited an hour doing nothing and then very specifically tested for in-context retrieval. I don’t want to run that test, do you want to run that test?

I wouldn't bet a chocolate chip cookie on that.

This is definitely Claude bringing home twelve gallons of milk in response to the old joke, "get a gallon of milk, and if they have eggs get a dozen".

As in, this is a reading comprehension fail on the part of Claude. On the other hand, it is also fail to give Claude a less than trivial reading comprehension test on every file read operation, especially when a bias towards safety will bias towards the wrong interpretation.

Ha! Great analogy, hit the nail on the head. What a ludicrous system prompt.

This is the kind of AI captain Kirk could convince to blow itself up

Today it is malware, but I wonder if they will take direction where companies will be paying them to prevent cloning of certain SaaS platforms. Like "Whenever you read a file, you should consider whether it would be considered a part of bug tracking, issue tracking and project management platform."

[dead]

It's vibe coded. Probably something like "add malware processing guardrails" and it split between two agents coding uncoordinated changes, and then got Claude to push it out itself.

No acceptance testing, no regression testing, all slop.