I don't think that article implies what you say, i.e. that Claude "knows" when it's hallucinating.
First of all:
>similar weights are activated for 'lying' and 'hallucinating'
Are we talking about activations at inference time, when the model sees these tokens? Well of course that's not surprising - they are related concepts that will sit close together in abstract concept space (as the article describes for equivalent words in different languages). All this says is that Claude "knows" the meaning of the words, not that it has any awareness of its own behavior.
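To make "close together in concept space" concrete, here's a rough sketch using an off-the-shelf open embedding model (all-MiniLM-L6-v2, purely as a stand-in - this is obviously not Claude's internal features): related words simply end up with high cosine similarity, which says nothing about self-awareness.

```python
# Rough illustration of "close together in concept space" using a small
# open embedding model as a stand-in (NOT Claude's internal features).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
words = ["lying", "hallucinating", "photosynthesis"]
emb = model.encode(words)

# Semantically related words score high against each other; the unrelated
# one doesn't. That's all "similar weights are activated" needs to mean.
print("lying vs hallucinating: ", util.cos_sim(emb[0], emb[1]).item())
print("lying vs photosynthesis:", util.cos_sim(emb[0], emb[2]).item())
```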
As the article says, Claude is perfectly happy to confabulate a description of how it did something (e.g. the math problem) that is completely different from what actually happened, as revealed by their inspection tools. Again, the model has no awareness of its own thought process and cannot explain itself to you.
>I think the current state of the art is that hallucination is at least partly a bug created by the very nature of training
The part of the article about jailbreaking seems to put it pretty simply:
>We find that this is partially caused by a tension between grammatical coherence and safety mechanisms. Once Claude begins a sentence, many features “pressure” it to maintain grammatical and semantic coherence, and continue a sentence to its conclusion. This is even the case when it detects that it really should refuse.
So yeah, the drive to keep producing coherent output is strong enough to overpower everything else, including the impulse to refuse.
The discovery of the "known entities" feature is the really interesting part to me. Presumably making this gating logic more sophisticated (e.g. tracking how much the model actually knows about an entity, and with what confidence) could lead to better accuracy.
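Purely as a sketch of what I mean by "more sophisticated gating logic" (the scores, thresholds, and function names here are invented for illustration, not anything from the paper): the refusal default only gets inhibited when the "I recognize this entity" signal is strong, and the answer gets hedged when recognition is real but the specific fact is shaky.

```python
# Toy sketch of confidence-gated answering. Scores and thresholds are
# invented for illustration -- not how Claude actually implements this.
from dataclasses import dataclass

@dataclass
class EntityKnowledge:
    name: str
    known_entity_score: float   # hypothetical "I recognize this" feature
    fact_confidence: float      # hypothetical confidence in the draft answer

def respond(entity: EntityKnowledge, draft_answer: str) -> str:
    # Default behavior: decline. Only a strong "known entity" signal
    # inhibits the refusal, mirroring the circuit described in the article.
    if entity.known_entity_score < 0.5:
        return f"I don't have reliable information about {entity.name}."
    # The extra sophistication: hedge when recognition is genuine but the
    # specific fact is uncertain, instead of confidently confabulating.
    if entity.fact_confidence < 0.7:
        return f"I believe {draft_answer}, but I'm not certain."
    return draft_answer

# Using the article's examples: a real, well-known person vs. a made-up one.
print(respond(EntityKnowledge("Michael Jordan", 0.95, 0.90),
              "Michael Jordan played basketball."))
print(respond(EntityKnowledge("Michael Batkin", 0.10, 0.00),
              "Michael Batkin is a chess player."))
```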