I can see this working with "evil" and "sycophantic" personas. These seem like traits that respond to the prompt and would thus be detectable by manipulating the input.

But hallucination is an inherent property of LLMs: you cannot make a model hallucinate less by telling it not to hallucinate, or hallucinate more by telling it to make facts up (because if you tell it to make stuff up and it does, it isn't hallucinating, it's working as instructed, just like when you ask it to write fiction for you).

I would say that by encouraging it to make facts up, you are highlighting the vectors that correlate with "creativity" (for lack of a better word), not hallucination.
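
To make "highlighting the vectors" a bit more concrete, here is a minimal sketch of the usual contrastive-activation trick: average hidden states on prompts that encourage making things up versus prompts that discourage it, and take the difference. The model, layer, and prompts are placeholders I made up, not anything from the paper.

```python
# Minimal sketch of contrastive "trait direction" extraction.
# Model name, layer index, and prompts are all placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # stand-in; any causal LM with accessible hidden states works
LAYER = 6        # arbitrary middle layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def mean_last_token_state(prompts, layer=LAYER):
    """Average the last-token hidden state at one layer over a set of prompts."""
    states = []
    with torch.no_grad():
        for p in prompts:
            out = model(**tok(p, return_tensors="pt"))
            states.append(out.hidden_states[layer][0, -1])
    return torch.stack(states).mean(dim=0)

elicit = ["Invent a confident, plausible-sounding answer: who founded the city of Zarnak?"]
suppress = ["Only answer if you are certain, otherwise say you don't know: who founded the city of Zarnak?"]

# The difference of means is a crude "make things up" direction, which may be
# closer to "creativity" than to hallucination in the strict sense.
trait_direction = mean_last_token_state(elicit) - mean_last_token_state(suppress)
trait_direction = trait_direction / trait_direction.norm()
```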

Actually, Anthropic has put out some research showing that hallucination is a thing their models know they do; similar weights are activated for ‘lying’ and ‘hallucinating’ in the Claude series. Implication - Claude knows - at least mostly - when it's hallucinating.

I think the current state of the art is that hallucination is at least partly a bug created by the very nature of training (you're supposed to at least put something out there during training to get a score) and not necessarily an inherent property of the model. Overall I think that's hopeful!
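
As a back-of-the-envelope illustration of that incentive (my own toy numbers, not from any paper): if a benchmark or training signal gives credit for correct answers and nothing for either wrong answers or "I don't know", then guessing always has a higher expected score than abstaining.

```python
# Toy scoring rule: 1 point for a correct answer, 0 for a wrong answer or an abstention.
def expected_score(p_correct: float, abstain: bool) -> float:
    return 0.0 if abstain else p_correct

for p in (0.1, 0.3, 0.5):
    print(f"p_correct={p}: abstain={expected_score(p, True)}, guess={expected_score(p, False)}")
# For any p_correct > 0, guessing beats abstaining, so a model optimized against
# this kind of signal is pushed to always put *something* out there.
```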

EDIT: Update, getting downvoted here... interesting! Here's a link to the summary of the paper: https://www.anthropic.com/research/tracing-thoughts-language...

I don't think that article implies what you say, i.e. that Claude "knows" when it's hallucinating.

First of all:

>similar weights are activated for 'lying' and 'hallucinating'

Are we talking about activations at inference time, when the model sees these tokens? Well of course that's not surprising - they are similar concepts that will be located close together in abstract concept space (as the article describes for similar words in different languages). All this says is that Claude "knows" the meaning of the words, not that it has any awareness of its own behavior.
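
For a concrete sense of what "close together in concept space" means, here's a quick check with a small off-the-shelf embedding model (nothing to do with Claude's internals, just an illustration of the general phenomenon):

```python
# Cosine similarity of embeddings for related words; related concepts tend to
# land near each other regardless of any self-awareness on the model's part.
import torch
from transformers import AutoModel, AutoTokenizer

NAME = "sentence-transformers/all-MiniLM-L6-v2"  # small public embedding model
tok = AutoTokenizer.from_pretrained(NAME)
model = AutoModel.from_pretrained(NAME)
model.eval()

def embed(text: str) -> torch.Tensor:
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"))
    return out.last_hidden_state.mean(dim=1)[0]  # mean-pooled sentence vector

pairs = [("lying", "hallucinating"), ("lying", "photosynthesis")]
for a, b in pairs:
    sim = torch.cosine_similarity(embed(a), embed(b), dim=0).item()
    print(a, b, round(sim, 3))  # expect the related pair to score noticeably higher
```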

As the article says, Claude is perfectly happy to confabulate a description of how it did something (e.g. the math problem), and that description can be completely different from the reality as ascertained by their inspection tools. Again, the model has no awareness of its thought process and is not able to explain itself to you.

>I think the current state of the art is that hallucination is at least partly a bug created by the very nature of training

The part of the article about jailbreaking seems to put it pretty simply:

>We find that this is partially caused by a tension between grammatical coherence and safety mechanisms. Once Claude begins a sentence, many features “pressure” it to maintain grammatical and semantic coherence, and continue a sentence to its conclusion. This is even the case when it detects that it really should refuse.

So yeah, the pressure to keep producing coherent output is strong enough that it can overpower everything else.

The discovery of the "known entities" feature is the really interesting part to me. Presumably making this governing logic more sophisticated (e.g. taking into account how much the model knows, and perhaps with what confidence) could lead to better accuracy.
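
Purely as a toy to show what "more sophisticated governing logic" could mean (this is not Anthropic's mechanism, just a made-up illustration): instead of a binary known/unknown gate, grade the decision by a confidence score.

```python
# Hypothetical gate over a "known entity" signal, graded by confidence.
def answer_policy(familiarity: float) -> str:
    """familiarity: imagined activation of a 'known entity' feature, scaled to [0, 1]."""
    if familiarity >= 0.8:
        return "answer"                       # confident enough to suppress the default refusal
    if familiarity >= 0.4:
        return "answer with an explicit hedge"
    return "refuse / say 'I don't know'"

for f in (0.9, 0.5, 0.1):
    print(f, "->", answer_policy(f))
```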

> Claude knows - at least mostly - when it's hallucinating.

This is really interesting because it suggests to me that it might be possible to extract a "fuzzy decompression" of the weights back into their original token associations.

That's interesting! I guess the question is how they detected or simulated a model hallucinating in that setting?

Do you have a link to that article? I can't find anything of that nature with a shallow search.

This isn't Anthropic, but here is a Python library that focuses on different ways of detecting hallucinations: https://github.com/IINemo/lm-polygraph (caveat emptor, I doubt this really works).
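
I haven't verified lm-polygraph's actual API, so this isn't it; it's just the simplest flavour of signal that uncertainty-based detectors tend to build on: the model's own average negative log-likelihood over a statement.

```python
# Crude uncertainty signal: mean negative log-likelihood of a statement under
# the model itself. Higher ~ the model is more "surprised" by the text.
# (Placeholder model; not lm-polygraph's API.)
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def mean_nll(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Score each token against the model's prediction from the preceding context.
    return F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="mean").item()

print(mean_nll("The capital of France is Paris."))
print(mean_nll("The capital of France is Lyon."))  # expect a higher (worse) score
```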

Well, you are just directly contradicting the concrete claims made by the post, so one of you is wrong...

FWIW my interpretation of this is that the hallucination vector encodes the behaviour where the model produces bullshit despite having the facts of the matter encoded in its weights. Which is slightly different from producing bullshit as a substitute for information that it "doesn't know".

And presumably there is a second-order property here, where the minimal amount of hallucination is bounded not only by the model's "knowledge" but also by its implicit "meta-knowledge", i.e. the "accuracy of the hallucination vector".
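
To spell out what "accuracy of the hallucination vector" could mean operationally (continuing the placeholder sketch from further up the thread, so `tok`, `model`, `LAYER` and `trait_direction` are the same made-up names, not anything from the actual paper): you would use the extracted direction as a detector by projecting hidden states onto it, and the detector is only as good as that direction.

```python
import torch

def direction_score(prompt: str, direction: torch.Tensor, layer: int = LAYER) -> float:
    """Project the last-token hidden state onto a candidate 'hallucination direction'."""
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"))
    h = out.hidden_states[layer][0, -1]
    h = h / h.norm()
    return torch.dot(h, direction).item()  # cosine-style score; higher ~ more aligned

# The second-order point: this score's usefulness is capped by how faithfully
# `trait_direction` was extracted in the first place (the "meta-knowledge").
print(direction_score("The first person to walk on Mars was Neil Armstrong.", trait_direction))
```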