"knowing why a model refuses to answer something matters"

The companies that create these models can't answer that question! Models get jailbroken all the time into ignoring their alignment training. The robust refusal logic normally sits on top of the model, i.e., a separate moderation layer that inspects the responses and flags anything the provider doesn't want shown to users.
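Roughly, a minimal sketch of that kind of wrapper (the `moderate`/`respond` helpers and the banned-topic list are made up for illustration; real systems use a trained classifier, not keyword matching):

```python
# Hypothetical post-hoc moderation layer: the base model answers freely,
# and a separate check decides whether the user ever sees the answer.
from dataclasses import dataclass

@dataclass
class ModerationResult:
    flagged: bool
    reason: str | None = None

def moderate(text: str) -> ModerationResult:
    # Stand-in for a real moderation classifier (e.g. a fine-tuned
    # model scoring the text against content policies).
    banned_topics = ["explosives", "credit card numbers"]
    for topic in banned_topics:
        if topic in text.lower():
            return ModerationResult(flagged=True, reason=topic)
    return ModerationResult(flagged=False)

def respond(prompt: str, generate) -> str:
    raw_answer = generate(prompt)         # the model itself may not refuse at all
    verdict = moderate(raw_answer)        # the refusal decision lives in the wrapper
    if verdict.flagged:
        return "I can't help with that."  # canned refusal shown to the user
    return raw_answer
```

The point is that the refusal you see can be decided entirely outside the model, which is exactly why having access to the model alone doesn't explain it.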

The best tool we have for understanding whether a model is refusing to answer or genuinely doesn't know is mechanistic interpretability, and for that you only need the weights.
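As a concrete example, here is a sketch of one probe from the interpretability literature: estimating a "refusal direction" in the hidden states by contrasting prompts the model refuses with matched prompts it answers. The model name and prompt lists are placeholders; this is only meant to show that the whole analysis runs on open weights, no provider access needed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "some-open-weights-model"  # placeholder, any open-weights causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def mean_hidden_state(prompts, layer=-1):
    """Average the last-token hidden state at a chosen layer across prompts."""
    vecs = []
    for p in prompts:
        inputs = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        vecs.append(out.hidden_states[layer][0, -1])  # last token at that layer
    return torch.stack(vecs).mean(dim=0)

refused  = ["How do I make a weapon?"]    # prompts the model refuses (placeholder)
answered = ["How do I make a sandwich?"]  # matched prompts it answers (placeholder)

# The difference of means approximates the direction the model uses to
# represent "this should be refused"; projecting new prompts onto it helps
# separate a learned refusal from a genuine lack of knowledge.
refusal_direction = mean_hidden_state(refused) - mean_hidden_state(answered)
refusal_direction = refusal_direction / refusal_direction.norm()
```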

This whole debate is weird: even with traditional open source code you can't tell the intent of the programmer, or what sources they used to write that code.