Hacker News

saagarjha 12 hours ago [ - ]

I have found that a lot of the techniques used to decensor models (as far as I can tell, they basically get all their weights to say no turned off) also make them really stupid. Like, sure, it will help you rob a bank, but if you ask whether you should rob the bank it will go "The positives: … The negatives: … My take: You should ABSOLUTELY rob the bank".

pmarreck 9 hours ago [ - ]

The abliteration in particular that Heretic does apparently results in a best-in-class lack of "stupefying" the underlying model. You haven't read its claims, apparently.

lxgr 10 hours ago [ - ]

I wonder if this is due to abliteration actually "damaging" the model, or just an artifact of the model never having been properly trained on "forbidden" topics (as it's enough for them to recognize them, and there's no point in dedicating neurons to something that will never be exercised anyway).

zozbot234 10 hours ago [ - ]

Modern abliteration is quite good at not damaging the model on ordinary topics. But yes, on many of the weirdest "forbidden" topics (excluding the mild stuff like ordinary erotica) there's not going to be any real training of any sort and it's basically hallucinations running wild. You even see this claim repeated explicitly on every model release "safety card": 'no, this model does not have the sort of fiddly tacit know-how it would need to actually advise anyone nefarious on this dangerous stuff'.