Golden Gate Claude, abliterated models, Deepseek's censoring of Tiananmen Square, and Grok's alternate political views all imply that these boxes are somewhat translucent, especially to leading experts like Ilya Sutskever. That Grok holds alternative views and produces NSFW dialog while ChatGPT refuses implies that additional work happens during training to align models. Getting access to the source used to train a model would let us see into that model's alignment.

It's easy enough to ask ChatGPT how to make cocaine and get a refusal, but what else is lying in wait that hasn't been discovered yet? It's hard to square the notion that these are black boxes that no one understands whatsoever with the fact that the original Llama models, which contain the same refusal, have been edited, at the level of a program, into abliterated models that happily give you a recipe. Note: I am not Pablo Escobar and cannot comment on the veracity of said recipe, only that it no longer refuses.

https://www.anthropic.com/news/golden-gate-claude

https://huggingface.co/mlabonne/Meta-Llama-3.1-8B-Instruct-a...
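The "editing at the level of a program" is roughly directional ablation: find a "refusal direction" in the residual stream by contrasting activations on refused vs. answered prompts, then project that direction out of the weights. A toy numpy sketch with made-up data (the activations and names here are hypothetical, not pulled from a real model):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Hypothetical residual-stream activations: prompts the model refuses
# vs. prompts it answers. In a real abliteration you'd collect these
# from the model itself; here they're synthetic, with the "refusal"
# set shifted along one axis.
acts_refuse = rng.normal(size=(100, d_model)) + 2.0 * np.eye(d_model)[0]
acts_comply = rng.normal(size=(100, d_model))

# Refusal direction = normalized difference of means.
r = acts_refuse.mean(axis=0) - acts_comply.mean(axis=0)
r /= np.linalg.norm(r)

# Ablate a weight matrix: W' = W - r r^T W, so nothing it writes to the
# residual stream has a component along r.
W = rng.normal(size=(d_model, d_model))
W_abl = W - np.outer(r, r) @ W

# The ablated matrix now writes ~zero along the refusal direction.
print(np.abs(r @ W_abl).max())
```

Repeat that projection across every matrix that writes into the residual stream and the model can no longer represent "refuse" along that direction, which is why the edited Llama happily answers.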