It's a funny detail to skim past, but what's more surprising is that mechanistic interpretability and alignment science have much better tools and research than the goblin blog post suggests, including from OpenAI's own alignment team:
- https://alignment.openai.com/argo/ (finding what the reward models are actually encouraging)
- https://alignment.openai.com/sae-latent-attribution/ (which model features drive specific behaviours — presumably great for goblin hunts)
- https://alignment.openai.com/helpful-assistant-features/ (how a high-level misaligned persona shows up when fine-tuning on bad advice)
It's weird that the goblin post doesn't seem to draw upon these tools.
Anthropic's recent emotions paper shows how broad these functional emotions are, even finding specific emotions firing before the model cheats (!): https://transformer-circuits.pub/2026/emotions/index.html
I hope their alignment researchers aren't too annoyed by the goblin post; it seems oddly siloed!