If Claude Mythos and Fable 5 are the same underlying models just with different safeguards, I fail to see how TerminalBench has them at different scores.

Refusals, presumably.