An enduring, confounding quality of LLMs is that even minor differences in prompt content and style, harness type, and environment can lead to radical differences in output and in perceived performance and ability. In my environment and in my "style", Fable has been a huge step up, to the extent that I am seriously considering paying for a second $200/month account just to get more usage out of the next 10 days. I'm also starting to prepare my organization for what I now see as the completely inevitable end of human-written code.
All that said, considering Anthropic's heavy-handed nerfing, I'm not surprised Fable did poorly in a security-focussed benchmark. And this benchmark seems poor anyway - penalising a model for "cheating" by knowing the answer from its training data? That's not the model's fault; that's a lazy benchmark.