OK, explain one thing to me: I have a benchmark - I feed an identical prompt to multiple models. Codex produces a rough but working program. Fable produces the same - but with more bugs than Codex. Opus produces something similar to Codex but with a critical bug.

That describes all my tests with Fable.

Why should I be hyped about all that "legitimate power" if the model performs on par with two other SoTAs?

I mean, well, yes, it is impressive. It can quickly generate a lot of garbage which sorta does look like code. The other two can do the same. I don't see any groundbreaking improvement - but the price is much higher. Why the hype?

>> Why should I be hyped about all that "legitimate power" if the model performs on par with two other SoTAs?

I don't care if you're hyped or not. You asked if posts like the OP come from a "parallel reality" and I said no and described my experience. If you're getting good/better results with Codex than with Fable, you should probably continue using that, since it's cheaper and faster.

But can you bring anything measurable in support of your words? I did.