Interesting.

I tried Fable vs Codex 5.5 xhigh on three different cases.

1. A resource leak with unknown cause. Both of them zeroed in on the same potential issue and proposed almost identical patches. Fable missed an edge case that Codex handled correctly.

2. Review of a SPICE model. Models had different comments, none substantial. Both missed important issues that were simulated inadequately. Clearly a valley where they are undertrained.

3. An open research problem in CS, presented as a codebase with documentation and performance metrics over datasets. Both were spinning their wheels. That could certainly mean the whole approach had run its course, but older models were not able to identify the previous round of improvement either.

I liked the prose coming out of Fable more: it was almost as if Obama were giving tech speeches. By actual solution metrics, however, they both land in the same place, with the caveat that we didn't have more time with Fable to compare further.

To me it feels like they're basically tweaking these things around the edges. I'm not seeing any difference in capability, just preference. This has been the case for a while.

That makes sense; it's seemed to me for a while now that the competing product is the harness, not the model itself.

Most people thought Fable had more 'taste' than Opus; there was certainly a better quality of writing that felt more 'smart human' and not 'stochastic parrot stringing sentences together'.

I think that Obama-esque, GMAT essay format is the AI flavor that turns me off AI-written articles. It used to be good writing, but because AI locked onto it as such, it's become the watermark of AI-generated content.

Oh boy, people are really going to lean into avoiding proper grammar now.

It can only last as long as it takes for AI to figure out how to chase the latest authenticity signal.

>2. Review of a SPICE model. Models had different comments, none substantial. Both missed important issues that were simulated inadequately. Clearly a valley where they are undertrained.

When a model misses things, there is always the possibility that it has the capability to identify the issues but it is misevaluating the level of analysis that you want it to do. Fine-tuning will have it targeting a balance of subjective opinions about what is appropriate. To go beyond broad demographic guessing, the model really needs to 'get to know you' to know what it means when you specifically request an action. Without that information about you, it has to weigh your words against the level of sophistication it expects a standard user to be able to express.

Maybe you mean that an expert will use more specific language, which in turn triggers the model to give a response that more closely matches the "expert distribution".

Anthropic published a study showing that Claude does more work for the expert user, and experts have a higher rate of "successful sessions" than novices.

https://www.anthropic.com/research/claude-code-expertise

> has the capability to identify the issues but it is misevaluating the level of analysis that you want it to do.

I guess OP should have told it more explicitly to “find all errors without missing anything.”

> Thinking. I know this user well, they don't actually want me to find all errors.

> Thinking.. But I found a smoking gun of an error with this SPICE model, maybe I should inform the user.

> Thinking... Hm, but again, I know this human well, they likely don't care about this error. That's absolutely right - it's not an assistant's job to decide this, it's the user's.

Well, if you want it to go off and try to validate the SPICE simulator and the kernel of the operating system that it's running on, then that might be an approach to use.

Did you use their native harnesses, or a generic one?

Native for both.