I set Fable onto a couple of intermittent bugs in my React Native app that Opus had failed to solve. It came up with novel approaches for both that squashed the bugs further up the pipeline, killing baby Hitler before he could become a problem. Then Fable came up with 3 more edge-case bugs and 4 code cleanups.
This matches my experience with other model quality leaps: its greater understanding gives it more bug-blasting firepower.
Perhaps setting a new model off on a 2-4 hour task and expecting perfect results just isn't a great test. Chunking the problem is always a better test of abilities.