Honestly, the thing that has impressed me the most about Fable is how diligent it is about testing its own changes. I think this is exactly what Simon is picking up on here: Fable is absolutely heckbent on screenshotting that darn scroll bar and will stop at NOTHING until it manages it! In my own use I was also impressed by how it proactively installed Playwright and set it up to test a FE change. The previous models treated testing more as an afterthought, which I found annoying. I always had to tell them to do it, and then sometimes I would get lazy and skip it. I've noticed Fable go to similar extremes when testing other things, like actually deploying my app to exercise new APIs. It makes the results much better. The downside is that tasks take much longer, but that doesn't matter because we were all using worktrees / remote control to do other work asynchronously, right? Right?
It feels to me like Fable is just a slightly more advanced Opus 4.8 (or 4.6?) but with this 'adversarial' self-challenging/checking of its work and more compute to really hunt down edge cases or to spin up many subagents using lesser models. That's what makes it feel like a big jump, but I think the results wouldn't be so different if you manually challenged 4.6 with enough iterations of logs, screenshots, and follow-up questions.
Yes, I had a fun experience where it kept timing out on a seemingly mundane task, and it turned out I had written the ask in a way that was impossible to test.