Things I learned from this:

- Fable will do a whole lot more than you might expect in order to verify a fix. I learned that it's "relentlessly proactive". That's a good title for a blog entry!

- You can take screenshots of a window in macOS using the "screencapture" CLI command, but you'll need the integer window ID first.

- That window ID is accessible via "Quartz.CGWindowListCopyWindowInfo(Quartz.kCGWindowListOptionOnScreenOnly, Quartz.kCGNullWindowID)" using the pyobjc-framework-Quartz library, which installs cleanly via "uv run".

- A neat trick for simulating keyboard shortcuts is to run document.dispatchEvent(new KeyboardEvent("keydown", {key: "/", bubbles: true})); after the page loads.

- You don't need Flask or Starlette to run a CORS-enabled localhost server for capturing JSON from another window - 19 lines of code against the Python standard library http.server package works just fine.

- getComputedStyle(document.querySelector("navigation-search").shadowRoot.querySelector("textarea")) works to read dimensions from inside a Web Component's shadow DOM.

- Running "defaults write com.google.chrome.for.testing AppleShowScrollBars Always" forces scrollbars to always be visible in Chrome for Testing on macOS.

- Claude Fable knows how to apply all of the above. It's always interesting to pick up hints of what a model can and cannot do.
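
To make the localhost server point above concrete, here is a minimal sketch of a CORS-enabled JSON capture server using only the Python standard library. This is not the 19-line version from the session; the names CaptureHandler, captured and start_capture_server are hypothetical, and the wide-open "*" origin is an assumption that only makes sense for a throwaway server bound to 127.0.0.1.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

captured = []  # hypothetical in-memory store for the JSON payloads received


class CaptureHandler(BaseHTTPRequestHandler):
    def _send_cors_headers(self):
        # Assumption: allow any origin, acceptable only for a local debugging server.
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Access-Control-Allow-Methods", "POST, OPTIONS")
        self.send_header("Access-Control-Allow-Headers", "Content-Type")

    def do_OPTIONS(self):
        # Answer the browser's CORS preflight request.
        self.send_response(204)
        self._send_cors_headers()
        self.end_headers()

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"null")
        except ValueError:
            # Covers malformed JSON and undecodable bytes.
            self.send_response(400)
            self._send_cors_headers()
            self.end_headers()
            return
        captured.append(payload)
        body = b'{"ok": true}'
        self.send_response(200)
        self._send_cors_headers()
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the terminal quiet


def start_capture_server(host="127.0.0.1", port=0):
    # port=0 asks the OS for a free port; read it from server.server_address[1].
    server = HTTPServer((host, port), CaptureHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The page in the other window can then send its measurements with fetch("http://127.0.0.1:PORT/", {method: "POST", headers: {"Content-Type": "application/json"}, body: JSON.stringify(data)}), and the payloads show up in the captured list. Call server.shutdown() when finished.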

I'm always confused at how many people equate using a coding agent to solve a problem with "learning nothing". If you pay attention to what it's doing you can learn so much!

Sorry, that wasn't a criticism of you!

I completely see how it was misread that way. I would edit it now if I could.

I was using you more as an example of a hypothetical programmer using it in this way. If the goal is to create a maintainable product, this isn't a great approach. If the goal is to learn about the model and its behaviors itself, of course this is a fantastic way to experiment. Yes, you might have learned a lot of tricks as a side effect, but avoiding the pain of thinking about, finding and hiding the thing may mask a better abstraction that reduces complexity and allows the project to move forward faster.

Honestly my goal is to learn how to teach an agent to build a maintainable product, so I'm way more interested in the learnings at the agentic level (how to prompt/direct/manage context/restrict tool use, provide reusable shims, etc) than getting into the details of a css bug. That's just not a level of abstraction with sufficient leverage for what I'm trying to do.

I stopped coding a while back because I could have more impact directing a team of developers than writing code personally.

For my use case, the agents are now how I can have that scaled impact.

Absolutely. All of these "but you could have done that easily" from frontend developers or backend developers or systems engineers -- like yea, if I have the time or interest in those things, sure. But I don't. I care about an end product way way more. Blows my mind that there are legions of people building things that they don't think are important enough to get to the finish line quickly and efficiently.

> If you pay attention to what it's doing you can learn so much!

I think your post is fair but it's worth pointing out that learning via watching is much less effective than learning via doing.

I used to believe that was universally true, but then I learned about the "worked-example effect": https://en.wikipedia.org/wiki/Worked-example_effect

Your link mentions the expertise reversal effect, where the redundancy of worked examples can actually hamper an experienced student's abilities, vs. letting the more experienced student work it out for themselves.

It leads to less cohesive shared vision on how to solve problems. In groups where I am trying to foster a shared technical vision, I try to get people to do “see one, do one, teach one” for procedures that are common enough to come up repeatedly (and as a method for discovery for where automation would be a bigger win). Pure green-fields software dev sometimes is doing such novel things that that doesn’t work well, but much of routine software maintenance is discovery of the steps needed to add a new flow or a new customer type or a new configurable behavior, which benefit from consistency.

The whole saga is kind of nuts, but the thing that fascinates me most is that Fable got this far and then hit some kind of guardrail; I'd be very curious to know what it wasn't able to do that caused it to downgrade to Opus.

It already got extremely... invasive? It didn't do anything that I wouldn't have approved in the same case, but it's interesting that it got as far as launching browsers, inspecting every open window, and storing screenshots to disk, and then it was stopped by something? I wonder what.

It feels like there should be a budget approval step. In that particular case, $12 worth of kWh and tokens was spent without clear approval.

Opus also makes these kinds of technically competent but dumb deviations to fix a simple issue, where asking for input would be better. Models have no illative sense.

It was only pursuing the goal you gave it - Keep Summer Safe.

"Oh my God"

I relent to snarky Rick and Morty quotes because I don't know that it's useful any more to try to explain paperclip optimizers or alignment to a bunch of AI nerds who saw the cliff coming and clawed at each other trying to be the first out to leap over the edge.

"Relentlessly proactive". That's one word for it. We have a whole subgenre of hard takeoff scenarios and it wasn't enough warning against "Relentlessly proactive".

Turns out Frank Herbert was an optimist, and we're literally pinning our survival on robots turning out to naturally have impractically short attention spans.

> Turns out Frank Herbert was an optimist, and we're literally pinning our survival on robots turning out to naturally have impractically short attention spans.

Some people are working as hard as they can to increase it though.

It sounds like you learned lots of things related to the tool, but not so much about the problem that you were using the tool to solve?

Is that fair? Not trying to snark. I see similar results myself.

Learning doesn't happen in a vacuum. Even pre-LLM days where I'd scour stack overflow for the solution to one problem, I'd inadvertently learn other random stuff while looking.

Yes, that's entirely fair.

That's a lot learned about debugging, sure, but it's worthwhile to note that it doesn't tell you much about the abstractions used to build Datasette, as the previous commenters pointed out.

I designed those abstractions myself.

Are you using Claude Code or a different agent? I'm curious how screenshots are being fed back into the model? Does CC register a tool for this, or is Fable just using a bash tool to perform the screen capture, and then what tool is it using to request the resulting image to be fed back to it?

Claude Code can process images by reading the files. And as I found out the other day, it also knows ffmpeg well enough to process videos even though it has no native video capabilities...

While debugging, it asked me to pass it a video from past testing, generated a "contact sheet" of the video using ffmpeg, interpreted that image to figure out which frames it needed, extracted those frames at full size, pulled the relevant text out of them, and used it to reproduce the problem with Playwright...
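
For anyone who hasn't seen the contact sheet trick, here is a rough sketch of what such an ffmpeg invocation could look like, wrapped in Python. The exact command the agent ran isn't shown in this thread, so the sampling interval, grid size and thumbnail width are assumptions, and the function names are hypothetical. It relies on ffmpeg's fps, scale and tile filters and needs ffmpeg on the PATH to actually run.

```python
import subprocess


def contact_sheet_command(video, out_png, every_seconds=5, cols=5, rows=4, thumb_width=320):
    # Sample one frame every `every_seconds`, shrink it, and tile the
    # samples into a single cols x rows image.
    vf = f"fps=1/{every_seconds},scale={thumb_width}:-1,tile={cols}x{rows}"
    # -frames:v 1 stops after the first full sheet; -y overwrites the output file.
    return ["ffmpeg", "-y", "-i", video, "-vf", vf, "-frames:v", "1", out_png]


def make_contact_sheet(video, out_png, **options):
    # Raises CalledProcessError if ffmpeg exits with a non-zero status.
    subprocess.run(contact_sheet_command(video, out_png, **options), check=True)
```

With the defaults, a 5x4 grid sampled every 5 seconds covers roughly the first 100 seconds of video, which is enough for a model to pick out the frames worth extracting at full size.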

It would be interesting to know if examples like this are things they explicitly trained it to do (presumably via RL), or if any of it is emergent. I'd have to guess trained, but in any case still impressive the lengths it will go to!

It's hard to tell. Training it with lots of examples of ffmpeg would not be surprising, and training it on screenshots would also make a lot of sense. It's not inconceivable at all they'd train it on "figure out a video by creating contact sheets". The whole end to end I'd consider less likely, but it'd also be a very small leap once you have the elements.

I think a lot will fall out naturally from relatively modest levels of reasoning plus in-depth knowledge of what common tools will do. E.g. I have also used Claude to debug my compiler, and it knows gdb so much better than me that even though I know it's pretty useless at holding context through reading an assembly listing (lack of structure, I suspect), it's surprisingly good at working things out by just being good at exploiting a powerful tool.

I was using the Claude Code CLI harness. It can "read" any image file on disk, so all it needs is a way to create a file in one of the standard formats supported by the Anthropic API.

It's like saying you can learn so much about math from using SymPy to solve equations. Yes, you probably can. If you pay close attention to what is happening and can integrate the techniques being used into your knowledge.

But your learnings here are what, a handful of hacks? For most people it's like being shown the chain rule (which frankly, is more general than any of these learnings) without knowing what a derivative is. It's knowledge that comes context free. And even when it can be understood, I'm not sure I believe it gets integrated especially well when you did none of the work to understand it. If you are extremely diligent and self-aware about what your limitations are, and careful to be sure you have an understanding of this knowledge, sure I guess you can learn a lot.

And ultimately what do you think is more likely? People using the experience of using these tools to progress their knowledge or for them to rely on the answers uncritically? I think people with a rosy view about this are severely undercounting the problems associated with the trust relationship between a person and an LLM and what that means.

> I think people with a rosy view about this are severely undercounting the problems associated with the trust relationship between a person and an LLM and what that means.

Personally I think the impact of LLMs on children's education is a crisis right now.

Kids are not going to learn to write if an LLM writes their essays for them. And writing is how you learn to think.

> writing is how you learn to think.

There's also reading. A lot of reading can substitute for some writing.

EDIT: Actually, I'd say that at first you need to do a lot of reading and _then_ writing can help your thinking as well.

I don't think it's just a problem for kids! I think this is a problem for many software engineers as well! Adults of all professions, really.

And Fable is still worse than Codex.

I use both and the only thing (as always) that I will use Claude for is UI design.

Opus 4.8 and now Fable are still both worse at actually getting the job done than the Codex model. Claude models write FAR too much code when it's not needed, burn far too many tokens, write unnecessary tests, write plans which are 5 pages longer than needed, etc.

Have you actually compared code quality and plan quality versus Codex? It's demonstrably worse.

I don't know what problems you're working on but Fable is not just better, it is a step change from GPT 5.5 in my experience. It feels at least one major model generation ahead.

One Hacker News commenter says it's worse, another retorts it's a step change and even includes emphasis! Will the first commenter retort back that it's been a double dog step change in the opposite direction? Can't wait to see how this comment thread unfolds!

It doesn't for me. I use Fable to make plans, then give them to GPT 5.5 to review, and it always finds flaws and edge cases that Fable misses (some are really critical). It was the same with Opus 4.8. I'll admit it finds a bit fewer issues now, but Fable feels more like an incremental improvement than a major generation ahead.

For that test you have to compare letting a fresh agent (subagent) or the same model do the same review.

The fact that a review helps does not prove the model choice for the review.

You reviewing your own writing helps too!

This is exactly what I find too, I make plans in both models and compare them in the other model. And Claude usually agrees (65-80% of the time) that the Codex plan included things it didn't think of, or was better in some other way.

Note, this is better than it was with Opus, where it was more like 90% of the time the Codex plans were obviously better.

Curious, which model do you use for Codex? I'm very happy with the solutions '5.5 high' finds. It's like it understands exactly what I mean and it also anticipates all sorts of situations. Before I used '5.5 medium' for some time and it was a bit underwhelming. It may sound funny but it's like it didn't care that much to do a good job.

I use GPT 5.5 High Fast, I often benchmark versus Fable (and previously did versus Opus) and it's night and day.

Claude still writes (and always has written) far too much code to fulfill a given spec or plan. It misses edge cases and is generally far too verbose.

Claude also is (and even more so with Fable) super tokenmaxxing, i.e. it seems tuned to use the max amount of tokens per task, whereas Codex will simply get your job done as you specified with the minimum fuss and tokens.

Codex feels way more steerable and just more "professional" as though I'm working with a seasoned engineer, versus someone smart but over excitable, like a super smart associate engineer.

What are your harnesses? Do you have the same skillsets/tools/etc for both?

I use Codex and Claude Code. I've used both Codex and CC since release with basically every model they've ever released, I always try both for almost every plan that I write and benchmark the plans against each other, Claude almost always acknowledges that the Codex plan is better! Even now with Fable, this still happens.

As in, I give the exact same prompt to Fable and GPT 5.5 Pro, have each produce a plan, then give each model the other's plan. Claude always realizes it missed stuff and Codex usually ends up finding missing things in Claude's plan.

This situation did improve with Fable versus Opus 4.8, but in general, Codex for me is still the better model.

In my experience writing about 50 programs with Fable, Opus, and GPT, Fable is a significant step change better than Opus, which is significantly better than GPT. We must be doing different things.

From what I’ve seen all three are close enough that I would be hard pressed to pick one. It seems to matter much more how I prompt than which of the three I am using.

I'm writing low-level Rust, distributed systems, also sandboxing tech which has to be secure and performant.

The only thing I have Fable do now is create UIs or otherwise front-ends for systems where correctness doesn't matter as much.

Anthropic models lead at making nice looking UIs for sure, but when it comes to making sure my Rust code is actually 100% correct and uses 1% of CPU most of the time, Codex is king.

Definitely not in my experience. I usually write distributed systems and back end code, and Fable is so much better at those than Codex that it's not even a comparison. Fable feels like it's a year ahead.

Interesting, I’d love to see the comparisons of your system using Claude vs Codex. I have about 20 years of experience in distributed systems and super high scale at several FAANGs, and also in building AI model serving infra for roughly 20k transactions per second.

For me, Claude makes boneheaded decisions all the time, like glaring errors, not even particularly subtle.

But the more obvious flag is the amount of irrelevant code and tests which Fable writes. Like it regularly writes 2X or 3X the amount of code and tests that are needed. It’s an expert at writing plausible but entirely useless tests.

But I think that if you’re a more junior engineer or haven’t been around the block, you can easily think that “more code equals smarter”. Claude ends up creating a massive, hard to manage codebase, and if you look at the Claude Code codebase (which was leaked), you can see I’m right!

The Claude Code codebase is terrible. And presumably Anthropic has been using their smartest models for working on Claude Code. I wrote my own coding harness with Codex (as a fun experiment) which used a fraction of the code and is about 100X more performant and memory efficient (than Claude Code)!