Are you using Claude Code or a different agent? I'm curious how screenshots are being fed back into the model. Does CC register a tool for this, or is Fable just using a bash tool to perform the screen capture? And if so, what tool is it using to get the resulting image fed back to it?

Claude Code can process images by reading the files. And as I found out the other day, it also knows ffmpeg well enough to process videos even though it has no native video capabilities...

While debugging, it asked me to pass it a video from past testing, generated a "contact sheet" of the video using ffmpeg, interpreted that image to figure out which frames it needed, extracted those frames at full size, pulled the relevant text out of them, and used that to reproduce the problem with Playwright...
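For anyone wanting to try the same trick by hand, here is a rough sketch of the two ffmpeg steps (contact sheet, then a full-size frame). I don't know the exact commands Claude ran; the function names, file names, and defaults below are my own assumptions. The `select`, `scale`, and `tile` filters and the `-ss` / `-frames:v` options are standard ffmpeg.

```python
from typing import List


def contact_sheet_cmd(video: str, sheet: str, cols: int = 4, rows: int = 4,
                      every_n: int = 30, thumb_width: int = 320) -> List[str]:
    """Build an ffmpeg argv that tiles every Nth frame into one image.

    Hypothetical helper: the sampling rate and grid size are guesses,
    not what Claude Code actually used.
    """
    vf = (
        f"select='not(mod(n,{every_n}))',"  # keep one frame out of every_n
        f"scale={thumb_width}:-1,"          # shrink, preserving aspect ratio
        f"tile={cols}x{rows}"               # lay the thumbnails out in a grid
    )
    # -frames:v 1 stops after the first completed grid image.
    return ["ffmpeg", "-y", "-i", video, "-vf", vf, "-frames:v", "1", sheet]


def extract_frame_cmd(video: str, seconds: float, out: str) -> List[str]:
    """Build an ffmpeg argv that grabs one full-size frame at a timestamp."""
    return ["ffmpeg", "-y", "-ss", f"{seconds:.3f}", "-i", video,
            "-frames:v", "1", out]
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine with ffmpeg installed. With the defaults, one sheet covers the first 16 sampled frames, so a longer video needs a larger `every_n` or a bigger grid.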

It would be interesting to know if examples like this are things they explicitly trained it to do (presumably via RL), or if any of it is emergent. I'd have to guess trained, but in any case still impressive the lengths it will go to!

It's hard to tell. Training it with lots of examples of ffmpeg would not be surprising, and training it on screenshots would also make a lot of sense. It's not inconceivable at all they'd train it on "figure out a video by creating contact sheets". The whole end-to-end workflow I'd consider less likely to have been trained explicitly, but it'd also be a very small leap once you have the elements.

I think a lot will fall out naturally from relatively modest levels of reasoning plus in-depth knowledge of what common tools will do. E.g. I have also used Claude to debug my compiler, and it knows gdb so much better than I do that, even though I know it's pretty useless at holding context while reading an assembly listing (lack of structure, I suspect), it's surprisingly good at working things out just by being good at exploiting a powerful tool.

I was using the Claude Code CLI harness. It can "read" any image file on disk, so all it needs is a way to create a file in one of the standard formats supported by the Anthropic API.
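To make "a file in one of the supported formats" concrete: as far as I understand the Anthropic Messages API, an image goes in as a base64 content block with a media type of JPEG, PNG, GIF, or WebP. Below is a minimal sketch of packaging a file that way. The `image_block` helper and the extension table are my own, and I'm not claiming this is how the Claude Code harness does it internally.

```python
import base64
from pathlib import Path

# Image formats accepted by the Messages API, keyed by file extension.
MEDIA_TYPES = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".webp": "image/webp",
}


def image_block(path: str) -> dict:
    """Read an image from disk and wrap it as a base64 image content block.

    Hypothetical helper: it only builds the dict and makes no API call.
    """
    p = Path(path)
    media_type = MEDIA_TYPES.get(p.suffix.lower())
    if media_type is None:
        raise ValueError(f"unsupported image extension: {p.suffix!r}")
    data = base64.standard_b64encode(p.read_bytes()).decode("ascii")
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": data},
    }
```

The returned dict would sit in the `content` list of a user message next to a text block. Anything that can write a PNG or JPEG to disk (a screenshot tool, Playwright, ffmpeg) is therefore enough to get pixels in front of the model.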