A 77% score on terminal-bench 2 is really impressive. I remember reading the article about the pi coding agent (https://mariozechner.at/posts/2025-11-30-pi-coding-agent/) getting into the top ten percent of agents on that benchmark. It got about 50%. While it may still be in the top ten, that category just turned into one champion and a long tail of inferior offerings.

I was shocked to see that in the prompt for one of the landing pages the text “lavender to blue gradient” was included as if that’s something that anybody actually wants. It’s like going to the barber and saying “just make me look awful”.

This was my first time actually seeing what the GDPval benchmark looked like. Essentially they benchmark all the artifacts that HR/finance might make or work on (onboarding documents, accounting spreadsheets, PowerPoint presentations, etc.). I think it’s good that models are trained to generate things like this well, since people are going to use AI to do this anyway. If the middlemen passing AI outputs around are going to be lazy, I’m grateful that at least OpenAI researchers are cooking something behind the scenes.