They used GPT-4.1 nano; the results would likely be quite different with Sonnet or GPT-5.

I was looking for the frontier curve where they tested their benchmark across different models, since this sort of behavior is highly sensitive to parameter count, architecture, training, and fine-tuning. It's a practically useful question, so I was really disappointed that a) they didn't publish their code so you could test it yourself, and b) they didn't do even a cursory examination of other models and sizes.
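Even without their code, a cursory cross-model check is cheap to put together. Here's a minimal sketch of what I mean, assuming the OpenAI Python SDK and an OPENAI_API_KEY in the environment; the model list and the prompt are placeholders, since the paper's actual benchmark items aren't published:

    # Minimal cross-model sweep: run the same prompt against several models
    # and compare outputs. PROMPT is a stand-in for one real benchmark item.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    MODELS = ["gpt-4.1-nano", "gpt-4.1-mini", "gpt-4.1"]  # extend as needed
    PROMPT = "..."  # placeholder for a benchmark task

    def run_once(model: str, prompt: str) -> str:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return resp.choices[0].message.content

    for model in MODELS:
        print(model, "->", run_once(model, PROMPT))

Nothing fancy, but even a sweep like this would show whether the effect holds up outside one small model.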

Or just regular GPT-4.1; it's quite a capable model.

trust me bro, the next model bro, it's just way better bro

To be fair, nano was an absolute crap model when it came out.