The people who work at Anthropic are aware of simonw and his test, and people aren't unthinking data-driven machines. However valid his test is or isn't, a better score on it is convincing. If it gets, say, 1,000 people to use Claude Code over Codex, how much would that be worth to Anthropic?
$200 * 1,000 = $200k/month.
I'm not saying they are, but to say with such certainty that they aren't, when money is on the line, seems like a questionable conclusion, unless you have some insider knowledge you'd like to share with the rest of the class.
I suspect they're training on this.
I asked Opus 4.6 for a pelican riding a recumbent bicycle and got this.
https://i.imgur.com/UvlEBs8.png
It would be way, way better if they were benchmaxxing this. The pelican in the image (both images) has arms. Pelicans don't have arms, and a pelican riding a bike would use its wings.
Having briefly worked in the 3D Graphics industry, I don't even remotely trust benchmarks anymore. The minute someone's benchmark performance becomes a part of the public's purchasing decision, companies will pull out every trick in the book--clean or dirty--to benchmaxx their product. Sometimes at the expense of actual real-world performance.
Pelicans don’t ride bikes. You can’t have scruples about whether or not the image of a pelican riding a bike has arms.
Wouldn’t any decent bike-riding pelican have a bike tailored to pelicans and their wings?
Sure, that’s one solution. You could also Isle of Dr Moreau your way to a pelican that can use a regular bike. The sky is the limit when you have no scruples.
Now that would be a smart chat agent.
Interesting that it seems better. Maybe something about adding a highly specific yet unusual qualifier focusing attention?
I don't think that really proves anything. It's unsurprising that recumbent bicycles are represented less in the training data, so the model is less able to produce them.
Try something that's roughly equally popular, like a turkey riding a scooter, or a yak driving a tractor.
Perhaps try a penny-farthing?
There is no way they are not training on this.
I suspect they have generic SVG drawing that they focus on.