With a few exceptions, the selected models were non-reasoning models. Naturally, reasoning performance will be poor without reasoning enabled, so this is not a surprising result.

I asked GPT-5.2 10 times with thinking enabled and it got it right every time.

Thinking or extended thinking?