With a few exceptions, the selected models were non-reasoning models. Naturally, without reasoning enabled, performance on reasoning tasks will be poor. This is not a surprising result.
I asked GPT-5.2 10 times with thinking enabled and it got it right every time.
Thinking or extended thinking?