Only testing GPT-4.1-nano makes this basically useless. Most people are almost certainly using GPT-5 mini or better. This very poor analysis is like an LLM literacy test for readers.
Please go away and do the work for us and let us know what amazing accuracy you got with whatever version you think is better.
Anything below 100% is actually pretty useless when it comes to stats.
If you want 100% accuracy on these kinds of tasks with LLMs, you can get it today, but you need to give the LLM the ability to run Python code and tell it to use something like Pandas.
You can confirm it's doing the right thing by reviewing the code it wrote.
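For example, here's a minimal sketch of the kind of code you'd want the model to produce (the table and the question are invented for illustration):

    import pandas as pd

    # Hypothetical table the model was asked about:
    # "Which region had the highest total sales?"
    df = pd.DataFrame({
        "region": ["North", "North", "South", "South"],
        "quarter": ["Q1", "Q2", "Q1", "Q2"],
        "sales": [120, 135, 98, 160],
    })

    # Aggregate exactly instead of having the model eyeball the table.
    totals = df.groupby("region")["sales"].sum()
    print(totals.idxmax(), totals.max())  # -> South 258

Because the arithmetic happens in Pandas rather than in the model's token predictions, the answer is exact, and the code itself is the audit trail.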
Or you can just write the code to do it correctly yourself, which would be quicker. If you can review it properly, you already understand how to do it.
That would require me to have memorized the pandas API.
I've been using pandas on and off for over a decade and I still haven't come close to doing that.
Simon is right about using code execution, but many of the tables one looks at outside of formal data work are small enough for LLMs to handle very reliably, so this question of format is practically relevant. I wish they had tested better models.