Title says "LLMs" (plural) but they only tested one

> We only tested OpenAI’s GPT-4.1 nano.

This should be higher. While the research question is interesting, a sample size of one model makes the conclusion highly suspect. I'd like to see more research on this.

And not even a commonly used one. Gemini Flash or o4-mini would have been a much better choice if they wanted a cheap model.