> 60.7%

Why would anyone trust the output of an LLM if it is barely better than guessing and much, much worse than humans?

GPT-5 posts more impressive numbers, but for that particular task the precision should always be 100%, no matter how large the data set is or what format it comes in. Why are we doing this?