> 60.7%
Why would anyone trust the output of an LLM if it is barely better than guessing and much, much worse than humans?
GPT-5 shows more impressive numbers, but for this particular task the precision should be 100%. Always. No matter how large the data set is or what format it comes in. Why are we doing this?