Cherry-picking is fun, but most of these are real, verifiable facts that the models get... straight up wrong.

> 3c24b5fe "Debian Security Advisory DSA-180-1 describes a buffer overflow vulnerability involving Cyrus SASL usernames." TRUE Mostly True FALSE FALSE FALSE

This is false: https://lwn.net/Articles/13296/

> 801cb8c1 "Equal Measures 2030's 2024 SDG Gender Index provides a downloadable dataset that includes a field labeled 'required annual change'." TRUE Mostly True TRUE FALSE FALSE

This is false: https://equalmeasures2030.org/2024-sdg-gender-index/

This is the "confidently wrong" problem, and the reason that LLMs won't ever be taken seriously for anything but a few niche use cases (like generating slop code and pumping out marketing materials) where being wrong isn't the end of the world. It's akin to speech-to-text: it's wrong often enough that, while it's a fun novelty, you don't see business units writing reports in Word using STT.

I would encourage everyone to skim through the real 1000-question dataset: https://lenz.io/research/llm-disagreement/data.csv

If the LLMs in this particular exercise had been allowed to answer "I don't know", I expect they would have.