Two of the five models used (Gemini+Search and Sonar Pro) have retrieval capabilities and used search when classifying the claims. The disagreement between them is still quite significant: 42%.
Here are those disagreements:
https://lite.datasette.io/?csv=https%3A%2F%2Fstatic.simonwil...
One example:
Researchers estimate that the average person ingests about 5 grams of plastic per week, which is approximately the weight of a credit card.
Gemini retrieval: Misleading
Sonar Pro: Mostly True
Read literally, the statement is perfectly true: some researchers did make this estimate, and a credit card is a fair proxy for a 5 g mass.
Was the research flagrantly incorrect? Yes. But that does not affect the truth of the statement.
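For anyone who wants to check a figure like the 42% themselves, here is a rough sketch of how a pairwise disagreement rate could be computed. The function name and the labels below are made up for illustration; they are not the real dataset or the method actually used.

```python
def disagreement_rate(labels_a, labels_b):
    """Fraction of claims where the two models assigned different labels."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must be the same length")
    if not labels_a:
        raise ValueError("need at least one claim")
    # Count positions where the two labels differ.
    differing = sum(a != b for a, b in zip(labels_a, labels_b))
    return differing / len(labels_a)


# Hypothetical labels, one per claim, in the same claim order for both models.
gemini_labels = ["Misleading", "True", "False", "Mostly True"]
sonar_labels = ["Mostly True", "True", "False", "True"]

rate = disagreement_rate(gemini_labels, sonar_labels)
print(f"{rate:.0%}")
```

Note that this counts any label mismatch as a disagreement, so "Misleading" vs "Mostly True" weighs the same as "True" vs "Mostly True". A stricter or looser definition would give a different number.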