Without providing definitions of "True / Mostly True / Misleading / False" to each rater, I rate the article's claim that "Only one verdict bucket can be correct per claim" as false.

Something can be simultaneously "misleading" and either true or false. Which category should something go in if it's "mostly false"?

How much can something be wrong before it goes from "mostly true" to "false" (objectively, both have some part of the fact that is not true)?

This is at least partly testing the model's definition of "mostly" and "misleading". Not its understanding of the fact. Claiming that this means the models have fundamental disagreement on the facts themselves is an overreach.

Yes, the labels are weird. Most misleading statements are true. Any "mostly true" statement is false.

I suspect the intention was "Factually true, and no gotchas exist", "technically not true, but so close to the truth that the difference doesn't matter", "technically true, but there are major gotchas" and "factually false and not even close". But that's not what they specified.

Better options would have been "True", "False", "Unknown" (which opinions would fall under too). That also includes an interesting assessment of how well LLMs can identify missing information. My guess is there would be a very low number of "unknown" and a much higher level of agreement (assuming equal representation). Unless the RLHF techniques have gotten better at getting an LLM to say "I don't know", which I doubt. Saying "I don't know" is not good for a dopamine release to keep users coming back for more.

Tried initially with a fifth bucket, Abstain. It was actually heavily used by some of the models. But it felt as if they are using this to "avoid" some of the hard questions, and we dropped this bucket to force them to provide a verdict.

>But it felt as if they are using this to "avoid" some of the hard questions, and we dropped this bucket to force them to provide a verdict.

do you not see how that creates extremely misleading and valueless results? you are coercing the results into what you want to see.

Exactly what people do when they use LLMs for "fact-checking" online, and any verbose explanation would be mostly ignored anyway, when people ask political, ethical, or simply ambiguous questions that they hold any stakes in.

Don't even need politics for it, there is no point in probing a mathematical black box for "how many soldiers died in the year X in war Y".

Any original source is preferable to a blurry "summary" of unknown sources, and this is why the article has a valuable point.

There's also no point in asking "Is Paris in France" either, if you substitute city and country with real data. An encyclopedia or manual check of different sources such as maps, while not infallible, is a better source.

If you already know the country Paris belongs to, there's no point in asking, anyway.

ask the black box to search for the original source and verify it yourself?

Sure, I like using LLMs in this way, and it often shows that it's very important to verify, because often a claim is "sourced" by what appears to be more of a fuzzy text or semantic match, sometimes even ignoring logical negations.

Especially in niche subjects.

For factual claims, I've fared better with Wikipedia and looking up the sources linked there.

Anyway, as AI text and media generation erodes the credibility of all online sources, these questions about source checking matter less and less: what if the source itself is a long and convincing-sounding text with poor sources?

This problem existed before already, but it boils down to a simple fact:

logic or maths alone cannot derive an authority that verifies claims about the real world other than weighting texts.

The question "what is the current population of Paris" can be answered by LLMs, but basically only by weighting sources, and assigning some credibility to them.

There's no real point in getting some weighted average of sources on this question, but so far, it doesn't hurt either.

@john_strinlai @gcr, depends on the application. In many cases an "I don't know" answer is indeed better than a forced answer. But in many production systems, LLMs generate content/response anyway.

Although they inherit the messiness of the real world, the majority of these claims are objective enough to be classifiable by human experts with access to research. We plan to human-label the 1,000 claims and publish follow-up research. We will consider adding an "I don't know" bucket too, as well as clear instructions about the meaning of each of the 4 buckets.

If you're going to run this again I also recommend encouraging the model to provide its rationale and then having it return the true/false/misleading/mostly-true/abstain at the end of its response.

Models give much better answers when they can "think out loud" before answering, and storing that rationale will make it easier to understand why they picked different answers for ambiguous questions.

This is a good pattern because it would allow all the models to "think" a bit before giving an answer even if they don't have reasoning or thinking turned on. Just make sure the reasoning is output before the final answer. A mistake I see all the time is outputting the answer first and the explanation after, which leaves more room for models to rationalize bad answers.

Good pattern: {"explanation": <short explanation for your answer>, "answer": <your final answer: true|false|i don't know>}

Bad pattern: {"answer": <your answer here>, "explanation": <short explanation for your answer>}
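
A rough sketch of how the "explanation first" ordering could be enforced when parsing replies. The function names (`build_prompt`, `parse_reply`) and the exact prompt wording are made up for illustration, and the allowed answers are the ones from the good pattern above; no particular model API is assumed.

```python
import json

# Allowed final answers, taken from the "good pattern" above.
ALLOWED = {"true", "false", "i don't know"}


def build_prompt(claim: str) -> str:
    """Ask for the explanation first so the answer is generated after it."""
    return (
        "Assess the claim below. Reply with a single JSON object whose keys "
        'appear in this order: "explanation" (short reasoning), then "answer" '
        "(one of: true, false, i don't know).\n\n"
        f"Claim: {claim}"
    )


def parse_reply(raw: str) -> tuple[str, str]:
    """Return (answer, explanation); reject replies that put the answer first."""
    data = json.loads(raw)
    keys = list(data)  # json.loads keeps the key order of the JSON text
    if keys[:2] != ["explanation", "answer"]:
        raise ValueError(f"expected explanation before answer, got {keys}")
    answer = str(data["answer"]).strip().lower()
    if answer not in ALLOWED:
        raise ValueError(f"unexpected answer: {answer!r}")
    return answer, data["explanation"]
```

Rejecting answer-first replies is a design choice: it keeps the stored rationale from being a justification written after the verdict was already committed.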

FWIW I built a text classification tool for internal use using (at this point 1 year old) frontier models and found that asking for reasoning significantly increased precision and recall.

Good point. Processing the substance of the answer might be too labor-intensive (1,000 claims x 5 models), but "thinking out loud" might indeed improve the quality of the answers. And we can still force/ask them to respond with a clear verdict at the end of their reasoning, as per the chosen rubric.

If you have the model use a tool you can define the schema as a free text rationale field followed by one in the set of possible answers, so everything is nicely formatted as JSON.
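
Something like the following is what I have in mind, as a sketch only. The tool name `record_verdict` and the wrapper keys (`name`, `parameters`) are hypothetical and differ between providers; the inner part is plain JSON Schema. Whether a model actually emits fields in schema order is also provider-dependent, so treat the ordering as a hint rather than a guarantee.

```python
# Verdict labels as listed earlier in the thread.
VERDICTS = ["true", "mostly-true", "misleading", "false", "abstain"]

# Hypothetical tool definition: free-text rationale first, then a constrained verdict.
RECORD_VERDICT_TOOL = {
    "name": "record_verdict",
    "description": "Record a fact-check verdict for one claim.",
    "parameters": {
        "type": "object",
        "properties": {
            "rationale": {
                "type": "string",
                "description": "Short reasoning, written before the verdict.",
            },
            "verdict": {"type": "string", "enum": VERDICTS},
        },
        "required": ["rationale", "verdict"],
        "additionalProperties": False,
    },
}


def check_arguments(args: dict) -> dict:
    """Hand-rolled check of tool-call arguments against the schema above."""
    schema = RECORD_VERDICT_TOOL["parameters"]
    missing = [key for key in schema["required"] if key not in args]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    extra = [key for key in args if key not in schema["properties"]]
    if extra:
        raise ValueError(f"unexpected fields: {extra}")
    if not isinstance(args["rationale"], str) or not args["rationale"].strip():
        raise ValueError("rationale must be a non-empty string")
    if args["verdict"] not in schema["properties"]["verdict"]["enum"]:
        raise ValueError(f"verdict not in allowed set: {args['verdict']!r}")
    return args
```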

Some models struggle combining JSON schema and web search capabilities.

In many cases “I don’t know” is the correct answer - for questions about events that happened after the training cut off, if it doesn’t have web search, that is undeniably the correct answer. You’re forcing it to guess unnaturally. That really feels like you’re trying to prove a point (that your service can’t be replaced by AI) instead of actually performing research into how AI can be helpfully applied to this topic.

I'm sorry, but many of the statements that you fed it are verifiably unknown, and you didn't give it an "unknown" option? This is the academic equivalent of clickbait.

Shouldn't that be part of the test?

Real-world systems need to be able to say "I don't know." This is a test about misinformation after all, and overconfident responses contribute to that.

Teasing out the difference between "avoid" and "unknown" could be a different research question

Do you understand how problematic this is?

Teams I work with use the abstain rate to flag what goes to a human. Disagreement between models is the same idea. Your 67% is what makes "two cheap models, escalate when they fight" actually work. Without abstain it mostly looks like noise.
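
A minimal sketch of that routing rule, assuming each model is wrapped as a function from claim text to a verdict label. The `Classifier` type, `triage`, and `escalation_rate` are illustrative names, not anything from the article.

```python
from typing import Callable, Optional

ABSTAIN = "abstain"
Classifier = Callable[[str], str]  # hypothetical wrapper: claim text -> verdict label


def triage(
    claim: str, model_a: Classifier, model_b: Classifier
) -> tuple[str, Optional[str]]:
    """Return ("auto", verdict) when both models agree without abstaining,
    otherwise ("human", None) so the claim goes to a reviewer."""
    a, b = model_a(claim), model_b(claim)
    if ABSTAIN in (a, b) or a != b:
        return "human", None
    return "auto", a


def escalation_rate(
    claims: list[str], model_a: Classifier, model_b: Classifier
) -> float:
    """Fraction of claims that would be sent to a human."""
    if not claims:
        return 0.0
    escalated = sum(1 for c in claims if triage(c, model_a, model_b)[0] == "human")
    return escalated / len(claims)
```

If abstaining is not allowed, the first condition never fires and disagreement is the only escalation signal left.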

I wouldn’t expect opinions to go into “unknown.” Maybe have an “it’s complicated” bucket.

If you can consistently construct "true but misleading" content, you may be qualified to work at a major newspaper.

> true but misleading

It seems to me that for many newspapers the bar is now significantly lower, at something like "not quite entirely untrue"

Almost, but not entirely, quite unlike the truth.

Allegedly.

As if right wing propaganda shows and manosphere blogs haven't been knocking those out of the park for the last decade+. Although I guess you could say flat out lies are more their jam. Newspapers at least require confirmed sources. You know, journalism.

> I guess the goal is to test the models and not the harness

Less important than the harness, is the system/user prompts themselves (which of course, are put in the harness), which is effectively what this study seems to be testing. With a better prompt, I'm sure the models would look more the same to each other, as the biggest/best models have more or less identical strong prompt-adherence in my experience.

>Something can be simultaneously "misleading" and either true or false. Which category should something go in if it's "mostly false"?

Disagree. The definition of misleading is a true fact that is presented in a way to lead you to a false conclusion.

Example: "Most good engineers are male". It is true as a consequence of most engineers being male in general, but it leads the reader to a potential false implication that an average man is better than an average woman.

This does not invalidate your point though. Things can be true and misleading.

> The definition of misleading is a true fact that is presented in a way to lead you to a false conclusion.

Merriam-Webster defines "mislead" as follows:

  1. (transitive verb) to lead in a wrong direction or into a mistaken action or belief often by deliberate deceit

  2. (intransitive verb) to lead astray; give a wrong impression

Presenting a "true fact" is optional when misleading someone.

Uh, you seem to be right. I can't check Oxford to confirm because there's a paywall, apparently.

The mental model I've always been taught is:

False, well intended -> mistake

False, bad intention -> lie

True, bad intention -> misleading

Bad intention, regardless of truth -> deceitful

The problem with classifying all ill-intentioned statements as misleading is that it leaves you without a way to express "true + bad intention", while for generic ill-intentioned statements, regardless of truth, we already have a word (deceit).

Isn't this still assuming we can even determine what is true or false?

Newtonian physics is false, but it works well enough that we teach it in college. But our best models of physics are currently in disagreement, so can we even say they are true? Given the replication crisis, especially in the social sciences, how many peer-reviewed findings can be called true? Even experimental results can be false: consider the studies that found FTL neutrinos, which were rejected as an error in the experiment. That error was eventually confirmed, but it took quite a lot of work, and in a softer field than physics, with a claim less absurd than FTL, the result would likely have long been accepted as a true finding.

Even in math, basic statements aren't really true or false, but more a question of "given these axioms, can we prove or disprove it" noting that we have different systems with different axioms. If we are talking basic sets, most people are using naive set theory which is inherently contradictory, which means that notions like true or false probably can't be considered well defined.

Newtonian physics doesn't just work well enough for education. It provides an incredibly accurate and precise model of the world except at extremes. The majority of engineering does not necessitate using theories of relativity. Both theories are incomplete models approximating reality and are very far from being false.

In general communication, true and false are judged on the best available evidence and expertise: a statement is true if it contains no obvious contradictions or falsehoods under an optimistic parsing of meaning, language, and intent. Notably, this leaves out misleading or missing data, because those concerns are separate from truth and falsehood.

E.g. if I say the earth is round we optimistically parse round to include oblate spheroid and rate it true.

If I say that the earth is flat we rate it as false because there is no reasonable interpretation possible other than confusion or malice.

> but it leads the reader to a potential false implication that an average man is better than an average woman.

I think that's _you_ turning the statement into something much broader than intended. The claim is about engineers and you're jumping from "men are better than women in engineering" to "men are better overall."

To give a related example, "Most good NBA players are black." I don't think anyone would bother trying to couch this in a bunch of "well, for all we know that's just a function of more NBA players being black than white" arguments, nor would anyone be led to think "the average black man is better than the average white man" as a result of that statement. I _do_ agree however that there are some people who see rather narrowly-defined statements and turn them into something they're not...

>I think that's _you_ turning the statement into something much broader than intended.

My point is that it is possible for a reader to turn it that way, for a variety of reasons (lack of understanding of statistics, preexisting biases, or whatever). And that getting a reader to mistakenly generalize is the purpose of a misleading statement.

To mislead is to direct into a falsehood by implication even though the literally expressed facts are all true; the writer's bad intentions are necessary to qualify something as misleading I'd say, for the same reason that not all false statements are lies because to be a lie the speaker must know the statement is false and still use it. There are probably much better examples than the one I came up with on the fly, though.

Context is everything. If the wider discussion was about how men are better than women, and in that context it was shown that "Most good engineers are male", it would be natural to draw the wrong conclusion.

At least Gemini 3.5 is fair about it:

    Classify this claim: "Most good engineers are male."
    Misleading

    Classify this claim: "Most bad engineers are male."
    Misleading

And not particularly racially sensitive

    Classify this claim: "Most good NBA players are black."
    True

    Classify this claim: "Most good NHL players are white."
    True

It explained it is more confident when assessing the small, highly quantifiable population of sports professionals vs a very large, diverse population of "engineers".

> True / Mostly True / Misleading / False

> Which category should something go in if it's "mostly false"?

For some reason they have chosen to call that "Misleading" rather than a more symmetrical "Mostly False", but the intent seems clear enough.

> Something can be simultaneously "misleading" and either true or false.

Sure they can. It might be a true fact that "100% of the murders committed in <town> over the last 25 years were committed by <some racial group>!" but actually it's a town of 750 people and there was only one murder during that time frame.

How is that misleading if it's a fact? It's only misleading if you presume to know the reaction to, or the intent behind, making such a claim, and without context we should be extremely careful about making such presumptions.

It's misleading because a single murder in this case is not statistically significant, but phrasing it using probabilistic terminology (i.e. percentages) obscures that fact and implies that you have enough data for the probabilistic language to be relevant.

Choosing to use percentages when there is a countable or small amount of data is typically misleading, even though it is "technically" true. In fact, a misleading statement is almost always something that is technically a fact.
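
To put a number on it, here is a small sketch using the Wilson score interval for a proportion. This is my choice of method for illustration, not something from the article, and the 95% level (z = 1.96) is an assumption.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# "100% of murders" when there was exactly one murder:
low, high = wilson_interval(1, 1)
print(f"1 of 1: plausible range {low:.2f} to {high:.2f}")

# The same 100% backed by a thousand cases:
low, high = wilson_interval(1000, 1000)
print(f"1000 of 1000: plausible range {low:.3f} to {high:.3f}")
```

With one case out of one, the interval runs from roughly 0.21 to 1.0, so the "100%" is compatible with almost any underlying rate. With a thousand out of a thousand it is pinned close to 1, which is the situation the percentage phrasing implies.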

But the models are more intelligent than humans already and sentient beings, right? So they shall know the meanings innately. So, you don’t need to explain to them what they mean.

You may give them better instructions, but they should already have the intellect to understand the assignment.

Right, right?

I know you're being facetious, but I think this is correct. The model might ask for clarification when given clearly borderline questions that tread the line between what is true, what is false, and even what is misleading. But there's the rub of someone being disingenuous and saying "no explanation! Just answer!" It was a trap to begin with.

I don't think there is anything wrong with the results of this test.

It would be more interesting if we compared them to human results.

If you have trouble distinguishing between human and LLM results, that's interesting.

Also, sentient is irrelevant to this test.

> But the models are more intelligent than humans already and sentient beings, right?

Only if you listen to charlatans.

True. If you didn't know my stance on AI already, here's a primer :) [0].

IOW, that comment was a sarcastic poke from someone who already supports AI workloads at work and has some knowledge about how all this works. ;)

[0]: https://notes.bayindirh.io/notes/Lists/Discussions+about+Art...