This is a great example of why prompt engineering is still relevant. Without definitions, examples, and a well-defined rubric, you’re going to see different models disagree by a level in either direction. When you get more prescriptive, the models tend to agree better.

I’ve experimented with AI grading for undergraduate math courses, and I see basically the same thing. If you just tell the AI “grade this problem and assign a letter grade,” I’ve only seen about 30% agreement between the human-assigned grade and the AI-assigned grade. That rises to over 75% if you count a “match” as being within one letter grade. To get better agreement, you have to spend a lot more time on the rubric: what kinds of mistakes are a big deal, what kinds are not, how much work has to be shown to get credit, and a couple of examples of each letter grade. Once you have done that, the AI agrees with human graders much more often, but it is hard to know when you’ve given enough guidance for a problem.
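To make the two agreement numbers concrete, here is a minimal sketch of how exact and within-one-letter agreement could be computed. The five-letter scale, the `agreement` helper, and the sample grades are all assumptions made up for illustration; they are not my actual grading data.

```python
# Assumed ordered letter scale; adjust if plus/minus grades are used.
GRADES = ["A", "B", "C", "D", "F"]


def agreement(human, ai, tolerance=0):
    """Fraction of problems where the grades differ by at most `tolerance` letters."""
    if len(human) != len(ai):
        raise ValueError("grade lists must have the same length")
    if not human:
        raise ValueError("grade lists must not be empty")
    rank = {grade: i for i, grade in enumerate(GRADES)}
    hits = sum(abs(rank[h] - rank[a]) <= tolerance for h, a in zip(human, ai))
    return hits / len(human)


# Hypothetical grades, invented purely to exercise the function.
human = ["A", "B", "C", "B", "D", "A", "F", "C", "B", "A"]
ai = ["B", "B", "D", "A", "D", "C", "D", "C", "C", "B"]

exact = agreement(human, ai)                     # strict match
within_one = agreement(human, ai, tolerance=1)   # off by at most one letter
print(f"exact: {exact:.0%}, within one letter: {within_one:.0%}")
```

The gap between the two numbers shows how often the AI is close but not identical, which is where a tighter rubric would be expected to help most.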

That's a valid point. During the preliminary research, we also tried more explicit prompts (with an explanation for each of the 4 buckets), as well as a five-bucket rubric (with an Abstain option). A follow-up paper will show how the concise vs. explicit prompt affects the distribution of verdicts and the level of disagreement. One issue with the longer prompts is that they open up too much room for debate about the exact prompt used. We should probably preregister the prompt before running any further tests.

The other thing I suspect is that "Just give me True/False" cuts off a large part of the search space a modern LLM uses to answer questions. You can see this in reasoning traces: the act of writing out an explanation helps guide the model toward a better answer and makes it more likely to backtrack on a bad decision.

If you let it spew out an explanation along with the answer, I'm curious if the accuracy will improve (I suspect it will).
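A rough sketch of what that comparison might look like: one prompt that demands a bare answer, one that asks for an explanation first, and a parser that takes the last stated verdict so a backtrack in the explanation is respected. The prompt wording, the `Verdict:` line format, and the function names are my own assumptions, not the prompts used in the paper.

```python
import re

VERDICTS = ("True", "False")  # assumed labels for the binary case


def build_prompt(claim, think_out_loud):
    """Build either a bare-answer prompt or an explain-then-answer prompt."""
    if think_out_loud:
        return (
            f"Claim: {claim}\n"
            "Explain your reasoning step by step, then finish with a final "
            "line of the form 'Verdict: True' or 'Verdict: False'."
        )
    return f"Claim: {claim}\nAnswer with exactly one word: True or False."


def parse_verdict(response):
    """Return the last 'Verdict: X' found, else a bare one-word answer, else None."""
    matches = re.findall(r"verdict:\s*(true|false)\b", response, flags=re.IGNORECASE)
    if matches:
        # Keep the final verdict so a mid-explanation change of mind wins.
        return matches[-1].capitalize()
    bare = response.strip().rstrip(".").capitalize()
    return bare if bare in VERDICTS else None


# Hypothetical model output that backtracks partway through.
reply = "At first this looks right. Verdict: True\nOn reflection it fails. Verdict: False"
print(parse_verdict(reply))
```

Running both prompt variants over the same claims and comparing the parsed verdicts against ground truth would show whether the explanation actually improves accuracy.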

Good point. The next version will also include results for a prompt that allows the models to "think out loud" before providing the final verdict.