For those questions, it wouldn’t surprise me at all if five well-educated intelligent humans disagreed on over two out of three of them.
I would answer “don’t know” on many, but that’s not an option.
Yes, disagreement between human annotators is also high on similar types of questions (AVeriTeC): inter-panel agreement there is κ=0.619. I tried giving the models a fifth option, Abstain, but some models seem to use it to "avoid answering hard questions" more than others.
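For anyone who wants to put a number on agreement within a panel of models or humans, here is a rough sketch of a multi-rater kappa. It assumes Fleiss' kappa, which may not be the same variant behind the 0.619 figure above; the function name `fleiss_kappa` and the toy labels are made up for illustration.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each rated by the same number of raters.

    ratings: list of lists, one inner list of labels per item.
    categories: all labels that raters were allowed to choose.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    totals = Counter()   # how often each category was chosen overall
    observed = 0.0       # running sum of per-item agreement
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Fraction of rater pairs that agree on this item.
        pairs = sum(c * (c - 1) for c in counts.values())
        observed += pairs / (n_raters * (n_raters - 1))
    p_observed = observed / n_items

    # Agreement expected by chance from the overall category frequencies.
    n_total = n_items * n_raters
    p_chance = sum((totals[c] / n_total) ** 2 for c in categories)
    if p_chance == 1.0:
        return 1.0  # everyone picked one category everywhere
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical toy data: 3 questions, 5 raters, labels invented for the example.
labels = ["True", "False", "Mixed", "Unknown", "Abstain"]
panel = [
    ["True", "True", "True", "Mixed", "Abstain"],
    ["False", "False", "Mixed", "Mixed", "Unknown"],
    ["True", "True", "True", "True", "True"],
]
kappa = fleiss_kappa(panel, labels)
```

Treating Abstain as just another category, as done here, means a model that abstains often will drag agreement down; dropping abstentions instead would change the number of raters per item, which this simple version does not handle.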