It looks like the author is specifically avoiding the model's name, because the results are really weird.
Opus 4.8/4.7 scored 28%
Opus 4.6 scored 37%
So the author thought, let's not get into that, just write Claude.
Not weird at all, given the variance in Opus' quality over the last few months.
wild guess - I wouldn't be surprised if Opus 4.6 was run quantized for a while, and 4.7/4.8 have QAT for that nerfed size.
Many people think Opus 4.6 was the best.
Hello! Author here (Katie). Ty for your comments. 4.6 and 4.7 both scored 28% on our benchmark; I just wanted to have 10 things in the list because I wanted a round number.
Where is the weird part?