Seems like a bunch of noise. What does this even mean?
It sounds like you're saying "Actually you, as a human, are simply not smart enough to evaluate Opus 4.8"
No, it’s this: evaluating these systems is complex, and there’s a reason why sociology, cognitive psychology, medicine, etc. are all done under careful double-blind conditions with preregistered tests. It’s not that humans are not smart enough; as I said, human evaluations are incredibly important. And yet they are a minefield of biases you have to worry about and correct for.
- evaluations need to be done at the same time to avoid drift in your bias
- you need to worry about your test set: which questions are you asking? How many of them? Are they representative of your work?
- which one did you do first? Raters tend to be biased by order, in one direction or another
- you also know the label! You know which model is which! This biases your assessment…
And on and on and on. Careful science exists for a reason.
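
To make a couple of those points concrete, here is a rough sketch of what hiding the labels and randomizing the order could look like for a side-by-side comparison. Everything in it is made up for illustration: the function names, the `model_x` / `model_y` placeholders, and the data layout are assumptions, not anyone's actual eval harness. It also does nothing about picking a representative, preregistered question set; that part you still have to do yourself.

```python
import random

def make_blinded_trials(prompts, outputs_x, outputs_y, seed=0):
    """Build rating trials that hide which model wrote which response.

    Returns (trials, key). Only `trials` is shown to the rater; `key`
    records the hidden assignment and is kept aside until rating is done.
    """
    rng = random.Random(seed)
    order = list(range(len(prompts)))
    rng.shuffle(order)  # randomize question order
    trials, key = [], []
    for i in order:
        swap = rng.random() < 0.5  # randomize which model is shown first
        if swap:
            first, second = outputs_y[i], outputs_x[i]
            key.append(("model_y", "model_x"))
        else:
            first, second = outputs_x[i], outputs_y[i]
            key.append(("model_x", "model_y"))
        trials.append({"prompt": prompts[i], "first": first, "second": second})
    return trials, key

def unblind(choices, key):
    """Map the rater's 'first'/'second' picks back to models and count wins."""
    wins = {"model_x": 0, "model_y": 0}
    for choice, (first_model, second_model) in zip(choices, key):
        wins[first_model if choice == "first" else second_model] += 1
    return wins
```

The rater only ever sees `trials`, both models are rated in the same sitting on the same questions, and `unblind` is called only after every choice is recorded.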