ELO scores for OCR don't really make much sense - it's trying to reduce accuracy to a single voting score without any real quality-control on the reviewer/judge.
I think a more accurate reflection of the current state of these models would be a real-world benchmark with messy/complex docs across industries and languages.
It is missing both models that I mentioned, so yes, I would say one reason it is not accurate is because it is so incomplete.
It also doesn't provide error bars on the ELO, so models with only tens of battles are listed alongside models with thousands of battles, with no indication of how confident those ratings are, which I find rather unhelpful.
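To make the battle-count point concrete, here's a minimal sketch (not tied to any specific leaderboard, with made-up win rates and opponent ratings) of bootstrapping a confidence interval on an Elo estimate from raw win/loss outcomes. The gap between tens and thousands of battles is large:

```python
# Sketch only: illustrative numbers, not real leaderboard data.
import random
import math

def elo_from_winrate(wins, total, opponent_elo=1500):
    """Estimate Elo against an opponent pool of known average rating."""
    # Clamp to avoid log(0) when a small sample is all wins or all losses.
    p = min(max(wins / total, 1e-3), 1 - 1e-3)
    return opponent_elo + 400 * math.log10(p / (1 - p))

def bootstrap_elo_ci(outcomes, iters=2000, alpha=0.05):
    """Percentile bootstrap CI on the Elo estimate from a list of 0/1 outcomes."""
    n = len(outcomes)
    estimates = []
    for _ in range(iters):
        sample = [random.choice(outcomes) for _ in range(n)]
        estimates.append(elo_from_winrate(sum(sample), n))
    estimates.sort()
    lo = estimates[int(alpha / 2 * iters)]
    hi = estimates[int((1 - alpha / 2) * iters) - 1]
    return lo, hi

random.seed(0)
true_p = 0.6  # hypothetical "true" win rate of the model under test
for n_battles in (30, 3000):
    outcomes = [1 if random.random() < true_p else 0 for _ in range(n_battles)]
    lo, hi = bootstrap_elo_ci(outcomes)
    print(f"{n_battles:>5} battles: 95% CI spans ~{hi - lo:.0f} Elo points ({lo:.0f} to {hi:.0f})")
```

With ~30 battles the interval spans a few hundred Elo points, which is easily enough to reorder most of a leaderboard; with thousands it narrows to a few tens. Listing both side by side without that context is what makes the single number misleading.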
A lot of these models are also sensitive to how they are used, and many offer multiple ways of being invoked; it's not clear how the leaderboard is calling them.
That leaderboard is definitely one of the ones that leaves a lot to be desired.