> As proof, ABP with opus 4.6 as the driver scores 90.5% on the Online Mind2Web benchmark
And what does opus score with "regular" browser harnesses?
https://huggingface.co/spaces/osunlp/Online_Mind2Web_Leaderb...
Hm, I can't see Opus 4.6 on there.
I tweeted at OSUNLP, and they're backed up on eval validation. In the meantime, here's the benchmark repo with the saved runs and instructions for running it locally: https://github.com/theredsix/abp-online-mind2web-results
90% easy or 90% average?
90% average, with 85.51% on hard!
Nice! Will take a look at this for my homelab - was debating using crawl.cloudflare.com to try it out, as browser rendering was my next stretch goal.