There was, let's say, significant skepticism the last time they announced something. What's changed?

I have no idea whether the evaluator itself is trustworthy, but it was supposedly independently evaluated by Appen: https://www.appen.com/whitepapers/benchmarking-subquadratics...