Am I the only one flagging inconsistencies across the different evaluations on the 18 benchmarks? Why is the closed frontier model sometimes Grok and other times Opus 4.8? And why is it compared to GLM 5.2 in some cases and Kimi 2.6 in others?