That's Grok 4.2, not 4.3, right?
And why are you comparing to gpt-4.1? (As opposed to one of the 6 or so model releases since then. I would have expected gpt-5.5.)
Good catch, there was an issue with the second hardest thing in programming (caching).
Here's an updated eval with the proper models: https://a3bmfqfom3.evvl.io/