MMLU-Pro:

Gemini-3.1-Pro at 91.0

Opus-4.6 at 89.1

GPT-5.4, Kimi2.6, and DS-V4-Pro tied at 87.5

Pretty impressive

Funny how Gemini is theoretically the best -- but in practice, the bugs in the interface mean I don't want to use it anymore. The worst is that it forgets context (and lies about it), and it's very unreliable at reading PDFs (and lies about that too). There's also no branching, so once the context is lost or polluted, you have to start the project over and rebuild the context from scratch.

The sheer number of bugs and the lack of meaningful improvement in Google products are a clear counterargument to the AI bull thesis.

If AI were so good at coding, why can't it actually produce a usable Gemini/AI Studio app?

I think Google might just be institutionally incapable of making good UX

Most of these tests are one-prompt in nature. I've also noticed issues with Gemini's PDF reader, which was very frustrating, although it's significantly better now than it was even two weeks ago. Meanwhile, GPT-5 has started giving me issues instead.

In my experience, Gemini is the most insightful model for hard problems (particularly math problems that I work on).

I gave up on Gemini 3.1 Pro in VSCode after 2 hours. They fully refunded me.

Yeah if I could use Gemini with pi.dev that would be my choice. But Gemini CLI is just so, so bad.