Google has the leading model on pretty much every metric

Correction: Google had the leading model for three weeks. Today it’s back in second place.

press X to doubt

o3-mini wasn't even in second place for non-STEM tasks, and in today's announcement they don't even publish benchmarks for those. What's impressive about Gemini 2.5 Pro (and was also really impressive with R1) is how good the model is across a very broad range of tasks, not just benchmaxing on AIME.

I had a philosophical discussion with the o3 model earlier today. It was much better than 2.5 Pro. In fact, it was pretty much what I would expect from a professional philosopher.

I'm not expecting someone paying $200 a month to access something to be objective about that particular something.

Also, “what I would expect from a professional philosopher”: is that really your argument?

I’m paying $20/mo, and I’m paying the same for Gemini and for Claude.

What’s wrong with my argument? You questioned the performance of the model on non-STEM tasks, and I gave you my impression.

Writing philosophy that looks convincing has been something LLMs do well since the first release of ChatGPT back in 2022 (in my country, back in early 2023, TV featured a kind of competition between ChatGPT and a philosopher turned media personality, with university professors blindly reviewing both essays and attempting to determine who wrote which).

To get an idea of how good a model is at non-STEM tasks, you need to challenge it on things that are harder for LLMs, like summarization without hallucination or creative writing. OpenAI's non-thinking models are usually very good at these, but their thinking models are not, whereas other players (be it Google, Anthropic, or DeepSeek) manage to make models that are very good at both.

I've been discussing a philosophical topic (brain uploading) with all the major models over the last two years. This is a topic I've read and thought about for a long time. Until o3, the responses I got from all other models (Gemini 2.5 Pro most recently) were underwhelming: generic, high-level, not interesting to an expert. They struggled to understand the points I was making and the ideas I wanted to explore. o3 was the first model that could keep up and provide interesting insights. It communicated on the level of a professional in the field, though not an expert on this particular topic. That is a significant improvement over all existing models.