There are quite a few comments here about benchmark and coding performance. I would like to offer some opinions regarding its capacity for mathematics problems in an active research setting.
I have a collection of novel probability and statistics problems at the masters and PhD level with varying degrees of feasibility. My test suite involves running these problems through first (often with about 2-6 papers for context) and then requesting a rigorous proof as followup. Since the problems are pretty tough, there is no quantitative measure of performance here, I'm just judging based on how useful the output is toward outlining a solution that would hopefully become publishable.
Just prior to this model, Gemini led the pack, with GPT-5 as a close second. No other model came anywhere near these two (no, not even Claude). Gemini would sometimes have incredible insight for some of the harder problems (insightful guesses on relevant procedures are often most useful in research), but both of them tend to struggle with outlining a concrete proof in a single followup prompt. This DeepSeek V4 Pro with max thinking does remarkably well here. I'm not seeing the same level of insights in the first response as Gemini (closer to GPT-5), but it often gets much better in the followup, and the proofs can be _very_ impressive; nearly complete in several cases.
Given that both Gemini and DeepSeek also seem to lead on token performance, I'm guessing that might play a role in their capacity for these types of problems. It's probably more a matter of just how far they can get in a sensible computational budget.
Despite what the benchmarks seem to show, this feels like a huge step up for open-weight models. Bravo to the DeepSeek team!
They have had the best math models for about a year most folks just didn't know about it. You can't find inference on APIs, but I run these at home, this is also the advantage of open models.
https://huggingface.co/deepseek-ai/DeepSeek-Math-V2 https://huggingface.co/deepseek-ai/DeepSeek-Prover-V2-671B
Vertex AI has had deep seek available via API for a while
When you say "Gemini", which exact model do you mean? You know there are several and they vary a lot in how capable they are? Pro 3.1 Preview, 2.5 Pro (their latest non-preview pro model), Flash 3 Preview, ...
Same with GPT-5: Latest 5.5, prior 5.4, or actually the original 5 (.0)?
You can't talk about model performance without specifying the exact model.
High bet on 3.1 pro. I use it a lot for math and classic engineering, it's very strong.
I reviewed how DeepSeek V4-Pro, Kimi 2.6, Opus 4.6, and Opus 4.7 across the same AI benchmarks. All results are for Max editions, except for Kimi.
Summary: Opus 4.6 forms the baseline all three are trying to beat. DeepSeek V4-Pro roughly matches it across the board, Kimi K2.6 edges it on agentic/coding benchmarks, and Opus 4.7 surpasses it on nearly everything except web search.
DeepSeek V4-Pro Max shines in competitive coding benchmarks. However, it trails both Opus models on software engineering. Kimi K2.6 is remarkably competitive as an open-weight model. Its main weakness is in pure reasoning (GPQA, HMMT) where it trails Opus.
Speculation: The DeepSeek team wanted to come out with a model that surpassed proprietary ones. However, OpenAI dropped 5.4 and 5.5 and Anthropic released Opus 4.6 and 4.7. So they chose to just release V4 and iterate on it.
Basis for speculation? (i) The original reported timeline for the model was February. (ii) Their Hugging Face model card starts with "We present a preview version of DeepSeek-V4 series". (iii) V4 isn't multimodal yet (unlike the others) and their technical report states "We are also working on incorporating multimodal capabilities to our models."
I feel like people suck at promoting Opus. Baseline, it's pretty on par with GPT 5.5.
But if you prompt it well - give it the reasoning behind why you're asking it to do something - it pulls far ahead.
Have you also tried the Pro versions of ChatGPT and Gemini (Deep Think)?
Very interesting. I wonder how much of this is due to the context length. I am unclear on the implementation strategy, you ran this problem as a 1-shot using chat mode, or using each on an agent harness?
Has nothing to do with context length, they have experience training math models, they have a model that would take gold in IMO and a lean prover. Both have been out for almost a year.
Wondering how gpt 5.5 is doing in your test. Happy to hear that DeepSeek has good performance in your test, because my experience seems to correlate with yours, for the coding problems I am working on. Claude doesn't seem to be so good if you stray away from writing http handlers (the modern web app stack in its various incarnations).
Very cool to hear there is agreement with (probably quite challenging?) coding problems as well.
Just ran a couple of them through GPT 5.5, but this is a single attempt, so take any of this with a grain of salt. I'm on the Plus tier with memory off so each chat should have no memory of any other attempt (same goes for other models too).
It seems to be getting more of the impressive insights that Gemini got and doing so much faster, but I'm having a really hard time getting it to spit out a proper lengthy proof in a single prompt, as it loves its "summaries". For the random matrix theory problems, it also doesn't seem to adhere to the notation used in the documents I give it, which is a bit weird. My general impression at the moment is that it is probably on par with Gemini for the important stuff, and both are a bit better than DeepSeek.
I can't stress how much better these three models are than everything else though (at least in my type of math problems). Claude can't get anything nontrivial on any of the problems within ten (!!) minutes of thinking, so I have to shut it off before I run into usage limits. I have colleagues who love using Claude for tiny lemmas and things, so your mileage may vary, but it seems pretty bad at the hard stuff. Kimi and GLM are so vague as to be useless.
My work is on a p2p database with quite weird constraints and complex and emergent interactions between peers. So it's more a system design problem than coding. Chatgpt 5.x has been helping me close the loop slowly while opus did help me initially a lot but later was missing many of the important details, leading to going in circles to some degree. Still remains to be seen if this whole endeavour will be successful with the current class of models.
Curious to know what kind of problems you are talking about here
I don't want to give away too much due to anonymity reasons, but the problems are generally in the following areas (in order from hardest to easiest):
- One problem on using quantum mechanics and C*-algebra techniques for non-Markovian stochastic processes. The interchange between the physics and probability languages often trips the models up, so pretty much everything tends to fail here.
- Three problems in random matrix theory and free probability; these require strong combinatorial skills and a good understanding of novel definitions, requiring multiple papers for context.
- One problem in saddle-point approximation; I've just recently put together a manuscript for this one with a masters student, so it isn't trivial either, but does not require as much insight.
- One problem pertaining to bounds on integral probability metrics for time-series modelling.
Regarding the first problem: are you looking at NCP maps for non-Markovian processes given you mention C*-algebra? Or is it more of a continuous weak monitoring of a stochastic system that results in dynamics with memory effects?
I'd be very curious to know how any LLMs fare. I completely understand if you don't want to continue the discussion because of anonymity reasons.
It would be wonderful to have a deeper insight, but I understand that you can disclose your identity (I understand that you work in applied research field, right ? )
Yes, I do mostly applied work, but I come from a background in pure probability so I sometimes dabble in the fundamental stuff when the mood strikes.
Happy to try to answer more specific questions if anyone has any, but yes, these are among my active research projects so there's only so much I can say.
Thanks a lot for your kind but detailed answer. I’m no more in the research field but you gave me good ideas to work on