Same here, 2.5 Pro is very good at coding. But it's also cocky and blames everything but itself when something isn't working. E.g. "the linter must be wrong, you should reinstall it", "looks to be a problem with the Go compiler", "this function HAS to exist, that's weird that we're getting an error".
And it often just stops like “ok this is still not working. You fix it and tell me when it’s done so I can continue”.
But for coding: Gemini Pro 2.5 > Sonnet 3.5 > Sonnet 3.7
Weird. For me, Sonnet 3.7 is much more focused and in particular works much better at finding the places that need changes and at using other tooling. I guess the integration in Cursor is just much better and more mature.
This. Sonnet 3.7 is a wild horse. Gemini 2.5 Pro is like a 33 yo expert. o1 feels like a mature, senior colleague.
I find that Gemini 2.5 Pro tends to produce working but over-complicated code more often than Claude 3.7, which might be a side effect of the reasoning.
In my experience, whenever these models solve a math or logic puzzle with reasoning, they generate extremely long and convoluted chains of thought that show up in the solution. In contrast, a human would come up with a solution in 2-3 steps. Perhaps something similar is going on here with the generated code.
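To illustrate the flavor of what I mean, here's a made-up Python sketch (not actual model output, just a hypothetical example): the first version is the kind of step-by-step, over-engineered solution a reasoning model often produces; the second is the 2-3 step version a human would typically write.

```python
from collections import Counter, defaultdict

# Hypothetical "model-style" solution: correct, but re-implements
# tokenization and counting by hand, one manual step at a time.
def word_counts_convoluted(text):
    tokens = []
    buffer = []
    for ch in text:
        if ch.isspace():
            if buffer:
                tokens.append("".join(buffer))
                buffer = []
        else:
            buffer.append(ch.lower())
    if buffer:
        tokens.append("".join(buffer))
    counts = defaultdict(int)
    for token in tokens:
        counts[token] += 1
    return dict(counts)

# "Human-style" solution: same result in one step.
def word_counts_simple(text):
    return dict(Counter(text.lower().split()))

assert word_counts_convoluted("a b A") == word_counts_simple("a b A")
```

Both are correct, but the first one reads like the long chain of thought leaked into the code.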