Surprisingly, LLMs are actually much worse at reasoning in Python than other common programming languages for agentic coding tasks.
Data here: https://gertlabs.com/rankings?mode=agentic_coding
Hah, I was just thinking that Python has a vast ocean of training data, but it's likely of lower quality, since much of it is written by beginners and people who aren't primarily programmers.
That's what I'm thinking too. There is a lot of noise and I know teams where the majority of the people writing Python just have no idea what they're doing.
I'm working with Clojure which is used mostly by senior engineers and it still blows my mind how well Claude writes software in it even though it's a fringe language. It's even able to pick up in-house DSLs written with macros.
Having used Python on and off for 20 years, my experience with LLMs writing Python has been mixed. I don’t think that’s necessarily because of a low-quality dataset, but rather because Python’s applications are so broad and the language has gone through several paradigm shifts over time: sync vs. async, typed vs. untyped, scientific Python looking very different from web application code, some people really wishing it were an FP language, and others doing the clean-architecture OOP onion soup. It has gotten so fragmented.
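To make the fragmentation concrete, here's the same trivial fetch written in two of those coexisting dialects (an illustrative sketch; the async version assumes the third-party aiohttp package):

    # "classic" sync, untyped
    import urllib.request

    def fetch(url):
        return urllib.request.urlopen(url).read()

    # modern async, typed (assumes aiohttp is installed)
    import aiohttp

    async def fetch_async(url: str) -> bytes:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                return await resp.read()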
Recently, I had a more pleasant experience using LLMs with Go. It reminds me a bit of Python 2.x, when the community seemed, in my view, more focused on embracing a stupid simple language, with everyone trying to write roughly similar "Pythonic" code.
> Having used Python on and off for 20 years, my experience with LLMs writing Python has been mixed. I don’t think that’s necessarily because of a low-quality dataset, but rather because Python’s applications are so broad and the language has gone through several paradigm shifts over time
If there’s one language that is the prime example of this, it’s C++, and according to this benchmark it ranks incredibly high.
I’m also thoroughly confused why Kimi 2.6 scores 83% while Opus 4.7 scores 67% for C++; GPT-5.5 isn’t even in the top 10.
Gemma 4 31B scores 100% success rate for Python (!!) while Opus 4.6 only 65%.
This benchmark really seems to be all over the place and doesn’t make sense.
That was the hardest part of learning PHP: all the code examples online were just awful.
Worked on a PHP project once. Every time I asked why something was done a certain way the answer was "dunno, we copy pasted this code snippet."
Certain popular PHP codebases appear to use a similar methodology.
Reminds me of the time I asked Claude to write some Wordpress code for me. The results were…rough.
I was (pleasantly) surprised by Claude Code doing Raku, also with a limited training set (~2,000 Stack Overflow questions, a bunch of Rosetta Code entries, 2,500 modules). I put this down to the quality of the code from the core community, who are all frankly uber-gremlins.
Yeah Raku feels so expressive and lovely to me with the help of an AI assistant. I've only done toy programs and scripts with it but it is actually so nice.
All my vibe-coded projects (personal) are Go backend services with a TypeScript/React frontend. And my thinking was based on similar considerations, like why I wouldn't use PHP for that either.
There's a broken idea that AIs know Python because they're written in Python.
Not how any of it works.
Not what anyone was talking about. Training corpus ≠ inference engine.
While recent models are capable of generalizing to any language at this point, I do think there are weights from their pretraining corpus that still leak through into how they create their responses. We observed similar language performance patterns across models from different providers, btw.
I’m super surprised that C++ scores so high; this does not match our experience at all, and for anything performance-critical it always drops the ball completely.
I also don’t understand how these “games” map to real-world complex problems. How are you measuring success? How does “adversarial customer service” map to “this LLM is better at C++ than the other”? How are you sure you’re not just benchmarking language suitability for a problem?
I have so many questions about this…
- Most of the environments can be played in a mode where the agent writes code to drive the environment towards a goal (rough sketch below). So the model is problem solving, and it has to do so in a particular language, and some languages outperform others. We have a lot of data to back up the improved compiled-language performance, but note these are for successful code submissions (failures are counted in a different metric). With the Languages chart we're more so measuring how good the ideas they came up with were, once they already compiled/didn't fail basic environment rules.
- You need to run evals at scale to converge on this kind of behavior: these benchmarks run samples across a pool of hundreds of different types of environments
- Some games are too open-ended to support code play. The customer service game is an example of that, where models are called on every tick of the environment to make a decision (that's the 'decision making' part of the evals which is weighted lowest). Very interesting results but not testing coding ability, just general reasoning.
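Schematically, the code-play loop looks something like this (a simplified sketch with hypothetical names, not our actual harness API):

    def run_episode(env, agent):
        # model submits a program once, up front
        program = agent.write_program(env.describe())
        try:
            env.load(program)  # compilation / game-rule / sandbox checks
        except RuntimeError:
            return {"hard_fail": True}  # counted against success rate
        while not env.done():
            env.tick()  # the submitted code drives the environment
        # final score feeds the percentile metric
        return {"hard_fail": False, "score": env.score()}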
Not sure what issues you have with models writing C++ vs other languages, but I can imagine all sorts of C++ specific bottlenecks not directly related to the model's ability to reason in the language, like the dependencies, verbosity, extra effort to manage memory, etc. I have only done a little C/embedded work since agentic coding happened but I was pleasantly surprised.
I've found the current cream of the crop to be quite good at resource management. I've sic'd Opus on some very gnarly lambda context bugs and it has directly improved the stability of the product I'm working on right now in a very substantial way. It couldn't quite do it entirely by itself, but with the right nudges here and there, it has absolutely accelerated the debugging work. It is particularly good at analyzing crashes and piecing together the detective work of what preconditions must exist for certain crashes to occur.
I think my problem is that I’m not sure I understand whether your evals are testing language abilities or reasoning abilities.
It seems to present results as if they’re testing language abilities, but the problems seem to be reasoning problems.
I would love to see how they do with functional languages, and especially Lisps, here. I've noticed pretty good performance with Emacs Lisp relative to overall model strength, but I haven't used LLMs to write application code in any such languages.
It would also be interesting to see how Python compares to other languages in its niche (Ruby, Perl, Raku).
Thanks for putting this together! It's interesting.
I've noticed that with clojure(script) unless you specifically instruct them to keep nesting levels low, they can hit a point where they make a paren placement error and can't debug their way out of it. Although in my case while one model made the error then couldn't find what it had done, a second model that I switched to was then able to identify it and back it out. So I suspect this is a transient weakness in today's models, not something fundamental.
That's a good idea. Would you rather see Lisp or Scala? Any interest in Prolog? We are trying to be selective to keep the data concentrated, but we will eventually add a couple more, most likely to sample different programming paradigms.
I think Clojure would probably make for a more interesting comparison because its syntax is more different from the other languages currently on there and it's less multi-paradigm than Scala is (it doesn't support OOP, it's more explicitly immutable-first). I think Scala is a lovely and cool language, but I'd be more interested in the Clojure comparison here.
Prolog might be interesting because I bet nobody is trying to train very hard on it, but I'm less directly interested in model performance with Prolog.
If you are taking requests, I was hoping to see Clojure on there.
My spider sense tells me the immutable-ness would help with correctness, but I'm not sure how much difference it would make in practice. Would love to see some numbers.
A relative lack of training data might have a bigger effect though.
Just last night I was going down the rabbit hole of "what's the best programming language to use for vibe coding." I came to a short list of:
a) Typed Racket
b) OCaml
c) Julia
I would love to see those three added to your benchmarks. And Mistral Medium 3.5 added to the LLM list, please.
Thanks for the recs, we will look into adding some of these, maybe OCaml for variety. I'm not familiar with Racket.
Mistral Medium 3.5 is on there, but you will have to scroll down pretty far to find it (does not perform well): https://gertlabs.com/rankings?mode=oneshot_coding
Racket is a variety of Scheme that grew up as a teaching language, but it now has a few other notable niches as well.
Typed Racket is to Racket as TypeScript is to JavaScript: it adds some additional static checks to an otherwise dynamic language via gradual typing. This pair of languages might help begin to answer the question "does gradual typing generally help LLMs, or does TypeScript outperform JavaScript for incidental reasons?".
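For what it's worth, Python is itself gradually typed in the same sense: annotations are optional, ignored at runtime, and only enforced by an external checker such as mypy, so both definitions below are equally valid programs:

    def area(w, h):  # dynamic: nothing checked until it runs
        return w * h

    def area_typed(w: float, h: float) -> float:  # mypy can verify call sites
        return w * h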
Among Lisps, I'm most interested in seeing Clojure because it's a language I can see myself using with LLMs at work. But Typed Racket and Racket could make an especially interesting pair because of the gradual typing thing.
I'm not sure whether you want to include them in your project. The kind of selectivity you describe yourself as going for is hard for me, especially since I'm not the one doing the work. :)
PS: Aside from this benchmarking and comparison project: Racket is an interesting language and seems like a good place to start if you want to explore classic Scheme texts (Structure and Interpretation of Computer Programs, The Little Schemer, How to Design Programs) or newer ones that try to teach newer or more specialized ideas (e.g., The Little Typer). You may have to tweak the language a bit to stay faithful to some of those books, but that's something Racket is good at and there are already sources noting relevant differences online.
When a non-programmer in my life expressed curiosity about programming, we ended up starting HtDP together and it's been fun. I think Racket was a good choice for that.
Thanks for that, I hadn't scrolled down far enough.
Just want to be sure I'm reading the results correctly... When I compare GPT-5.5 with Mistral Medium 3.5, I see in the tables:
a) Mistral beats GPT in Java and C++
b) It's close for Rust
c) GPT-5.5 easily wins for Go, Javascript, Python and Typescript
Model choice really does appear to be language dependent (assuming I'm reading the results correctly).
The deeper you go into the filters (single models, cross-correlated by specific languages), the smaller your sample sizes. A known limitation; tbh I doubt Mistral is better than GPT-5.5 at programming in any specific language, and it probably just hit a few lower-quality generations by GPT-5.5 by chance (but I could be wrong! We're always adding more samples, so the data improves over time. We always prioritize the largest sample counts for near-frontier models first).
What's going on with Qwen3.6 27b? Filtered to Python it comes out at the top of the list, which seems... well, unlikely.
While Qwen3.6 27B and 35B-A3B are very good, I am skeptical about them being that good. I think another factor is at play here.
The Qwen3.6 models have memorized some common games. For example, if you ask it to create an index.html with a snake game, it will generate almost the same high quality snake game every time. The relatively low success rate of 25% but high average percentile of almost 100% for one-shot coding in Python suggests that the model is extremely good at few tasks.
Qwen3.6 27b is a really strong model.
Yeah but that strong?
Yes, that strong. It's only lacking in context length (though it's not that small there), and it gets caught in circles more often than, say, a 1T-parameter model does.
That's why a lot of people have been freaking out about local LLMs since April. There's finally a decent model that runs locally on a GPU or two and can do agentic programming at reasonable tokens per second.
Those are some fine languages, but how did you pick them? What was the criterion?
The initial criteria were strongly typed and functional-first. Using an LLM for answers, of course, which returned a list that looked like:
- Haskell
- OCaml
- F#
- Scala
- Gleam
- Purescript
- Grain
- Idris
Then I asked if there were any Schemes or Lisps that met the initial requirements, which added a bunch more options (Typed Racket, Typol, Elm, ReScript, etc.).
Then I asked about Julia specifically, as it's a language I'm already reasonably familiar with and knew that it's possible to write it with static annotations.
Next I started filtering the list based on additional criteria: not compiling to a JS target, performance, size of package ecosystem, tooling, community, and learning curve (I do want to review and understand the output).
There were a bunch of follow-up questions over a few hours of prompting, reading and a couple of beers. All this resulted in the shortlist of OCaml, Typed Racket and Julia.
Julia pretty much remains in there, even though it doesn't really meet the initial strongly-typed criterion, based on my familiarity with it, its ecosystem (especially for AI/ML tasks), and performance.
I know zero about OCaml and find the thought of learning it a bit daunting. Typed Racket seems more approachable anyway.
I just did a side-by-side of Claude Code writing Python vs. Raku for DSL use: https://slangify.org if you are interested.
What would comparing rates across languages tell us in the context of this benchmark? Are the tasks the same, or robustly difficulty-normalized, across the languages?
Also, somehow the two language-comparison graphs (avg percentile and success rate) rank Python in dramatically different positions, with Python outranking Rust and Java in success rate. What does the avg percentile mean in this context?
Success rate measures the share of code submissions that played the game/environment without failing (compilation, breaking game rules, violating the sandbox, etc.), so it makes sense Python would do better there.
Percentile compares only the submissions that didn't hard-fail. So they are a bit different, and we incorporate them both into the combined score.
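Schematically, the blend is something like this (illustrative weight only, not our exact formula):

    def combined_score(success_rate, avg_percentile, w=0.5):
        # success_rate: fraction of submissions that didn't hard-fail
        # avg_percentile: quality ranking among surviving submissions
        # w = 0.5 is a made-up weight, purely for illustration
        return w * success_rate + (1 - w) * avg_percentile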
> Data here: https://gertlabs.com/rankings?mode=agentic_coding
Oh wow, we got "tribal domination", "market simulator" and "adversarial customer service". I don't know what those are but it sure sounds like big torment nexus milestones
Maybe we could at least play nicer games like hackenbush and act surprised when there's some wicked use-case that's isomorphic.
EDIT: Ok fine. I like "Rubik's Cube Chess" a lot. Never heard of it, is this analyzed formally at all? Hard to search for since there's tons of collisions
The LLMs are generally still pretty bad at (deductive) reasoning. IME they go along more with things like variable names and comments than with the actual program logic (it would be an interesting experiment to compare an LLM's understanding of three identical programs with different identifiers: one with normal identifiers, one with obfuscated identifiers, and one with deliberately misleading identifiers). I also think this particular comparison comes down to typing, which helps keep the LLM's reasoning from going astray.
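Concretely, the three variants could be as simple as this toy example:

    # 1. normal identifiers
    def fahrenheit_to_celsius(f):
        return (f - 32) * 5 / 9

    # 2. obfuscated identifiers
    def g(a):
        return (a - 32) * 5 / 9

    # 3. deliberately misleading identifiers
    def celsius_to_fahrenheit(meters):
        return (meters - 32) * 5 / 9

    # identical logic in all three; the question is whether the model's
    # explanation tracks the code or the names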
When we reason, we typically need to propagate constraints to arrive at a solution that satisfies them. I think the best language to reason in could be something like Lean, which allows both constraints and actual code to be expressed at the same time. Although this might not be the case for current LLMs, as I explain above.
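For example, in Lean the code and a machine-checked claim about it sit side by side (a minimal sketch):

    def double (n : Nat) : Nat := n + n

    -- the constraint lives next to the code and is checked, not assumed
    theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
      unfold double; omega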
Wait till you look inside a neural network and realize they're incapable of deductive reasoning! Amazing how many devs who talk about "AI" would probably have a hard time telling deductive and inductive reasoning apart.
That's actually untrue. Yes, training a neural network is mostly an inductive process. However, the ability of LLMs to reason deductively (as a chain of thought, although that's probably not the only mechanism) is an emergent phenomenon, arising from training on data and problems that exhibit deductive reasoning.
But of course, because the deductive reasoning is inductively taught, there may be shortcuts that compromise its soundness. Hence my claim: LLMs are not as good at it as other algorithms, although they have many other strengths that make up for it.
How so?
Cool to see my hunch backed by data. Python is a scripting language with OOP bolted on, which means there isn't really the styling consistency other languages have; things tend to look like PHP: a collection of various scripts that invoke one another.
Python was designed with objects in mind from day one.
"Designed" is doing a lot of work here. There are clearly bits that are just bolted on because they didn't want to change the syntax.
EVERYTHING in Python is an object. I’m not sure how that could have been bolted onto the language.
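Even literals and builtins are objects with classes and methods at runtime:

    >>> (255).to_bytes(1, "big")  # an int literal has methods
    b'\xff'
    >>> isinstance(len, object) and isinstance(int, object)
    True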
My feeling is that for agentic tasks it's not only language design but also LSPs, error messages, and static-analysis capabilities that dominate these benchmarks. It would IMHO be interesting to look into better subsets of Python and style/rewrite techniques, as well as alternative linters and their effects on performance.
A strict compiler is basically a free feedback loop for the LLM.
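Something like this, schematically (the compiler invocation is real; the model call is a hypothetical placeholder):

    import subprocess

    def revise_with_llm(path, compiler_errors):
        """Hypothetical placeholder: ask the model to rewrite the file."""
        ...

    def build_loop(path, max_rounds=3):
        for _ in range(max_rounds):
            result = subprocess.run(["go", "build", path],
                                    capture_output=True, text=True)
            if result.returncode == 0:
                return True  # compiler satisfied: free ground truth
            # the exact error text goes straight back to the model
            revise_with_llm(path, result.stderr)
        return False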
Also the human. (I like being told about my bugs when I write them, instead of at some generally much more unpleasant moment in the future.)
But then why does JS score 50% better? (Almost identical to TypeScript.)
Actually, JS can get a surprising amount of "intellisense" as well. Not sure if that was used here though.
Huh. This surprises me. Digging in, it looks like it comes down to interpreted + dynamically typed vs. compiled + statically typed.
TIL. If I were to start a truly vibe-coded project, Go would have a significant leg up.
And yet dynamically typed Elixir wipes the floor with Go.
https://github.com/Tencent-Hunyuan/AutoCodeBenchmark/blob/ma...
LLMs get ridiculous with elixir, especially with the repl, runtime, and ability to hot reload / directly test functions. It's really surprising to me it hasn't caught on more but I guess you have to see it to believe it.
Built my startup in Elixir and can concur. Elixir has a relatively consistent syntax that makes it a pretty good target for LLMs.
In my opinion, the only thing holding Elixir back as an LLM deliverable is that there's not as much training data for LLMs to work with.
Of course, if we had a new AI that could be trained on a minimum of existing training data, Common Lisp would absolutely beat out everything else. Everything you mentioned about Elixir (REPL, runtime, and the ability to hot reload / directly test functions) is possible in Lisp, and was invented there, with an AST rather than a syntactic language as the ultimate build artifact. CL lets you recover from exceptions and rewind the stack before reloading your fixes and continuing. I can't even fathom the workflows an LLM could conceive of when working with that.
Mm, the code is constrained to run inside a game 'tick'?
I thought it might have to do with the type system, but JavaScript's type system is atrocious and it scores about 50% higher. So my theory does not make much sense.
Hey, they said it had a lot of training data, not necessarily high-quality Python code training data.
This surprised me, but I can understand it - Python sucks in many ways lol.
My standard joke here:
Q: Say, what does this Python code do?
A: Nobody f&%^ing knows.
That’s Perl.