Read the first few comments and was surprised I didn't see it mentioned: training data. The voluminous amount of Python in the training data.
I could write in Brainfuck with AI, but I presume I wouldn't get the same results as with Python.
My follow-up question: with AI now, why care about a lang until you need to?
Surprisingly, LLMs are actually much worse at reasoning in Python than other common programming languages for agentic coding tasks.
Data here: https://gertlabs.com/rankings?mode=agentic_coding
Hah, I was just thinking that Python likely has a vast ocean of training data, but it's likely of lower quality, since much of it is written by beginners and those who aren't primarily programmers.
That's what I'm thinking too. There is a lot of noise and I know teams where the majority of the people writing Python just have no idea what they're doing.
I'm working with Clojure which is used mostly by senior engineers and it still blows my mind how well Claude writes software in it even though it's a fringe language. It's even able to pick up in-house DSLs written with macros.
Having used Python on and off for 20 years, my experience with LLMs writing Python has been mixed. I don’t think that’s necessarily because of a low-quality dataset, but rather because Python’s applications are so broad and the language has gone through several paradigm shifts over time: sync vs. async, typed vs. untyped, scientific Python looking very different from web application code, some people really wishing it were an FP language, and others doing the clean-architecture OOP onion soup. It has gotten so fragmented.
Recently, I had a more pleasant experience using LLMs with Go. It reminds me a bit of Python 2.x, when the community seemed, in my view, more focused on embracing a stupid simple language, with everyone trying to write roughly similar "Pythonic" code.
> Having used Python on and off for 20 years, my experience with LLMs writing Python has been mixed. I don’t think that’s necessarily because of a low-quality dataset, but rather because Python’s applications are so broad and the language has gone through several paradigm shifts over time
If there’s one language that is the prime example of this, it’s C++, and according to this benchmark it ranks incredibly high.
I'm also thoroughly confused why Kimi 2.6 scores 83% while Opus 4.7 scores 67% for C++, and GPT 5.5 isn't even in the top 10.
Gemma 4 31B scores 100% success rate for Python (!!) while Opus 4.6 only 65%.
This benchmark really seems to be all over the place and doesn’t make sense.
That was the hardest part of learning PHP, all the code examples online were just awful.
Worked on a PHP project once. Every time I asked why something was done a certain way the answer was "dunno, we copy pasted this code snippet."
Certain popular PHP codebases appear to use a similar methodology.
Reminds me of the time I asked Claude to write some Wordpress code for me. The results were…rough.
I was (pleasantly) surprised by Claude Code doing Raku, also with a limited training set (~2,000 Stack Overflow questions, a bunch of Rosetta Code, ~2,500 modules). I put this down to the quality of the code from the core community, who are all frankly uber-gremlins.
Yeah Raku feels so expressive and lovely to me with the help of an AI assistant. I've only done toy programs and scripts with it but it is actually so nice.
All my vibe-coded (personal) projects are Go backend services with a TypeScript/React frontend. And my thinking ran along similar lines, like why I wouldn't use PHP for that either.
There's a broken idea that AIs know Python because they're written in Python.
Not how any of it works.
Not what anyone was talking about. Training corpus ≠ inference engine.
While recent models are capable of generalizing to any language at this point, I do think there are weights from their pretraining corpus that still leak through into how they create their responses. We observed similar language performance patterns across models from different providers, btw.
I’m super surprised that C++ scores so high, this does not match our experience at all, and for anything performance critical it always drops the ball completely.
I also don’t understand how these “games” map to real world complex problems. How are you measuring success? How does “adversarial customer service” map to “this LLM is better at C++ than the other” ? How are you sure you’re not just benchmarking language suitability for a problem ?
I have so many questions about this…
- The majority of the environments can be played by having the agent write code to work the environment towards a goal. So the model is problem-solving, and it has to do so in a particular language, and some languages outperform others. We have a lot of data to back up the improved compiled-language performance, but note these are for successful code submissions (failures are counted in a different metric). With the Languages chart we're mostly measuring how good the ideas they came up with were, once they already compiled/didn't fail basic environment rules.
- You need to run evals at scale to converge on this kind of behavior: these benchmarks run samples across a pool of hundreds of different types of environments
- Some games are too open-ended to support code play. The customer service game is an example of that, where models are called on every tick of the environment to make a decision (that's the 'decision making' part of the evals which is weighted lowest). Very interesting results but not testing coding ability, just general reasoning.
Not sure what issues you have with models writing C++ vs other languages, but I can imagine all sorts of C++ specific bottlenecks not directly related to the model's ability to reason in the language, like the dependencies, verbosity, extra effort to manage memory, etc. I have only done a little C/embedded work since agentic coding happened but I was pleasantly surprised.
I've found the current cream of the crop to be quite good at resource management. I've sic'd Opus on some very gnarly lambda context bugs and it has directly improved the stability of the product I'm working on right now in a very substantial way. It couldn't quite do it entirely by itself, but with the right nudges here and there, it has absolutely accelerated the debugging work. It is particularly good at analyzing crashes and piecing together the detective work of what preconditions must exist for certain crashes to occur.
I think my problem is that I'm not sure I understand whether your evals are testing language abilities or reasoning abilities.
It seems to present results as if they’re testing language abilities, but the problems seem to be reasoning problems.
I would love to see how they do with functional languages, and especially Lisps, here. I've noticed pretty good performance with Emacs Lisp relative to overall model strength, but I haven't used LLMs to write application code in any such languages.
It would also be interesting to see how Python compares to other languages in its niche (Ruby, Perl, Raku).
Thanks for putting this together! It's interesting.
I've noticed that with clojure(script) unless you specifically instruct them to keep nesting levels low, they can hit a point where they make a paren placement error and can't debug their way out of it. Although in my case while one model made the error then couldn't find what it had done, a second model that I switched to was then able to identify it and back it out. So I suspect this is a transient weakness in today's models, not something fundamental.
That's a good idea. Would you rather see Lisp or Scala? Any interest in Prolog? We are trying to be selective to keep the data concentrated, but we will eventually add a couple more, most likely to sample different programming paradigms.
I think Clojure would probably make for a more interesting comparison because its syntax is more different from the other languages currently on there and it's less multi-paradigm than Scala is (it doesn't support OOP, it's more explicitly immutable-first). I think Scala is a lovely and cool language, but I'd be more interested in the Clojure comparison here.
Prolog might be interesting because I bet nobody is trying to train very hard on it, but I'm less directly interested in model performance with Prolog.
If you are taking requests, I was hoping to see Clojure on there.
My spider sense tells me the immutable-ness would help with correctness, but I'm not sure how much difference it would make in practice. Would love to see some numbers.
A relative lack of training data might have a bigger effect though.
Just last night I was going down the rabbit hole of "what's the best programming language to use for vibe coding." I came to a short list of:
a) Typed Racket
b) OCaml
c) Julia
I would love to see those three added to your benchmarks. And Mistral Medium 3.5 added to the LLM list, please.
Thanks for the recs, we will look into adding some of these, maybe OCaml for variety. I'm not familiar with Racket.
Mistral Medium 3.5 is on there, but you will have to scroll down pretty far to find it (does not perform well): https://gertlabs.com/rankings?mode=oneshot_coding
Racket is a variety of Scheme that grew up as a teaching language, but now also has a few other notable niches as well.
Typed Racket is to Racket as TypeScript is to JavaScript: it adds some additional static checks to an otherwise dynamic language via gradual typing. This pair of languages might help begin to answer the question: does gradual typing generally help LLMs, or does TypeScript outperform JavaScript for incidental reasons?
Among Lisps, I'm most interested in seeing Clojure because it's a language I can see myself using with LLMs at work. But Typed Racket and Racket could make an especially interesting pair because of the gradual typing thing.
I'm not sure whether you want to include them in your project. The kind of selectivity you describe yourself as going for is hard for me, especially since I'm not the one doing the work. :)
PS: Aside from this benchmarking and comparison project: Racket is an interesting language and seems like a good place to start if you want to explore classic Scheme texts (Structure and Interpretation of Computer Programs, The Little Schemer, How to Design Programs) or newer ones that try to teach newer or more specialized ideas (e.g., The Little Typer). You may have to tweak the language a bit to stay faithful to some of those books, but that's something Racket is good at and there are already sources noting relevant differences online.
When a non-programmer in my life expressed curiosity about programming, we ended up starting HtDP together and it's been fun. I think Racket was a good choice for that.
Thanks for that, I hadn't scrolled down far enough.
Just want to be sure I'm reading the results correctly... When I compare GPT-5.5 with Mistral Medium 3.5, I see in the tables:
a) Mistral beats GPT in Java and C++
b) It's close for Rust
c) GPT-5.5 easily wins for Go, Javascript, Python and Typescript
Model choice really does appear to be language dependent (assuming I'm reading the results correctly).
The deeper you go into the filters (single models, cross-correlated by specific languages), the smaller your sample sizes. A known limitation; tbh I doubt Mistral is better than GPT 5.5 at programming in any specific language, and it probably hit a few lower-quality generations by GPT 5.5 by chance (but I could be wrong! We're always adding more samples so the data improves over time. We always prioritize the largest sample counts for near-frontier models first).
What's going on with Qwen3.6 27b? Filtered to Python it comes out at the top of the list, which seems... well, unlikely.
While Qwen3.6 27B and 35B-A3B are very good, I am skeptical about them being that good. I think another factor is at play here.
The Qwen3.6 models have memorized some common games. For example, if you ask one to create an index.html with a snake game, it will generate almost the same high-quality snake game every time. The relatively low success rate of 25% but high average percentile of almost 100% for one-shot coding in Python suggests that the model is extremely good at a few tasks.
Qwen3.6 27b is a really strong model.
Yeah but that strong?
Yes, that strong. It's only lacking in context length, but it's not that small there, and it gets caught in circles more often than, say, a 1T-parameter model does.
That's why a lot of people have been freaking out about local LLMs since April. There's finally a decent model that runs locally on a GPU or two that can do agentic programming at reasonable enough tokens per second.
Those are some fine languages, but how did you pick them? What was the criterion?
The initial criteria were strongly typed and functional-first. Using an LLM for answers, of course, which returned me a list that looked like:
- Haskell
- OCaml
- F#
- Scala
- Gleam
- Purescript
- Grain
- Idris
Then I asked if there were any Schemes or Lisps that met the initial requirements, which added a bunch more options (Typed Racket, Typol, Elm, ReScript etc).
Then I asked about Julia specifically, as it's a language I'm already reasonably familiar with and knew that it's possible to write it with static annotations.
Next I started filtering the list based on additional criteria: I didn't want a JS compilation target, and I weighed performance, size of the package ecosystem, tooling, community, and learning curve (I do want to review and understand the output).
There were a bunch of follow-up questions over a few hours of prompting, reading and a couple of beers. All this resulted in the shortlist of OCaml, Typed Racket and Julia.
Julia pretty much remains in there, even though it doesn't really meet the strongly typed initial criteria, based on my familiarity, the ecosystem especially for AI/ML tasks and performance factors.
I know zero about OCaml and find the thought of learning it a bit daunting. Typed Racket seems more approachable anyway.
I just did a side-by-side with Claude Code Python vs. Raku for DSL use ... https://slangify.org if you are interested.
What would comparing rates across languages tell in the context of this benchmark? Are the tasks the same or robustly difficulty-normalized across the languages?
Also somehow the 2 language comparison graphs (avg percentile and success rate) rank Python in dramatically different positions, with Python outranking Rust and Java in the success rate. What does the avg percentile mean in this context?
Success rate measures the amount of code submissions that played the game/environment without failing (compilation, breaking game rules, violating sandbox, etc.), so it makes sense Python would do better there.
Percentile compares only the submissions that didn't hard-fail. So they are a bit different, and we incorporate them both into the combined score.
> Data here: https://gertlabs.com/rankings?mode=agentic_coding
Oh wow, we got "tribal domination", "market simulator" and "adversarial customer service". I don't know what those are but it sure sounds like big torment nexus milestones
Maybe we could at least play nicer games like hackenbush and act surprised when there's some wicked use-case that's isomorphic.
EDIT: Ok fine. I like "Rubik's Cube Chess" a lot. Never heard of it, is this analyzed formally at all? Hard to search for since there's tons of collisions
The LLMs are generally still pretty bad at (deductive) reasoning. IME they go along more with things like variable names and comments than with the actual program logic (it would be an interesting experiment to compare an LLM's understanding of three identical programs with different identifiers: one with normal identifiers, one with obfuscated identifiers, and one with deliberately misleading identifiers). I also think this particular comparison comes down to typing, which helps keep the LLM's reasoning from going astray.
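A minimal sketch of that experiment in Python (toy max-finding functions; all names invented here). All three bodies are logically identical; the question is whether a model describes the third by its name or by its code.

```python
# 1. Normal identifiers: names and logic agree.
def max_value(numbers):
    best = numbers[0]
    for n in numbers[1:]:
        if n > best:
            best = n
    return best

# 2. Obfuscated identifiers: same logic, meaningless names.
def f(a):
    b = a[0]
    for c in a[1:]:
        if c > b:
            b = c
    return b

# 3. Misleading identifiers: same logic, names that suggest the opposite.
def min_value(numbers):
    smallest = numbers[0]  # actually tracks the largest element
    for n in numbers[1:]:
        if n > smallest:
            smallest = n
    return smallest
```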
When we reason, we typically need to propagate constraints to arrive at a solution that satisfies them. I think the best language to reason in could be something like Lean, which allows both constraints and actual code to be expressed at the same time. Although this might not be the case for current LLMs, as I explain above.
wait till you look inside a neural network and realize they're incapable of deductive reasoning! amazing how many devs that talk about "AI" would probably have a hard time telling apart deductive and inductive reasoning.
That's actually untrue. Yes, training a neural network is mostly an inductive reasoning process. However, the ability of LLMs to reason deductively (as a chain of thought, although that's probably not the only mechanism) is an emergent phenomenon, arising from training on data and problems that exhibit deductive reasoning.
But of course, because the deductive reasoning is inductively taught, there may be various shortcuts that compromise its soundness. Hence my claim: LLMs are not as good at it as other algorithms, although they have many other strengths that make up for it.
How so?
Cool to see my hunch backed by data. Python is a scripting language with OOP bolted on, which means there isn't really the stylistic consistency other languages have; things tend to look like PHP, a collection of various scripts that invoke one another.
Python was designed with objects in mind from day one.
"Designed" is doing a lot of work here. There are clearly bits that are just bolted on because they didn't want to change the syntax.
EVERYTHING in Python is an object. I’m not sure how that could have been bolted onto the language
My feeling is that for agentic tasks it is not only language design but also LSPs, error messages, and static analysis capabilities that dominate these benchmarks. It would IMHO be interesting to look into better subsets of Python, style/rewrite techniques, and alternative linters, and their effects on performance.
A strict compiler is basically a free feedback loop for the LLM.
Also the human. (I like being told about my bugs when I write them, instead of at some generally much more unpleasant moment in the future.)
But then why does JS score 50% better? (Almost identical to TypeScript.)
Actually, JS can get a surprising amount of "intellisense" as well. Not sure if that was used here though.
Huh. This surprises me. Digging in, it looks like it comes down to interpreted + dynamically typed vs. compiled + statically typed.
TIL. If I were to start a truly vibe project, Go would have a significant leg up.
and yet dynamically typed elixir wipes the floor with go.
https://github.com/Tencent-Hunyuan/AutoCodeBenchmark/blob/ma...
LLMs get ridiculous with elixir, especially with the repl, runtime, and ability to hot reload / directly test functions. It's really surprising to me it hasn't caught on more but I guess you have to see it to believe it.
built my startup in elixir and can concur. elixir has a relatively consistent syntax that makes for a pretty good target for llms.
In my opinion, the only thing holding elixir back as an llm deliverable is that there's not as much training data for llms to work with.
Of course if we had a new AI that could be trained on a minimum of existing training data, common lisp would absolutely beat out everything else. everything you mentioned about elixir (repl, runtime, and ability to hot reload / directly test functions) are possible and were invented in lisp with an AST instead of a syntactic language as the ultimate build artifact. CL lets you recover from exceptions and rewind the stack before reloading your fixes and continuing. I can't even fathom the workloads an LLM could conceive of working with that.
Mm, the code is constrained to run inside a game 'tick'?
I thought it might have to do with the type system, but JavaScript's type system is atrocious and it scores about 50% higher. So my theory does not make much sense.
Hey they said it had a lot of training data, not necessarily high-quality python code training data.
This surprised me, but I can understand it - Python sucks in many ways lol.
My standard joke here:
Q: Say, what does this Python code do?
A: Nobody f&%^ing knows.
That’s Perl.
I had an itch to give Perl another go after a 5-year hiatus. I wanted a super simple way to spawn a proxy I was building in Go, along with writing various integration tests. I used Claude Code to write the bulk of it and found Claude to be remarkably good at Perl. I told Claude to only use what's built into Perl's standard library rather than reaching for anything in CPAN. Turns out everything from HTTP clients to TLS and JSON is all built in, which makes it a very stable and easy way to replace what I would normally have implemented in shell scripts. My theory is that because Perl hasn't changed all that much and has a ton of training data, Claude is actually quite good at Perl for cases where you might think to write shell scripts.
Many are saying this! https://til.andrew-quinn.me/posts/llms-make-perl-great-again...
Just use Go. LLMs have seen a ton of it, they write it well, it compiles practically instantly, and it has all the advantages of a typed compiled language.
I created a big Python codebase using AI, and the LLM constantly guesses arguments or dictionary formats wrong. Unit tests and stuff like pydantic help, but it's better to avoid that whole class of runtime errors altogether.
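For what it's worth, a minimal sketch of the pydantic guardrail mentioned above (the model and field names are invented for illustration). Note it still fires at runtime, which is exactly the limitation being pointed out:

```python
from pydantic import BaseModel, ValidationError

# Hypothetical schema for some job configuration.
class JobConfig(BaseModel):
    name: str
    retries: int = 3

try:
    # An LLM guessing the shape wrong: "name" missing, "retries" not an int.
    JobConfig(retries="five")
except ValidationError as e:
    print(e)  # both problems reported, but only once this code runs
```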
That’s what I’ve settled on. Python is so flexible that there are a million ways to organize code, pass arguments, etc. If you already have a code base to work from, an LLM can make new code in the style of the old code. But a fresh project? Once you get to a certain level of complexity it quickly can turn into write once, read never code (even if the code is passing tests).
This is where I’ve found that a compiled, strongly typed language (any one really) works well with an LLM. With the little bits of friction that is part of writing a language like Go, the LLM can produce pretty decent (and readable) code.
TIMTOWTDI strikes back.
Why use Go when you can use Rust?
1. Amount of Rust training data isn’t as much as Go.
2. Golang syntax and style are verbose yet simple. There aren't as many options, nor as much language-to-domain mapping needed, as in Rust. That means a less sophisticated LLM can spit out Golang successfully and efficiently.
This must really depend on your niche. I assume you do web stuff or something? Good luck finding any golang examples in a lot of other fields. Rust, on the other hand, is taking over the world in systems programming.
Been reading and drinking that kool-aid for some time until I realized it's just internet-bubble mumbo jumbo. The majority of systems are still written in C and C++, and will be for the foreseeable future.
>Good luck finding any golang examples in a lot of other fields.
There are go examples (and full blown programs) for anything, from servers to Kubernetes and Docker.
So I can test my feature today instead of waiting until it finishes compiling tomorrow.
this is the top reason for a reasonably complex project, but it can be worked around by preplanning crates.
the other reason is if you really want async, as is in vogue nowadays: function coloring. but this is rapidly becoming irrelevant, see the article.
> but it can be worked around by preplanning crates.
Maybe if you're working alone.
In short, compile times and a more full-featured stdlib
Doesn't Rust have long compile times? Does Go suffer from the same problem?
One of the design goals of Go was to be fast to compile. And they achieved it.
Go famously has stupidly fast compile times.
Because LLMs are better at Go? And because some people understand Go code easier and they might want to look at the code?
why? i have the same question
I’m heavy into rust and never really use golang, but one big benefit of go over rust is compile times are significantly quicker, which could be more fun if you’re running CI checks 50 billion times
>which could be more fun if you’re running CI checks 50 billion times
Even running them 5 times it's WAY more fun
why use Rust when you can use Zig?
Why use zig when you can use odin?
>the LLM constantly guesses arguments or dictionary formats wrong [...] it's better to avoid that whole class of runtime errors altogether.
Use Mypy in strict mode and run it in the post-turn hook of your LLM harness so the LLM has no choice but to obey it. And don't use overly general dictionary types when the keys are known at development time; use TypedDicts for annotations if you must use dicts at runtime.
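A minimal sketch of that setup (the record shape here is hypothetical): with a TypedDict, `mypy --strict` rejects misspelled keys and wrong value types before the code ever runs, so the harness can feed the errors straight back to the model.

```python
from typing import TypedDict

class UserRecord(TypedDict):
    id: int
    email: str

def notify(user: UserRecord) -> None:
    print(f"mailing {user['email']}")

ok: UserRecord = {"id": 1, "email": "a@example.com"}
notify(ok)

# Both of these run fine as plain dicts but fail `mypy --strict`:
# bad: UserRecord = {"id": 1, "emial": "a@example.com"}  # misspelled key
# notify({"id": "1", "email": "a@example.com"})          # wrong value type
```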
Why? Go has a GC, is basically incompatible with C and very limited overall
Go's limited syntax is actually a feature here, because it stops the LLM from trying to be too clever.
LLMs use `any` types, `recover`, `init`, and other weird warts of golang
rust is a better language in every way for LLMs: more precise typing, better compiler errors, fewer performance footguns, no race conditions, clear interface definitions and implementations
golang is easier for humans to quickly get productive, but the language is lacking in helpful features for an LLM
'incompatible with C' isn't a serious problem nowadays and won't be a problem at all in a couple years.
CGO exists.
Yup, adopting Go is exactly what I've done too.
Typed, garbage collected, fast to compile and run, stdlib that includes just enough to work out of the box. I really don't like writing it by hand but for the LLM it's perfect.
But what is the selling point for Go? I get that it is allegedly hailed as a simple language with basically no batteries included, but why is that a selling point? Does Go excel at anything no other language does?
No batteries!? Go has a huge stable standard library no other language even comes close to. Built in tooling for unit testing, performance testing, debugging, code formatting, package management, etc. And most go binaries can be compiled statically so libc is not even a dependency. Golang is the definition of batteries included.
>Go has a huge stable standard library no other language even comes close to
Well, Java and Python do.
Yet the first thing most people do before making a HTTP request is pip install requests
Yet a nicer request wrapper is not the be-all and end-all of batteries, and Python covers a huge spread of libs.
> Go has a huge stable standard library no other language even comes close to.
Java, C#, Python, Node.
Go has a very full featured standard library.
It's simple (do you really ask why that's a selling point?)
It's fast to compile.
It's fast to run.
It's good with parallelism.
It has myriads of examples, and LLMs can pick it up well too.
It has good backing.
It has good tooling.
It's fun.
It statically compiles to a trivially deployable binary.
It's excellent at cross compiling.
It has good adoption.
1. It has first-class co-routines, so supports high concurrency without having to deal with async bullshit
2. It produces a dependency-less statically linked binary
3. Duck typed interfaces give you static typing with minimal ceremony. They are implemented even for types outside your own code base, which is a common pain point in Java or C#.
4. It compiles quickly
I really don't like the lang itself but nobody will deny it has a very strong ecosystem and stdlib for handling around 95% of many well-solved problems you are likely to encounter.
I picked Go because it tends to use fewer resources than Node.js, and startup time is quite fast.
For one thing it’s statically typed and has many fewer foot guns than Python, so the llm-produced code is more likely to do what you expect.
Go is statically typed but the type system leaves much to be desired.
Go’s benefit are primarily around simplicity, readability, and concurrency.
>Go is statically typed but the type system leaves much to be desired.
Not that much. Looking at Rust or Haskell complexity, I don't really desire it.
Python has a much better type system than Go, I don't know what you're on. With Trio it has better async capabilities too.
Performance? Second only to rust and other lower level langs. Surely you don't need this spelled out for you...
Not just performance, but static typing and prevalent in the training data/easy for LLMs to reason about.
Of course, your response admits, "second to Rust", which I am guessing is an unspoken question in the grandparent's mind.
Java and C# are there and faster.
Yes, but kids these days only consider JS, Python, Rust and Go.
If performance is the main difference, whatever that means, then basically Go should be reserved for when Rust and other lower level langs cannot be used due to some other constraint? Are we mainly talking about performant Web backends?
Say I am building some app that I know will be CPU-bound, why choose Go over say... Swift?
>If performance is the main difference, whatever that means, then basically Go should be reserved for when Rust and other lower level langs cannot be used due to some other constraint?
Or when performance is the main but not the only difference, and there are many other benefits.
>Say I am building some app that I know will be CPU-bound, why choose Go over say... Swift?
Because unless you're building for macOS/iOS, Swift is really a no-go, with lackluster support for other platforms. Plus slow to build and convoluted.
> why choose Go over say... Swift?
Language religious wars are silly: you should choose a language based on your constraints and personal tastes. If there's no clear advantage of one language over another for a given task - then all the options are viable, pick one and get on with solving the problem.
>I get that it is allegedly hailed to be a simple language
That might be its core feature if you do agentic coding.
I think that's sort of the selling point, no? It's really boring. It has like ~25 keywords, compiles insanely fast, and has a concurrency model that's easy to use and read. LLMs are great at using Go tooling to sanity check along the way. It's easy to write shitty Go, but it's really pleasant to work with if you find those things compelling.
don't you worry about garbage collection?
If you were using Python, then probably not.
haha exactly. I’m coming from Swift, and I don’t want to go back to manually releasing objects like I used to in ObjC, let alone reason about lifetimes.
What's the big issue with GC nowadays? It has mattered to me exactly once in decades, and even then it was manageable by using a more low-level style in a hot loop. I see very few use cases where GC actually matters, and for those rare cases it's not like you were using Python beforehand anyway.
Why the hell would he "worry about garbage collection"? That kind of thing is a cargo cult fear.
Garbage collection is not an issue for 99% of programs. And for those that it is, there are ways to mitigate the issue (e.g. there are extremely high performance trading system written in Java, where every last sub-millisecond counts).
Blanket fear of GC reminds me when new programmers learned about how assembly is lower level and can be faster, and wondered why everything is not written in assembly.
>Just use Go. LLMs have seen a ton of it, they write it well, it compiles practically instantly, and it has all the advantages of a typed compiled language.
Or any of the faster typed languages you are most comfortable with, as you might need to look at the code some times. LLMs are great at writing and understanding C# and Java.
Also there are still considerations like domain, team expertise, org ecosystem etc. to consider. I love to use Rust for most things, but now I'm working with an org that primarily has expertise in Java, and I'm not going to rock the boat for barely any reason. Python is also still useful for most ML stuff, and Django is quite a pleasure to work with (although it wouldn't be my first choice).
The great thing about LLM-assisted coding is that an experienced software engineer can acquire decent familiarity with a language quite quickly. And then has a useful sparring partner for understanding and using the quirks and features of a new language.
Same here, working with a team that knows Java, so I'm letting Claude write Java.
If I compare the results to another team that uses Python with Claude I see slightly better results on the Java side. Not because Claude knows that better, but because the tools are more rigid by default which creates more of a self correcting loop for Claude. The Python side has Pydantic, but it's a bit of an afterthought, while in Java you can't skip the type checking.
In the end you can do the same things on both sides, it's 95% a team/engineering culture difference. So pick the language that the team knows best.
Training data can't be the whole answer. LLMs are really good at translating to different programming languages. This makes sense, given that they are derived from text translation systems. I'm getting great results in languages with comparatively small bodies of freely available code. The bigger hurdle is usually that LLMs tend to copy common idioms in the target language and if it is an "enterprise-y" language like Java or C#, the amount of useless boilerplate can skyrocket immediately, which creates a real danger that the result grows beyond the usable context window size and the quality suffers.
> Training data can't be the whole answer.
Absolutely correct. Anthropic showed that ~250 examples can "poison" an LLM, independent of model parameter count.
Very true.
I have to steer models hard for C++. They constantly suggest std::variant :P
is that bad?
Godbolt got a 2x speed improvement switching from what he thought was a good, fast implementation to std::variant:
https://www.youtube.com/watch?v=gg4pLJNCV9I
In higher dimensional vector space, yes it can.
Dimensionality gets bizarre in 1000-D space. Similarity and orthogonality express themselves in strange ways and each dimension codes different semantic meaning.
Therefore, if the training data is highly consistent you are by definition reducing some complexity and/or encoding better similarity.
In Go the statement `result, err := doSomething()` is almost always going to be followed by `if err != nil { ... }`. In a highly dynamic language you may not get that kind of explicit error handling unless explicitly asked for.

It's a little bit old, but it challenges your opinions about what matters for LLM agentic coding:
https://github.com/Tencent-Hunyuan/AutoCodeBenchmark/blob/ma...
> In a highly dynamic language you may not get that kind of explicit error handling
Being dynamic is secondary. A language that uses exceptions for errors does not always need to surround every try with a catch if the code doesn't need to. You have a top level handler that would catch everything.
> LLMs are really good at translating to different programming languages.
...for which ample training data is available.
> This makes sense, given that they are derived from text translation systems.
...for languages with ample training data available.
Yes, LLMs can combine information in novel ways. They are wonderful in many respects. But they make far more mistakes if they can't lean on copious amounts of training data. Invent a toy language, write a spec, and ask them to use it. They will, but they will have a hard time.
I have a language I wrote for processing data pipelines. I’ve used it for years, but I can count the number of users on one hand. I wrote it partially to learn about writing a scripting language, partially because Nextflow didn’t exist yet. I still use it now because it works much better for my way of processing data on HPC clusters.
The only code that exists on the internet for this is test data and a few docs in the github repo. It’s not wildly different from most scripting languages, from a syntax point of view, but it is definitely niche.
Both Codex and Claude figured it out real fast from an example script I was debugging. I was amazed at how well they picked up the minor differences between my script and others. This is basically on next to zero training data.
Would I ask it to produce anything super complex? Definitely not. But I’ve been impressed with how well it handles novel languages for small tasks.
That might be an argument for not using a novel homebrew programming language. But it's not an argument against, like, any top-100 or even top-1000 programming language, which will be adequately represented in the training data.
It is if more training data results in better performance. In which case, GP will continue to use the language that is likely to have the most training data available.
> It is if more training data results in better performance.
Sure. But given the relation with translation systems, it seems far more likely that there are diminishing returns to larger volumes of training data.
They are also good at generating plausible code, the kind that has no obvious bugs in it. I wouldn't be surprised if humans in the loop over-report success with these tools. Combined with decision fatigue, it's not a good recipe for humans making good decisions.
An experienced Rust developer is going to be in a better position to drive an agent to generate useful Rust code than a Python programmer with little or no Rust experience. Not sure I agree with the author that everyone should just generate reams of Rust now.
At least if your get paged at 3am to fix the 300k AI-generated Django blog you’ll have a chance at figuring things out. Good luck to you if Claude is down at the same time. But still better than if it was in Rust if you have no experience with that language.
That would matter if we were asking the AI to generate code open-loop: someone probably already wrote something close to what you asked for in Python. But if the agent generates code, tries to compile it, sees the detailed error messages and acts on those messages to refine the code, it's going to produce a higher quality result. rustc produces really good diagnostics. And there's a lot of Rust code online now, even if there's so much more Python and Javascript/Typescript.
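A minimal sketch of that closed loop (my own reconstruction, not any particular agent's implementation; `generate` is a stand-in for a real model call):

```python
import subprocess
from typing import Callable

def refine(task: str, generate: Callable[[str], str],
           max_rounds: int = 5) -> str:
    # Compile-feedback loop: regenerate until rustc stops complaining.
    code = generate(task)
    for _ in range(max_rounds):
        with open("main.rs", "w") as f:
            f.write(code)
        result = subprocess.run(["rustc", "main.rs"],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return code  # clean compile: no diagnostics left to act on
        # Feed rustc's detailed diagnostics back into the next attempt.
        code = generate(f"{task}\n\nFix these compiler errors:\n{result.stderr}")
    return code
```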
LLMs don't actually semantically parse the error messages. They will generate the most likely sequence resulting from the error message based on their training data, so you're back to the training data argument.
They process those error messages in the same way that they process your instructions about what code to generate. It is just more commands.
Perhaps the training data about what compiler diagnostics mean is particularly semantically rich training data.
Of course they do, error messages get tokenized and put into the context window just like anything else. This isn't a Markov chain.
Except the presence of errors, mistakes, contradictions, and doubling-back causes LLMs to have substantially worse output, especially without dedicated sub-agents who have been instructed about that deficiency and know to process that kind of crap into better prompts to insert into a different LLM with pristine, error-free context. Without hard numbers we're both just pissing into the wind, but it's entirely plausible that the higher rate of errors matters more than the fact that those errors are more ergonomic. Anecdotally, my LLM work is a _lot_ more productive when I have it draft the thing in Python and translate it into Rust since it wastes so much time on the tiniest of syntactic mistakes.
I built a programming language, and LLMs can code phenomenally well in it.
I don't think the training set matters that much, since there's no way they have my language in their training set!
Programming languages have a lot in common. Python is kind of odd when it comes to languages.
If the training data is basically irrelevant, then an LLM should be able to iteratively improve the programming language it uses, resulting in a custom language optimally designed to maximize its own coding ability. The source code might not even be human readable natively, just translated into pseudocode on an as-needed basis.
> If the training data is basically irrelevant, then an LLM should be able to iteratively improve the programming language it uses, resulting in a custom language optimally designed to maximize its own coding ability.
I won't be surprised if one day they do.
At least in their current form, I don't think they can independently design a language that is so much better than other available ones that it makes sense for them to use it.
There's a very good language for almost every use case already, designing one better than the ones already available is a VERY tall order.
It's almost like these languages aren't designed by morons, but built by teams of geniuses over a decade instead.
It's taken me 6 months of heavily steering an LLM to build a language that is not yet even ready for production use.
Maybe I'm the one slowing the LLM down. But it certainly does not seem that way.
The key to a good language for them - from my experience - is maximum expression plus minimum global complexity.
Anything that makes you manage memory lifetimes & memory safety is inherently unfriendly to LLMs - that's globally complex.
All scripting languages allow spaghetti aliases that let you hack your way into oblivion - and LLMs gladly ride that gravy train to hell.
Rust excels here, because it prevents the worst and is WAY more expressive than most people think.
Go has arguably the best runtime ever built, but it's not very expressive, and it still has a lot of problems from C and scripting languages - I don't think these types of languages will be the ones people chose to write code with for LLMs in the future.
People really need to stop assuming that more training data is better. That is not how it works. LLMs thrive on consistency.
Go, for example, has significantly less training data than Python, but LLMs are the best at it. Why? Go is usually written the same way. You go from project to project and the code all looks the same. There are only a few ways to write Go.
Also, every single interpreter error has an entire corpus of StackOverflow-esque fix suggestions alongside it, and the model has been fine-tuned to minimize such errors on the first try. This hasn't been done for more obscure languages. You'll likely take more turns, on average, to get a working output, even if your problem is fully verifiable via test input/outputs - and if it's not verifiable, you don't want the "attention" of the model focused on syntax rather than the solution.
There is no "entire corpus of StackOverflow-esque fix suggestions" about anything which is newer than a few years. I'm using cutting edge Android frameworks all the time. Yet, LLMs fix problems even when Google/Kagi has zero answers, which happens more often than not. We are way over this requirement.
What I especially found is that there is no difference between languages on that basis. All generated code's architecture is terrible if you don't actively and manually maintain it all the time, unless you already have a few tens of thousands of lines of finely architected code in your codebase, from which they can understand how it should really be done. And the reason, I think, is quite simple: the average code on the internet, regardless of the market penetration of the given language, is simply bad.
> I could write in brainfuck with ai, but I presume, wouldn’t get the same results than if going with python.
https://esolang-bench.vercel.app/
The conclusions seem overly broad. Just because these languages are Turing complete doesn't mean they aren't massively hampered by expressiveness and amount of batteries included. To attribute all of this to training data memorization is premature.
Oh this is a very damning paper. Using simple languages from their definitions alone is a great proxy for studying truly out-of-distribution reasoning. Also just for following simple rules/instructions correctly, because a simple enough language is practically just a grammar. This paper is terrible for anyone who wants to make the case that models can do those things well.
To the extent today's AI can reason, add this to the pile of evidence that you definitely need a harness. Counter to what you hear, that seems true for SOTA and frontier models, not just toy models. Lots of people were saying years ago that someone should test exactly this, because it's obvious. Someone at a megacorp probably did try it and decided not to publish because they thought it was bad optics.
and this sums it up right here.
Admittedly, I have very little experience with LLM-assisted Python. However, based on the severe degradation in output quality I have seen from an LLM working with plain JavaScript as opposed to TypeScript, I can't imagine choosing to start a project in Python at the moment.
It does seem like LLMs write better Python when told to use type annotations, especially when coupled with a linter.
I've been coding professionally in Python for about twenty years (alongside, at different times, a dozen or so other languages).
I find that Claude can write great modern Python more or less out of the box, with minimal style guidance from me. I do have to nudge it from time to time to not do silly things, but overall it's really rather good.
I wrote about the meta thesis of programming languages in the training data here
https://jry.io/writing/use-boring-languages-with-llms/
Please distill instead of having me navigate off site. Include link for additional info.
edit: side -> site
With AI it is important to catch errors/hallucinations early, and static typing helps with that.
Languages with dynamic typing might hide some errors until runtime; with static typing you can catch them during compilation.
With dynamic ones you need way more tests to cover the scenarios that the compiler handles for the static ones.
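A small Python illustration of that class of error (names invented): the dynamic run only fails when the bad path actually executes, while a checker like mypy flags it on every check, before anything runs.

```python
from typing import Optional

def find_user(uid: int) -> Optional[str]:
    # Returns None when the user does not exist.
    return f"user{uid}" if uid > 0 else None

name = find_user(-1)
try:
    print(name.upper())
except AttributeError:
    # mypy reports this statically, roughly:
    #   Item "None" of "Optional[str]" has no attribute "upper"
    print("crashed at runtime on the None path")
```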
And there is a significant amount of code written "for ages" in languages that have been around longer, like C, C++, and Java (yes, I know Python is quite old, older than Java: 1991).
Seems to me these LLMs have a critical mass of Python training data and Rust training data, so there's no advantage for Python there.
So as the article points out, an iterative process that catches the mistakes at compile time is much more suited for an AI than one that catches them at runtime.
The LLMs are actually worse at generating Python than other langs, hypothesized to be due to the quality of the training data lol.
I still read the generated code, so I'm not quite willing to give up on Python yet though.
For some people reducing infra costs matter. Python is very very slow, even if it uses native libs.
Large volumes of training data is a blessing and a curse, especially when you consider who wrote it.
I moved all my LLM-written code from Python to Rust. I've seen absolutely no difference; most of the time I couldn't even tell you which one it's writing in.
My programs are faster and more reliable than they’ve ever been.
I wouldn't say I get worse results with Go than I do with Python.
That's right, we don't need to care about a language, the same way we don't care about the map when FSD promises it's already end-to-end optimal.
There's enough training data on the other langs.
1) The models do generalise, so concepts translate. 2) Languages with more opinionated semantics and a better, more coherent community seem to do better. Python is a broad shitshow with multiple ways to achieve the same thing. Elixir is tight and focused. Claude is much better at Elixir.
> Read the first few comments and surprised I didn’t see it, but training data. The voluminous amount of Python in the training data.
That's actually part of the point. Almost no one writes types for Python and has complete type compliance. So all that training data is people just yoloing Python, writing a bunch of poor code in it.
I honestly can't believe any experienced software engineer would decide to build systems in Python these days.
No. If that mattered, you'd write everything in HTML and CSS, because those have way more training data.
Those are not programming languages.
WASM then.
That's more of a compilation target than a programming language and I don't really see the relevancy...
"I could write in brainfuck with ai"
Well, go on and do the experiment! Perhaps LLMs can write code as well in BF as in Python, but I don't recommend it, because hallucinations are really hard to notice in BF.
If you are going to worry about high-level computer languages and AI, you are going to have to start by getting to grips with machine code and assemblers and all that. Once you know how, say, some Python code ends up being processed by your laptop's CPU(s), then you will know when BF might be best!
> Frontier models score ~90% on Python but only 3.8% on esoteric languages, exposing how current code generation relies on training data memorization rather than genuine programming reasoning.
https://news.ycombinator.com/item?id=48100433#48102985