My feeling is that for agentic tasks it is not only language design but also LSPs, error messages, and static analysis capabilities that dominate the benchmarks. IMHO it would be interesting to look into better subsets of Python and style/rewrite techniques, as well as alternative linters, and their effect on performance.

A strict compiler is basically a free feedback loop for the LLM.

Also the human. (I like being told about my bugs when I write them, instead of at some generally much more unpleasant moment in the future.)
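
To make the feedback-loop point concrete, here is a minimal sketch in TypeScript, assuming `"strict": true` in the compiler options. The `User` type and `emailDomain` function are made up for illustration and are not taken from the benchmark. Calling a string method on `user.email` without a check is rejected at compile time because the value may be undefined, so the mistake surfaces while the code is being written instead of at runtime.

```typescript
// Hypothetical example: these names are illustrative only.
interface User {
  name: string;
  email?: string; // optional, so its type is string | undefined
}

// Under strict mode, using user.email as a string before checking it
// is a compile-time error. The check below is what the compiler asks for.
function emailDomain(user: User): string | null {
  if (user.email === undefined) {
    return null; // the missing-value case has to be handled explicitly
  }
  const at = user.email.indexOf("@");
  return at === -1 ? null : user.email.slice(at + 1);
}
```

Whoever wrote the first draft, human or LLM, gets the complaint immediately and can fix it in the same iteration.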

But then why does JS score 50% better? (Almost identical to TypeScript.)

Actually, JS can get a surprising amount of IntelliSense as well. Not sure if that was used here, though.
