- Most of the environments can be played by having the agent write code that works the environment toward a goal. So the model is problem solving, it has to do so in a particular language, and some languages outperform others. We have a lot of data backing the stronger performance of compiled languages, but note these figures cover successful code submissions only (failures are counted in a separate metric). So the Languages chart is really measuring how good the model's ideas were, once its code already compiled and obeyed basic environment rules (see the harness sketch after this list).

- You need to run evals at scale to converge on this kind of behavior: these benchmarks run samples across a pool of hundreds of different types of environments. (The first sketch after this list shows both of these points in miniature.)

- Some games are too open-ended to support code play. The customer service game is an example: models are called on every tick of the environment to make a decision (that's the 'decision making' part of the evals, which is weighted lowest). Very interesting results, but not a test of coding ability, just general reasoning (see the tick-loop sketch below).
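To make the first two bullets concrete, here's a minimal sketch of how a harness like this might split the metrics; everything in it (`Result`, `run_submission`, the simulated outcomes) is hypothetical, not the actual eval code:

```python
import random
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Result:
    compiled: bool      # did the submission build at all?
    legal: bool         # did it obey the environment's basic rules?
    goal_score: float   # how far it moved the environment toward the goal

def run_submission(env: str, language: str) -> Result:
    # Stand-in for the real pipeline: prompt the model for code in
    # `language`, build it, run it in `env`, and measure the outcome.
    compiled = random.random() > 0.2
    legal = compiled and random.random() > 0.1
    return Result(compiled, legal, random.random() if legal else 0.0)

def score_pool(environments, languages, samples_per_env=10):
    failures = defaultdict(int)   # feeds the separate failure metric
    quality = defaultdict(list)   # feeds the Languages chart (successes only)
    for env in environments:                  # hundreds of environment types
        for lang in languages:
            for _ in range(samples_per_env):  # many samples, to converge
                r = run_submission(env, lang)
                if r.compiled and r.legal:
                    quality[lang].append(r.goal_score)
                else:
                    failures[lang] += 1
    return failures, quality
```

The point of the split: a language can look great on the quality metric while failing to compile often, and you only see that if the two are reported separately.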
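And for contrast, a toy version of the per-tick decision-making setup (again hypothetical, just to show the shape): the model never submits code; it's queried once per tick, so the eval exercises general reasoning rather than coding.

```python
import random

class ToyCustomerServiceEnv:
    # Hypothetical stand-in for an open-ended game with no code play.
    def __init__(self, ticks=50):
        self.remaining = ticks

    def observe(self):
        return {"ticket": random.choice(["refund", "bug", "question"])}

    def step(self, action):
        self.remaining -= 1
        reward = 1.0 if action == "resolve" else 0.0
        return reward, self.remaining <= 0   # (reward, done)

def play(env, decide):
    # `decide` stands in for a model call; note it runs on every tick.
    score, done = 0.0, False
    while not done:
        action = decide(env.observe())   # one model call per tick
        reward, done = env.step(action)
        score += reward
    return score

print(play(ToyCustomerServiceEnv(), decide=lambda obs: "resolve"))
```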

Not sure what issues you've had with models writing C++ vs. other languages, but I can imagine all sorts of C++-specific bottlenecks not directly related to the model's ability to reason in the language: dependency management, verbosity, the extra effort of managing memory, and so on. I've only done a little C/embedded work since agentic coding took off, but I was pleasantly surprised.

I've found the current cream of the crop to be quite good at resource management. I've sicced Opus on some very gnarly lambda context bugs, and it has directly and substantially improved the stability of the product I'm working on. It couldn't quite do it entirely by itself, but with the right nudges here and there it has absolutely accelerated the debugging work. It's particularly good at analyzing crashes and piecing together the detective work of what preconditions must exist for a given crash to occur.

I think my problem is that I’m not sure whether your evals are testing language abilities or reasoning abilities.

The benchmark seems to present its results as if they’re testing language ability, but the underlying problems look like reasoning problems.