We use a rotating pool of ~100 games for the coding parts of the benchmark, and submissions are scored objectively using ratings similar to Elo. Models write code submissions to interact with the environment, and those submissions are then evaluated in large batches against other submissions.
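For anyone unfamiliar with Elo-style ratings, here is a minimal sketch of the textbook Elo update for a single head-to-head result. This is only the standard formula, not necessarily what the benchmark runs: the K-factor of 32, the 400-point scale, and the function names are assumptions for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability-like expected score of A against B under standard Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b) after one game.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss.
    k (assumed 32 here) controls how far one result moves the ratings.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's change mirrors A's, so the rating total is conserved.
    new_b = rating_b - k * (score_a - exp_a)
    return new_a, new_b


# Two equally rated submissions: the winner gains k/2 = 16 points.
print(update(1500.0, 1500.0, 1.0))
```

A batch evaluation would apply something like this over many pairings, which is why a rating says more than a raw win count: beating a strong submission moves the rating more than beating a weak one.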
We test 11 popular/interesting languages (you can filter by language in the Languages chart), but not Elixir -- although other evaluations have found that many LLMs solve more problems when working with Elixir [0]. Why models write better code in some languages than in others seems to go beyond pre-training data (Python scores quite low for most models), and we don't fully understand it.
[0] https://elixirforum.com/t/llm-coding-benchmark-by-language/7...
An expressive and well-designed language (Elixir) is objectively better than a less well-designed language like Python. Python probably needs more LoC than Elixir for the same task. Python also has no static type checking by default.
Elixir is not just expressive, it's highly conventional. I've found that best-practice code usually converges on the same idiomatic patterns, and well-written codebases look very similar to each other in style.
Thanks!