What would comparing rates across languages tell in the context of this benchmark? Are the tasks the same or robustly difficulty-normalized across the languages?
Also somehow the 2 language comparison graphs (avg percentile and success rate) rank Python in dramatically different positions, with Python outranking Rust and Java in the success rate. What does the avg percentile mean in this context?
Success rate measures the amount of code submissions that played the game/environment without failing (compilation, breaking game rules, violating sandbox, etc.), so it makes sense Python would do better there.
Percentile compares only the submissions that didn't hard-fail. So they are a bit different, and we incorporate them both into the combined score.