Though this Codex version isn't on the leaderboard, GPT-5.2-Medium already seems to be a bit better than Opus 4.5: https://swe-rebench.com/
Is that your website or something? You keep promoting it.
No, I am not affiliated with the website. I just want to see more discussion based on uncontaminated benchmarks, and I feel that people rely too much on benchmarks that companies can run themselves. When that is the case, I don't feel I can trust the results. For general LLM capabilities, for example, I would also tend to rely on dubesor [1] rather than Artificial Analysis or similar leaderboards.
[1] https://dubesor.de/benchtable