I started building my own AI model benchmarks and leaderboard[0] after testing MiniMax M2.5. It was supposedly good based on standard benchmarks, but it performed really poorly in practice and burned through hundreds of thousands of reasoning tokens per request...

[0]: https://aibenchy.com