It seems the benchmarks keep changing, and they end up favoring the latest AI agent literally every time.