LLMs have positional, response-length, and hedge-word biases (and those are just the ones rigorously demonstrated in papers), and these biases wash out the differences between high-performing answers as you approach the limit of your objective. Imagine trying to optimize a function when the measurement function emits biased random noise: past some point you can no longer accurately identify the impact of your changes.
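To make the analogy concrete, here's a minimal simulation of an LLM-as-judge comparison. All the bias magnitudes, helper names, and the judge model itself are invented for illustration; the point is only the shape of the result, not the specific numbers.

```python
import random

def judge(score_a, score_b, len_a, len_b):
    """Toy judge: returns True if it prefers answer A.

    Adds a fixed positional bias toward the first answer, a
    length bias toward longer answers, and random noise. All
    magnitudes are assumptions, not measured values.
    """
    POSITION_BIAS = 0.3    # assumed: first answer gets a bump
    LENGTH_BIAS = 0.002    # assumed: per-token preference for length
    NOISE = 0.5            # assumed: judge inconsistency

    perceived_a = score_a + POSITION_BIAS + LENGTH_BIAS * len_a + random.gauss(0, NOISE)
    perceived_b = score_b + LENGTH_BIAS * len_b + random.gauss(0, NOISE)
    return perceived_a > perceived_b

def win_rate(true_gap, trials=20_000):
    """How often the genuinely better answer (B) wins a judgment."""
    wins = 0
    for _ in range(trials):
        len_a = random.randint(50, 400)
        len_b = random.randint(50, 400)
        # B is truly better by `true_gap`, but shown second.
        if not judge(0.0, true_gap, len_a, len_b):
            wins += 1
    return wins / trials

for gap in (2.0, 1.0, 0.5, 0.1, 0.01):
    print(f"true gap {gap:>4}: better answer wins {win_rate(gap):.1%} of judgments")
```

When the true quality gap is large, the better answer wins almost every comparison; as the gap shrinks toward zero, the win rate collapses toward whatever the positional and length biases dictate. That's exactly the regime you enter near the limit of your objective: your candidates differ by less than the judge's bias floor, and the signal is gone.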