Why don't they ask their premier model to generate a bench for them?

It's not a crazy idea. Have the older model interview the newer one and then ask both (or maybe a third referee model) which one they think is smarter. Repeat 100x with different seeds. The percentage of times both sides agree the newer model won is the score.