We partially solve that by running the base version and the changed (delta) version concurrently. That way, most sources of interference hit both runs at the same time, so what we measure is the relative delta between the two versions.
Other than that, as always, run the benchmarks on a stable, dedicated set of hardware.
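A rough sketch of the paired-run idea, in Python. The names `timed`, `run_pair`, and `relative_delta` are hypothetical and made up for illustration; this is not the actual harness. It assumes each version can be wrapped in a callable, and it uses threads only to keep the example short.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed(fn):
    """Return the wall-clock seconds taken by one call to fn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def relative_delta(base_seconds, candidate_seconds):
    """Fractional change of the candidate relative to the base.

    Positive means the candidate was slower, negative means faster.
    """
    return (candidate_seconds - base_seconds) / base_seconds


def run_pair(base_fn, candidate_fn):
    """Start both versions at the same time so noise tends to hit both.

    Assumption: threads are used for brevity. Python threads share the GIL,
    so CPU-bound workloads would not truly run in parallel here; a real
    setup would more likely use separate processes or machines.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        base_future = pool.submit(timed, base_fn)
        candidate_future = pool.submit(timed, candidate_fn)
        return base_future.result(), candidate_future.result()
```

Usage would look something like `base, cand = run_pair(run_base, run_candidate)` followed by `relative_delta(base, cand)`, where `run_base` and `run_candidate` are whatever callables drive the two versions.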
And run a bunch of iterations and mash the results together, maybe?
Yes, or run for a sufficiently long duration. Some things, like allocation rate or significantly worse performance, are obvious and show up even in shorter runs.
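One possible way to "mash the results together", sketched in Python. The helper name `summarize` is hypothetical, and the median is an assumption on my part: it is a common choice because a single noisy iteration does not drag it around much, but nothing above says it is the aggregation actually used.

```python
import statistics


def summarize(deltas):
    """Combine per-iteration relative deltas into a small summary.

    deltas: one relative delta per iteration (e.g. 0.05 means 5% slower).
    The median is an assumed choice of aggregate, not a prescribed one.
    """
    ordered = sorted(deltas)
    return {
        "runs": len(ordered),
        "median": statistics.median(ordered),
        "min": ordered[0],
        "max": ordered[-1],
    }
```

A wide gap between `min` and `max` is a hint that more iterations, or a longer run, are needed before trusting the median.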