I have a thing where I immediately doubt any ML paper that imitates a process then claims that the model is sometimes “even better” than the original process. This almost always means that there is an overzealous experimenter or a PI who didn’t know what they were dealing with.

Hello, lead author here. First: you are right! A surrogate model is a fancy interpolator, so at best it will be as good as the model it is trying to mimic, not better. The piece that probably got lost in translation is that the codes we are mimicking have accuracy settings which you sometimes can't push to maximum because of the computational cost. But with the kind of tools we are developing, we can push those settings when we are creating the training dataset (since this is cheaper than running the full analysis). In this way, the emulator can be more precise than the original code run with "standard settings", because it has been trained on more accurate outputs. This claim of course needs checking: if I am including an effect that might have a 0.1% impact on the final answer but the surrogate has an emulation error of order 1%, clearly the claim would not hold.
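
To make that last sanity check concrete, here is the back-of-the-envelope comparison I have in mind (the numbers are the hypothetical ones from above, purely illustrative):

    # Hypothetical error budget: the surrogate trained on high-accuracy runs
    # only beats the standard-settings code if its emulation error is smaller
    # than the effect of the better settings it was trained with.
    effect_of_better_settings = 1e-3  # ~0.1% shift of the final answer
    emulation_error = 1e-2            # ~1% error of the surrogate itself

    print("surrogate beats standard settings:",
          emulation_error < effect_of_better_settings)  # False for these numbers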

There are straightforward emulation settings in which a trained emulator can be more accurate than a single forward run, even when both training and "single forward run" use the same accuracy settings.

Suppose you emulate a forward model y = F(x) by choosing a design X = {x1, ..., xN} and building a training set T = {(x1, y1), ..., (xN, yN)}, where yi = F(xi).

With T, you train an emulator G. You want to know how good y0hat = G(x0) is compared to y0 = F(x0), a single run at some new point x0.

If there is a stochastic element to the forward model F, there will be noise in all of the y's: in the training set, but also in y0 itself! (Hopefully your noise has expectation 0.)

(This would be the case for a forward model that uses any kind of Monte Carlo under the hood.)

In this case, because the trained G(x0) is effectively averaging the y's at (say) all the nearby x's, you can see variance reduction in y0hat compared to y0. This, for example, applies in a very direct way to G's that are kernel methods.
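
Here's a toy, self-contained sketch of what I mean (everything here is illustrative: a made-up 1-D forward model with Monte Carlo noise, and a plain Gaussian-kernel smoother standing in for G):

    import numpy as np

    rng = np.random.default_rng(0)

    def truth(x):
        return np.sin(3.0 * x)

    def F(x, sigma=0.05):
        # Toy stochastic forward model: smooth truth plus zero-mean MC noise.
        return truth(x) + sigma * rng.standard_normal(np.shape(x))

    def fit_emulator(X, y, h=0.05):
        # G as a Gaussian-kernel smoother: a weighted average of nearby training y's.
        def G(x0):
            w = np.exp(-0.5 * ((X - x0) / h) ** 2)
            return np.sum(w * y) / np.sum(w)
        return G

    x0 = 1.234
    err_single, err_emulator = [], []

    for _ in range(500):
        X = np.linspace(0.0, 2.0, 200)   # design {x1, ..., xN}
        y = F(X)                         # noisy training outputs {y1, ..., yN}
        G = fit_emulator(X, y)

        y0 = F(x0)                       # a single noisy forward run at x0
        y0hat = G(x0)                    # emulator: averages nearby noisy runs

        err_single.append((y0 - truth(x0)) ** 2)
        err_emulator.append((y0hat - truth(x0)) ** 2)

    print("RMS error of a single forward run:", np.sqrt(np.mean(err_single)))
    print("RMS error of the emulator y0hat:  ", np.sqrt(np.mean(err_emulator)))
    # Typically the emulator comes out more accurate: averaging many nearby noisy
    # runs suppresses the Monte Carlo variance at the price of a small smoothing bias.

The same comparison works with a GP or kernel ridge in place of the hand-rolled smoother; the point is just that y0hat averages many noisy training runs while y0 carries the full single-run noise.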

I have observed this in real emulation problems. If you're pushing for high accuracy, it's not even rare to see.

More speculatively, one can imagine settings in which (deterministic) model error, when averaged out over nearby training samples in computing y0hat, can be smaller than the single-point model error affecting y0. (For example, there are some errors in a deterministic lookup table buried in the forward model, and averaging nearby runs of F causes the errors to decrease.)
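
A toy sketch of that scenario too (again everything is made up: the "lookup table" is a grid of nodes each carrying a fixed, frozen error, and G is the same kind of kernel smoother):

    import numpy as np

    rng = np.random.default_rng(0)

    def truth(x):
        return np.sin(3.0 * x)

    # Deterministic forward model with a buried lookup-table error: each node
    # carries a fixed (drawn once, then frozen) error, and F returns the truth
    # plus the error of the nearest node.
    table_x = np.linspace(0.0, 2.0, 101)                  # node spacing 0.02
    table_err = 0.05 * rng.standard_normal(table_x.size)  # frozen per-node errors

    def F(x):
        idx = np.abs(x[:, None] - table_x).argmin(axis=1)  # nearest-node lookup
        return truth(x) + table_err[idx]

    def fit_emulator(X, y, h=0.05):
        # G as a Gaussian-kernel smoother over nearby training runs.
        def G(x0):
            w = np.exp(-0.5 * ((X - x0) / h) ** 2)
            return np.sum(w * y) / np.sum(w)
        return G

    X = np.linspace(0.0, 2.0, 400)
    G = fit_emulator(X, F(X))

    x0_grid = np.linspace(0.2, 1.8, 200)                  # stay away from the edges
    y0hat = np.array([G(x0) for x0 in x0_grid])
    print("mean |y0    - truth|:", np.mean(np.abs(F(x0_grid) - truth(x0_grid))))
    print("mean |y0hat - truth|:", np.mean(np.abs(y0hat - truth(x0_grid))))
    # Because the per-node errors have no systematic trend, the kernel average over
    # several table cells partially cancels them, so y0hat can sit closer to the
    # truth than the deterministic forward run y0 itself.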

I have seen this claim credibly made, but verifying it is hard -- the minute you find the model error that explains this[*], the model will be fixed and the problem will go away.

[*] E.g., you show a plot of y0hat overlaid on y0, and the people who maintain the forward model ask "are you sure you have y0 and y0hat labeled correctly?"