Maybe try writing a simple script that randomly swaps between the three latest models, then see if you can tell which ones are meaningfully different without knowing which one is active?
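Something like this could work as a starting point for the blind swap. It's only a rough sketch: the model names are placeholders, and `make_hidden_schedule` and `guess_accuracy` are names made up for illustration, not part of any tool. The idea is to pick a model at random for each session, keep the schedule hidden, write down your guess after each session, and then compare.

```python
import random

def make_hidden_schedule(models, n_sessions, seed=None):
    """Pick one model at random for each session.

    Keep the returned list hidden until all guesses are recorded.
    """
    rng = random.Random(seed)
    return [rng.choice(models) for _ in range(n_sessions)]

def guess_accuracy(schedule, guesses):
    """Fraction of sessions where the guess matched the hidden model."""
    if len(schedule) != len(guesses):
        raise ValueError("need exactly one guess per session")
    hits = sum(actual == guess for actual, guess in zip(schedule, guesses))
    return hits / len(schedule)

if __name__ == "__main__":
    # Placeholder names; substitute whichever three models you are comparing.
    models = ["model-a", "model-b", "model-c"]
    schedule = make_hidden_schedule(models, n_sessions=30)
    # In real use, `guesses` would come from your own notes after each session.
    guesses = [random.choice(models) for _ in schedule]
    print(f"accuracy: {guess_accuracy(schedule, guesses):.2f}")
```

With three models, pure guessing should land around 1/3, so an accuracy near that suggests you can't really tell them apart.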

I find the quality ebbs and flows even on the same model. My guess is that it has something to do with GPU availability, but I'm only guessing.

Unless you're systematically repeating the exact same task, the most parsimonious explanation is that you're seeing natural variation based on different tasks, random sampling of tokens, etc.
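To put a rough number on that, you could repeat the exact same task several times per model, score each run yourself, and check whether the gap between models is bigger than the run-to-run spread. This is a crude heuristic, not a proper statistical test; `difference_exceeds_noise` and the threshold `k` are hypothetical choices for illustration.

```python
from statistics import mean, stdev

def difference_exceeds_noise(scores_a, scores_b, k=2.0):
    """Crude check: is the gap between mean scores larger than
    k times the larger run-to-run standard deviation?

    Each list holds scores from repeated runs of the SAME task
    (at least two runs per list, since stdev needs two points).
    """
    noise = max(stdev(scores_a), stdev(scores_b))
    return abs(mean(scores_a) - mean(scores_b)) > k * noise
```

If this returns False for repeated runs of one task, the quality swings you notice are probably within normal sampling variation.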