Totally aside from disagreement between models that are unbiased by prior input, any such experiment may fail to capture the outcomes experienced by real users, whose prior text exchanges may substantially change the text they receive.
For instance, see the folks who think that they have "awakened" their instance of ChatGPT.
Actual usage may diverge to a greater degree than the models diverge from one another.