Would a temperature of 1.0 have fixed the wide variance in scoring?

temperature is the wrong tool

the variance is caused by the bad evaluation prompt

if you ask "what is the capital of France" you'll always get Paris, at any (non-extreme) temperature
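
a rough sketch of why, in case it helps: temperature only rescales the model's logits before sampling, so it matters when the model is already unsure. The logits below are made up for illustration (they are not from any real model), and `softmax_with_temperature` is just a hypothetical helper name.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits, invented for this example.
clear_prompt = [12.0, 2.0, 1.0]  # one answer dominates, like "Paris"
vague_prompt = [2.0, 1.8, 1.6]   # several scores look about equally plausible

for t in (0.5, 1.0, 1.5):
    p_clear = softmax_with_temperature(clear_prompt, t)[0]
    p_vague = softmax_with_temperature(vague_prompt, t)[0]
    print(f"T={t}: clear prompt top prob={p_clear:.4f}, vague prompt top prob={p_vague:.4f}")
```

with these numbers the clear prompt keeps its top answer above 0.99 at every temperature tried, while the vague prompt's top choice never reaches 0.5. Under that assumption, changing the temperature moves the spread a little but doesn't remove it; tightening the prompt is what changes the logits.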