Everything you wrote is very good criticism, and I would love to stress the main takeaway: you’re not just testing the model but the prompt and harness as well.

An excellent follow-up study would be to change the prompt and compare the answers. You might find out that the models are good and the prompt is bad.

…And so the main corollary is: build evals for anything you deploy in production; benchmark and monitor, or face the consequences.
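
For what it's worth, here is a rough sketch of what a prompt-comparison eval could look like. Everything in it is made up for illustration: `fake_model` is a toy stand-in for a real model call, and the prompt templates, test cases, and exact-match grader are assumptions, not anything from the original study. The point is only that the same "model" can score very differently depending on the prompt and the grading harness.

```python
from typing import Callable, Dict, List, Tuple

# Toy knowledge base for the hypothetical stand-in model below.
KNOWN = {"2 + 2": "4", "3 * 3": "9"}


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; swap in your own client.

    It always "knows" the answer, but only returns a bare number when the
    prompt explicitly asks for one.
    """
    answer = next((v for k, v in KNOWN.items() if k in prompt), "unknown")
    if "number only" in prompt:
        return answer
    return f"The result is {answer}."


# Hypothetical prompt variants to compare.
PROMPTS: Dict[str, str] = {
    "vague": "What is {question}?",
    "strict": "Answer with a number only. What is {question}?",
}

# (question, expected answer) pairs, graded by exact match.
CASES: List[Tuple[str, str]] = [("2 + 2", "4"), ("3 * 3", "9")]


def score_prompts(
    model: Callable[[str], str],
    prompts: Dict[str, str],
    cases: List[Tuple[str, str]],
) -> Dict[str, float]:
    """Return exact-match accuracy per prompt variant."""
    results = {}
    for name, template in prompts.items():
        correct = sum(
            model(template.format(question=question)).strip() == expected
            for question, expected in cases
        )
        results[name] = correct / len(cases)
    return results


if __name__ == "__main__":
    for name, accuracy in score_prompts(fake_model, PROMPTS, CASES).items():
        print(f"{name}: {accuracy:.0%}")
```

Here the toy model knows every answer, yet the vague prompt fails the exact-match grader because the answer comes back wrapped in a sentence. A real setup would need many more cases and probably a more forgiving grader, but running something like this on a schedule is one cheap way to benchmark and monitor.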