We have an OCR job running with a lot of domain specific knowledge. After testing different models we have clear results that some prompts are more effective with some models, and also some general observations (eg, some prompts performed badly across all models).
Sample size was 1000 jobs per prompt/model. We run them once per month to detect regression as well.