How do you run evaluations on your fine-tuned or RL-trained models today? I’m curious about:
Workflow: the tools/scripts you rely on for metrics, drift detection, and other checks.
Headaches: the step that still breaks or slows you down the most.
Wishlist: if an open-source eval suite existed, what must-have features would land it in your stack?
Real stories (good and ugly) would be super helpful -- thanks in advance for sharing!
Also, please let me know if you'd like to be a very early user of the open-source evals tool we're building, and I'll send you an invite.