How do you run evaluations on your fine-tuned or RL-trained models today? I’m curious about:
Workflow: the tools/scripts you rely on for metrics, drift detection, and other checks.
Headaches: the step that still breaks or slows you down the most.
Wishlist: if an open-source eval suite existed, what must-have features would land it in your stack?
Real stories (good and ugly) would be super helpful -- thanks in advance for sharing!
Also, please let me know if you'd like to be a very early user of the open-source evals tool we're building, and I'll send you an invite.