It's an internal benchmark I use to test prompts, models, and prompt tunes: nothing more than a dashboard that calls our internal endpoints and shows the data, so it basically goes through the prod flow.
For my product, I run a video through a multimodal LLM in multiple steps, combine the data, and spit out the outputs + a score for the video.
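For a rough idea of the shape of that flow, here's a minimal sketch (the step names and `call_step()` are placeholders, not my actual internal API):

```python
def call_step(model: str, step: str, video_path: str, context: dict) -> dict:
    """Stand-in for the internal endpoint that hits the multimodal LLM."""
    return {"step": step, "score": 0.0}  # placeholder response

def score_video(model: str, video_path: str) -> dict:
    """Run the steps in order, feeding earlier outputs into later ones."""
    context: dict = {}
    for step in ("extract", "analyze", "score"):  # placeholder step names
        context[step] = call_step(model, step, video_path, context)
    # Combine the per-step outputs into the final payload + score.
    return {"outputs": context, "score": context["score"]["score"]}
```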
I have a dataset of videos that I manually scored for my use case, so when a new model drops, I run it + the last few best-benchmarked models through the process and check multiple things:
- Diff between the output score and the manual one
- Processing time for each step
- Input/output tokens
- Request time for each step
- Price of the request
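Each (model, video) pass gets logged as a row roughly like this (just a sketch, the field names are illustrative rather than my actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class RunRecord:
    """One (model, video) pass through the flow."""
    model: str
    video_id: str
    manual_score: float                                 # my hand-labelled score
    model_score: float                                  # score the model produced
    step_times_s: dict = field(default_factory=dict)    # processing/request time per step
    input_tokens: int = 0
    output_tokens: int = 0
    price_usd: float = 0.0

    @property
    def score_delta(self) -> float:
        return self.model_score - self.manual_score
```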
Plus the classic stats: average score delta, average time, p50, p90, etc. One fun thing is finding the edge cases: even if the average score delta is low (meaning the model is spot-on overall), there are usually a few videos where the absolute delta is much higher, and those usually point to niche edge cases the model struggles with.
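The aggregation itself is nothing fancy, roughly along these lines (a sketch; the row keys and the edge-case threshold are assumptions, pick whatever fits your score scale):

```python
import statistics

def summarize(rows, edge_threshold=1.5):
    """rows: dicts with 'video_id', 'manual_score', 'model_score', 'total_time_s'
    (illustrative keys). edge_threshold is an arbitrary cutoff in score units."""
    abs_deltas = [abs(r["model_score"] - r["manual_score"]) for r in rows]
    times = [r["total_time_s"] for r in rows]
    deciles = statistics.quantiles(times, n=10)  # 9 cut points: 10th..90th percentile
    return {
        "mean_abs_delta": statistics.fmean(abs_deltas),
        "mean_time_s": statistics.fmean(times),
        "p50_time_s": deciles[4],
        "p90_time_s": deciles[8],
        # Videos whose delta sticks out even when the average looks good:
        "edge_cases": [r["video_id"]
                       for r, d in zip(rows, abs_deltas) if d > edge_threshold],
    }
```

The dashboard then just renders a summary like that per model.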
Gemini 3 Flash nails it, sometimes even better than the Pro version, with nearly the same times as 2.5 Pro on this use case. I actually pushed it to prod yesterday, and looking at the data it seems to be 5 seconds faster than Pro on average, with my cost per user dropping from 20 cents to 12 cents.
IMO it's pretty rudimentary, so let me know if there's anything else I can explain.