Do you use a tool for this? Is there some sort of tool which collects evals from live inferences (especially those which fail)
There is no way to prove the correctness of non-deterministic (a.k.a. probabilistic) results for any interesting generative algorithm. All one can do is validate against a known set of tests, with the understanding that the set is unbounded over time.
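One way to put that in practice: you can't prove correctness, but you can run the model against a growing suite of validators and capture every failure for review, folding failures back into the suite over time. A minimal sketch, not tied to any particular framework (the names `EvalCase`, `run_evals`, and the stubbed-out "model" are all hypothetical):

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # a validator, since exact-match is unreliable for generative output

@dataclass
class EvalReport:
    passed: int = 0
    # failing (prompt, output) pairs are kept so they can be reviewed
    # and added to the test set, which is unbounded over time
    failures: List[Tuple[str, str]] = field(default_factory=list)

def run_evals(generate: Callable[[str], str], cases: List[EvalCase]) -> EvalReport:
    """Validate a non-deterministic generator against a known set of tests."""
    report = EvalReport()
    for case in cases:
        output = generate(case.prompt)
        if case.check(output):
            report.passed += 1
        else:
            report.failures.append((case.prompt, output))
    return report

# Usage with a deterministic stub standing in for the model:
cases = [
    EvalCase("2+2?", lambda out: "4" in out),
    EvalCase("capital of France?", lambda out: "Paris" in out),
]
report = run_evals(lambda p: "4" if "2+2" in p else "Lyon", cases)
# the first case passes; the second fails and is captured for review
```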
https://x.com/rerundotio/status/1968806896959402144
This is a use of Rerun that I haven't seen before!
This is pretty fascinating!!!
Typically people use Rerun to visualize robotics data. If I'm following along correctly, what's fascinating here is that Adam, for his master's thesis, is using Rerun to visualize the state of a software agent (an LLM agent).
Interesting use of Rerun!
https://github.com/gustofied/P2Engine
For sure. For instance, Google's ADK has an eval framework: you write tests and can easily run them against a given input. I'd say it's a bit unpolished, as is the rest of the rapidly developing ADK framework, but it does exist.
heya, building this. it's been used in prod for a month now and has saved my customer's ass while building general workflow automation agents. happy to chat if you're interested.
darin@mcptesting.com
(gist: evals as a service)