That's why you build extensive tooling to run your change hundreds of times in parallel against the context you're trying to fix, and then re-run hundreds of past scenarios in parallel to verify none of them breaks.
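Something like this, as a minimal sketch - check() is a stub you'd replace with a real model call plus grading, and the run counts, threshold, and scenario names are all illustrative:

    from concurrent.futures import ThreadPoolExecutor

    def check(scenario: str) -> bool:
        # Stub: call the model on `scenario` and grade the output here.
        return True  # placeholder so the sketch runs

    def pass_rate(scenario: str, runs: int = 200, workers: int = 32) -> float:
        """Empirical pass rate for one scenario run many times in parallel."""
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(lambda _: check(scenario), range(runs)))
        return sum(results) / runs

    if __name__ == "__main__":
        # Hammer the case you're trying to fix...
        assert pass_rate("new_failing_case") > 0.95
        # ...then re-run past scenarios to make sure none of them regressed.
        for past in ["scenario_001", "scenario_002"]:  # your historical suite
            assert pass_rate(past, runs=50) > 0.95, f"regression in {past}"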
In the event this comment is slathered in sarcasm, I'll ask anyway:
Do you use a tool for this? Is there some sort of tool that collects evals from live inferences (especially those that fail)?
There is no way to prove the correctness of non-deterministic (a.k.a. probabilistic) results for any interesting generative algorithm. All one can do is validate against a known set of tests, with the understanding that the set grows without bound over time.
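One common way to operationalize that - and a partial answer to the tool question above - is to freeze every failing live inference as a new test case, so the suite only ever grows. A minimal sketch, assuming a flat directory of JSON files; the layout and names are made up:

    import json
    import pathlib

    EVAL_DIR = pathlib.Path("evals")  # one JSON file per captured case

    def capture_failure(prompt: str, bad_output: str, expected: str) -> None:
        """Freeze a failed live inference as a new regression case."""
        EVAL_DIR.mkdir(exist_ok=True)
        case = {"prompt": prompt, "bad_output": bad_output, "expected": expected}
        n = len(list(EVAL_DIR.glob("*.json")))
        (EVAL_DIR / f"case_{n:04d}.json").write_text(json.dumps(case, indent=2))

    def load_suite() -> list[dict]:
        """The 'known set of tests' - it only ever grows."""
        return [json.loads(p.read_text()) for p in sorted(EVAL_DIR.glob("*.json"))]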
https://x.com/rerundotio/status/1968806896959402144
This is a use of Rerun I haven't seen before - pretty fascinating! Typically people use Rerun to visualize robotics data. If I'm following along correctly, what's interesting here is that Adam, for his master's thesis, is using Rerun to visualize the state of a software/LLM agent.
https://github.com/gustofied/P2Engine
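For a flavor of what that might look like, here's a hypothetical sketch of logging agent steps onto a Rerun timeline - the event data is faked, not taken from P2Engine, and the rr.* calls match recent rerun-sdk releases (names like Scalar and set_time_sequence have shifted between versions, so check the docs for yours):

    import rerun as rr

    # Stand-in for a real agent's event stream; in a real setup this would
    # come from the agent loop itself.
    fake_steps = [
        ("plan", "break the task into subtasks", 120),
        ("tool", "search(query='flaky eval')", 340),
        ("answer", "patched the prompt template", 95),
    ]

    rr.init("agent_trace", spawn=True)  # spawn=True opens the Rerun viewer

    for step, (kind, detail, tokens) in enumerate(fake_steps):
        rr.set_time_sequence("agent_step", step)  # index logs on a per-step timeline
        rr.log(f"agent/{kind}", rr.TextLog(detail))
        rr.log("agent/tokens", rr.Scalar(tokens))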
For sure - Google, for instance, has the ADK eval framework. You write tests, and you can easily run them against a given input. I'd say it's a bit unpolished, as is the rest of the rapidly developing ADK framework, but it does exist.
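For reference, the pytest entry point looks roughly like this in the ADK docs - treat the module path, parameter names, and file paths as a snapshot, since the API has been changing between releases:

    import pytest
    from google.adk.evaluation.agent_evaluator import AgentEvaluator

    @pytest.mark.asyncio
    async def test_basic_ability():
        # Runs the eval cases in the .test.json file against the named agent module.
        await AgentEvaluator.evaluate(
            agent_module="my_agent",
            eval_dataset_file_path_or_dir="tests/fixture/my_agent/simple.test.json",
        )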
Heya, I'm building this. It's been used in prod for a month now and has saved my customer's ass while building general workflow-automation agents. Happy to chat if you're interested.
darin@mcptesting.com
(gist: evals as a service)