I'm curious whether evals of the DAEMONs, and replays for debugging, are on the roadmap?

I looked but did not see any facility for collecting/managing evals in the Charlie docs.

Docs drift might sound easy for agents, but after working on it at https://promptless.ai for about two years, it's been trickier than just "make some skills". We've got an agent that watches PRs and suggests docs changes. Getting the suggestions good enough that doc owners would actually accept them took a fair bit of evals: matching the non-AI voice of existing content, and even the "simple" act of deciding whether a given PR warrants a docs change at all.
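For that last problem, the shape of an offline eval is simple: a set of labeled PRs and a pass rate. A minimal sketch, where the cases and the `needs_docs_change` heuristic are entirely hypothetical stand-ins for a real labeled dataset and a real model call:

```python
# Minimal sketch of an offline eval for "does this PR warrant a docs change?".
# The cases and the classifier below are hypothetical stand-ins, not the
# actual Promptless dataset or agent.

CASES = [
    # (PR title, files touched, expected label)
    ("Add rate limiting to public API", ["api/limits.py"], True),
    ("Bump dev dependency versions", ["requirements-dev.txt"], False),
    ("Rename internal helper, no behavior change", ["utils/io.py"], False),
    ("Support new --json output flag", ["cli/main.py"], True),
]

def needs_docs_change(title: str, files: list[str]) -> bool:
    # Stand-in heuristic; in practice this would be the agent under test.
    user_facing = any(f.startswith(("api/", "cli/")) for f in files)
    return user_facing and "internal" not in title.lower()

def run_eval() -> float:
    # Fraction of cases where the classifier matches the expected label.
    correct = sum(
        needs_docs_change(title, files) == expected
        for title, files, expected in CASES
    )
    return correct / len(CASES)

if __name__ == "__main__":
    print(f"accuracy: {run_eval():.0%}")
```

Re-running a suite like this whenever the underlying model changes is what catches the regressions.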

I have benefited greatly from evals catching things (especially as models change), to the point where I'm loath to work without them.