I do feel frustrated with the current state of evaluations for long-lived sessions with many tool calls -- by default, OpenAI's built-in eval system seems to rate a chat completion that ends with a tool call as "bad", since the tool's response only appears in the next completion.
But our stack is in Go, and it's been tough watching so many observability tools focus on Python SDKs rather than a language-agnostic endpoint proxy like Helicone's.
We're working on that right now and would love to hear your opinions (if you're interested, you can email us at team@lucidic.ai).