I do feel frustrated with the current state of evaluations for long-lived sessions with many tool calls -- by default, OpenAI's built-in eval system seems to rate a chat completion that ends with a tool call as "bad", since the tool's response only appears in the next completion.
But our stack is in Go, and it's been tough watching so many observability tools focus on Python SDKs rather than a language-agnostic endpoint proxy like Helicone's.
We're working on that right now and would love to hear your opinions (if you're interested, you can email us at team@lucidic.ai).