Thanks for your feedback on the video. Great point about going step by step instead of switching mid-stream to a pre-built session :). We have another, even simpler version that goes slower, step by step, which might have been a better fit for this post. The challenge has been balancing a showcase of the wide feature set against the video's duration.
We have a spreadsheet integration (which I might post as a comment) for the use case you mentioned. The scorer is quite lightweight, so it is easy to integrate into your existing pipelines instead of building yet another pipeline/framework. The co-pilot is specifically for triangulating the right set of metrics (which are subjective and depend on your taste), and that does require looking at examples a few at a time and making a judgment call. But I agree that once you are done with that, you want to quickly transition off of this to either code or other frameworks like sheets, promptfoo, etc.