There was a thoughtful reply here with some feedback on the scores. It seems to have been deleted while I was writing a reply. In the interest of substantive discussion, I'm posting my reply below, since I think it's still valuable information.
--
Yes, I noticed there were a few things off about this example (e.g., some questions feel inapplicable to some of the examples, and some scores feel too optimistic for the bad example), and I intentionally left them in so as not to window-dress things too much.
To add to some of your observations, I'll note the following:
1. Automatic question generation runs before I've given any examples in this particular chat. This can be a positive (you can get started without even providing an example of your own data), but it also means we sometimes add questions that don't make sense for the data you actually have. The co-pilot is meant to be iterative for that reason. (As an example, towards the end of the chat, I do ask it to remove some questions that don't feel applicable.)
2. The model still has to output a score for every question, even ones that don't apply to a particular input. We're working on a new system that understands which questions actually apply and can turn off questions that are impossible to answer given the inputs provided (a rough sketch of the idea is below the list).
3. We do get feedback from users that the scores sometimes feel off; in some cases they're too high, in others too low. We're working on an interface for calibrating the scores to your own preferences, e.g. from a small amount of thumbs-up/thumbs-down data (also sketched below). There's a tradeoff here, though: we're also trying to make the evaluation process much easier than today's "prompt an LLM as a judge" paradigm, where writing a prompt with a rubric can take a substantial amount of time, and any kind of calibration adds friction for users.
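To make point 2 concrete, here's a rough, hypothetical sketch of what I mean by "turning questions off": each question declares which inputs it needs, and if any are missing for a given example, we skip the score rather than forcing one. The names here (Question, score_with_llm, etc.) are purely illustrative, not our actual API:

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    text: str
    # Input fields this question needs, e.g. {"reference_answer"}.
    required_inputs: set = field(default_factory=set)


def score_example(example: dict, questions: list, score_with_llm) -> dict:
    """Return a score per question, or None when the question can't apply."""
    scores = {}
    for q in questions:
        missing = q.required_inputs - set(example)
        if missing:
            # Impossible to answer for this input: switch the question off
            # instead of forcing the model to guess a score.
            scores[q.text] = None
            continue
        scores[q.text] = score_with_llm(q.text, example)
    return scores
```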
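And for point 3, the calibration idea is roughly in the spirit of Platt scaling: fit a tiny model that remaps the raw judge scores using a handful of thumbs-up/thumbs-down labels, so the adjusted scores better match what a particular user considers good or bad. Again, this is an illustrative sketch with made-up data, not our implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Raw judge scores in [0, 1] alongside the user's thumbs (1 = up, 0 = down).
raw_scores = np.array([[0.9], [0.8], [0.75], [0.6], [0.4], [0.3]])
thumbs = np.array([1, 1, 0, 0, 0, 0])

# A one-feature logistic model that rescales raw scores to this user's taste.
calibrator = LogisticRegression().fit(raw_scores, thumbs)


def calibrated(score: float) -> float:
    """Probability the user would thumbs-up an output with this raw score."""
    return float(calibrator.predict_proba([[score]])[0, 1])


print(calibrated(0.85), calibrated(0.5))
```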
Overall, we release new models every other week and track improvements to both question generation and question scoring on internal benchmarks. If you play around with the system more and find other ways it doesn't work as you'd expect or like, please feel free to email me your examples; we'd be happy to prioritize looking into them.