It’s tough to write good questions for LLM evaluations. The models are so good at picking up subtle cues that they can pass a multiple-choice test when given only the answer options and not the questions.
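One way to check a question set for this kind of leakage is a choices-only baseline: hide every question, show only the options, and compare accuracy against chance. Here is a minimal sketch of that idea. Everything in it is an assumption made for illustration: the item format (`choices`, `answer`), the function names, the toy items, and the `longest_option_predictor` stand-in, which is a simple heuristic rather than a real model. In practice `predict` would wrap a call to whatever model you are evaluating.

```python
from typing import Callable, Dict, List, Sequence

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def choices_only_prompt(choices: Sequence[str]) -> str:
    """Format the answer options with no question text at all."""
    header = "The question is hidden. Pick the most likely correct option."
    lines = [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(choices)]
    return "\n".join([header] + lines)


def choices_only_accuracy(items: List[Dict], predict: Callable[[str], str]) -> float:
    """Fraction of items answered correctly from the options alone."""
    correct = 0
    for item in items:
        reply = predict(choices_only_prompt(item["choices"]))
        guess = reply.strip().upper()[:1]  # keep only the leading letter
        if guess == LETTERS[item["answer"]]:
            correct += 1
    return correct / len(items)


def chance_accuracy(items: List[Dict]) -> float:
    """Expected accuracy of uniform random guessing."""
    return sum(1 / len(item["choices"]) for item in items) / len(items)


def longest_option_predictor(prompt: str) -> str:
    """Hypothetical stand-in for a model: always pick the longest option."""
    options = [line for line in prompt.splitlines()[1:] if line[1:3] == ". "]
    return max(options, key=len)[0]


# Toy items invented for this sketch; "answer" is the index of the correct option.
items = [
    {"choices": ["Red", "Blue",
                 "All of the above, depending on lighting conditions", "Green"],
     "answer": 2},
    {"choices": ["Never", "Always",
                 "Only when the input is sorted and contains no duplicates",
                 "Sometimes"],
     "answer": 2},
    {"choices": ["12", "15", "18", "21"], "answer": 3},
]

if __name__ == "__main__":
    print("choices-only accuracy:",
          choices_only_accuracy(items, longest_option_predictor))
    print("chance accuracy:", chance_accuracy(items))
```

If the choices-only score lands well above chance, the options themselves are giving the answer away (through length, specificity, or phrasing), and the questions likely need rewriting before the benchmark says much about the skill you meant to test.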