My takeaway is that it's difficult to make a "generic enough" evaluation that encompasses all things we use an LLM for, e.g. code, summaries, jokes. Something with free lunches.