We need a benchmark that tests a model's ability to do LLM research.