At 20 min per task, you might as well code it yourself. Bill James needs to write a book on sabermetrics for LLM benchmarks.