Good idea about the leaderboard for open vs closed models!

Point taken on using an agent. I went that route because part of the goal for this benchmark is to inform which models I push in my agentic word processor, which uses tools for focused proofreading/editing. It's much faster and generally cheaper to use tools for surgical changes on large documents, rather than having the model spit out the entire document with all issues corrected. So yes, I am trying to measure agentic abilities here.

A simple one-pass full-rewrite test would also make an interesting benchmark, though.