I haven't seen such a benchmark, although maybe one exists.

As far as benchmarks go, I'd also like to see ones that try to find what LLMs are good at. Most benchmarks seem designed to give LLMs hard problems and see whether they can succeed. In that sense, a "good" benchmark is one with a low pass rate.

But if we're going to do agentic coding, we also need to know the opposite. We need to know which types of tasks, given in which format, LLMs will succeed at with something like 95%+ accuracy. Then we can more easily build multi-prompt pipelines with high confidence in each step.
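To put rough numbers on why per-step accuracy matters, here's a minimal sketch. It assumes each step succeeds or fails independently of the others, which is a simplification, and the function name and the example rates are just made up for illustration:

```python
def pipeline_success_rate(step_rates):
    """End-to-end success probability of a multi-step pipeline.

    Assumption: steps fail independently, so the per-step
    success rates simply multiply.
    """
    total = 1.0
    for rate in step_rates:
        total *= rate
    return total


if __name__ == "__main__":
    # Hypothetical per-step success rates for a 5-step pipeline.
    for per_step in (0.80, 0.95, 0.99):
        overall = pipeline_success_rate([per_step] * 5)
        print(f"{per_step:.2f} per step -> {overall:.3f} end to end")
```

Under that independence assumption, five steps at 95% each only get you to about 77% end to end, and five steps at 80% drop to about 33%. So knowing which tasks sit reliably above the 95% line is what makes longer pipelines workable at all.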