Roughly: we had Cursor software engineers record real questions they were asking models, and then had them record the PR they made that contained the result. We then cleaned these up. That is the benchmark.

Are you able to give a sense of how many questions there were, which domains they covered, and what that split looked like in percentage terms?

As a user, when an improvement is claimed, I want to know whether it's relevant to the work I do, and whether that claim was tested in a reasonable way.

These products aren’t just expensive - they also require switching your whole workflow, which is becoming an increasingly big ask in this space.

It’s pretty important for me to be able to understand, and subsequently believe, a benchmark - when this information isn’t present, I find it really hard not to read it as ad copy.

Which programming languages/tools/libraries did the team's questions/code involve?