Not hard to understand what's going on here. They RL'd around patterns in their data and specific capabilities, so of course they'd construct a benchmark that's aligned with the training set.

Ironically, their benchmark might be more accurate than Artificial Analysis for a narrow slice of things that Cursor's Eigencustomer is really interested in. Otherwise I'd take it as just another data point.

(I work at Cursor) CursorBench includes many evals drawn from actual engineering tasks done by the Cursor team, including tasks in our private codebase. This codebase is held out from training, so models, including Composer, haven't seen it.