It was just published, so it's too new for anyone to have run a direct study critiquing it, and journals don't publish standalone critiques anyway; it would have to be a study that disputes the results.

They used only 16 developers, so the confidence intervals are wide and a few atypical issues per developer could swing the headline figure.
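To see how fragile a headline figure can be at that sample size, here is a minimal bootstrap sketch. It is not from the study and every number in it is invented; it just shows how two atypical developers out of 16 can shift the mean and stretch the interval.

```python
import random
import statistics

random.seed(0)

# Hypothetical per-developer time ratios (time with AI / time without AI);
# values above 1 mean slower with AI. Invented numbers, not data from the study.
speedups = [1.1, 0.9, 1.3, 1.2, 0.8, 1.4, 1.0, 1.1,
            1.2, 0.95, 1.05, 1.3, 1.15, 0.85, 2.0, 2.2]  # last two are atypical

def bootstrap_ci(data, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean."""
    means = sorted(
        statistics.mean([random.choice(data) for _ in data])
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.mean(data), (lo, hi)

for label, data in [("all 16 devs", speedups),
                    ("minus 2 atypical devs", speedups[:-2])]:
    mean, (lo, hi) = bootstrap_ci(data)
    print(f"{label:<22} mean {mean:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

Dropping the two hypothetical outliers noticeably moves the mean and tightens the interval, which is the sense in which a handful of atypical developers or issues can swing the result at n = 16.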

The participants were veteran maintainers working on projects they know inside out. That is itself a bias.

The devs supplied the issue list (which was then randomized), but that still allows subtle self-selection bias: maintainers may pick tasks they enjoy or that showcase deep repo knowledge, exactly where AI probably has the least marginal value.

Time was self-reported, not independently logged.

No direct quality metric is possible. Could the AI-assisted code actually have been better?

The Hawthorne effect: knowing they are being observed and paid may make devs over-document, over-prompt, or simply take their time.

Many of the devs were new to Cursor, so part of the measured effect may simply be a learning curve.

Bias in forecasting. The forecast speedups used as a point of comparison are subjective estimates, and optimism or anchoring there can't be ruled out.