One mediocre paper/study (it should not even be called that with all the bias and sample size issues) and now we have to put up with stories re-hashing and dissecting it. I really hope these don't get upvoted more in the future.
16 devs. And they weren't allowed to pick which tasks they used the AI on. Ridiculous. Also using it on "old and >1 million line" codebases and then extrapolating that to software engineering in general.
Writers like this then theorize about why AI isn't helpful, those "theories" get repeated until they feel less like theories and more like facts, and it all proliferates into an echo chamber of "AI isn't a useful tool." There are too many anecdotes, plus my own personal experience, pointing the other way for me to accept that it isn't useful.
It is a tool and you have to learn it to be successful with it.
> And they weren't allowed to pick which tasks they used the AI on.
They were allowed to pick whether or not to use AI on a subset of tasks. They weren't forced to use AI on tasks that don't make sense for AI.
That is not true; AI usage was decided randomly. From the paper:
"To directly measure the impact of AI tools on developer productivity, we conduct a randomized controlled trial by having 16 developers complete 246 tasks (2.0 hours on average) on well-known open-source repositories (23,000 stars on average) they regularly contribute to. Each task is randomly assigned to allow or disallow AI usage, and we measure how long it takes developers to complete tasks in each condition."
Directly from the paper:
> If AI is allowed, developers can use any AI tools or models they choose, including no AI tooling if they expect it to not be helpful. If AI is not allowed, no generative AI tooling can be used.
AI is allowed, not required.
My bad, I didn't read you correctly. What you said is true.
I do think it's important to emphasize, though, that they didn't get to choose in general, which your wording (even though it is correct) does not make evident.
On half the tasks they were not allowed to use AI.
Yes, and the other half they had the option to use AI. That's why I said they were allowed to pick whether or not to use AI on a subset of tasks. On the other subset they were not allowed to use AI.
It's just the same as all the anecdotal evidence from hype guys on Twitter claiming 10x performance on coding... Same same but different.
> and then extrapolating that to software engineering in general.
To the credit of the paper's authors, they were very clear that they were not making a claim about software engineering in general. But everyone wants to reinforce their biases, so...
Great for the authors. But everyone else seems to be extrapolating. Authors have a responsibility and should recognize how their work will be used.
METR may have an OK mission overall, but their motivation is questionable. They published something like this to get attention. Mission accomplished on that, but they had to have known how this would be twisted.
>One mediocre paper/study (it should not even be called that with all the bias and sample size issues)
Can you bring up any specific issues with the METR study? Alternatively, can you cite a journal that critiques it?
It was just published. It's too new for anyone to have conducted a direct study critiquing it, and journals don't just publish critiques anyway. It would have to be a study that disputes the results.
They used 16 developers. The confidence intervals are wide, and a few atypical issues per dev could swing the headline figure (see the sketch below).
Veteran maintainers working on projects they know inside out. That in itself is a bias.
The devs supplied the issue list (which was then randomized), which still invites subtle self-selection bias: maintainers may pick tasks they enjoy or that showcase deep repo knowledge, exactly where AI probably has the least marginal value.
Time was self-reported, not independently logged.
No direct quality metric was possible. Could the AI-assisted code have been better?
The Hawthorne effect: knowing they are observed and paid may make devs over-document, over-prompt, or simply take their time.
Many of the devs were new to Cursor.
Bias in forecasting.
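To make the sample-size point concrete, here's a toy bootstrap sketch in Python. It is not the paper's analysis; the numbers (`true_slowdown`, the lognormal noise) are made up purely to illustrate how wide the uncertainty gets when the effective unit of analysis is 16 developers.

```python
import numpy as np

rng = np.random.default_rng(0)

n_devs = 16            # same developer count as the study
true_slowdown = 1.2    # assumption: AI tasks take ~20% longer on average

# Assumed per-developer ratio of (time with AI / time without AI),
# with substantial developer-to-developer noise.
dev_ratios = true_slowdown * rng.lognormal(mean=0.0, sigma=0.4, size=n_devs)

# Bootstrap over developers, the unit that actually limits the sample size.
boot_means = [
    rng.choice(dev_ratios, size=n_devs, replace=True).mean()
    for _ in range(10_000)
]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"point estimate: {dev_ratios.mean():.2f}x, "
      f"95% bootstrap CI: [{lo:.2f}x, {hi:.2f}x]")
```

The point is only that with 16 developers, resampling (or dropping a couple of atypical devs) moves the interval noticeably, which is the small-n concern raised above.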