> It's a really fabulous study...
Ehhhh... not so much. It had serious design flaws in both the protocol and the analysis. This blog post is a fairly approachable explanation of what's wrong with it: https://www.argmin.net/p/are-developers-finally-out-of-a-job
Hey, thanks for linking this! I'm a study author, and I greatly appreciate that this author dug into the appendix and provided feedback so that other folks can read it as well.
A few notes if it's helpful:
1. This post is primarily worried about ordering considerations -- I think this is a valid concern. We explicitly call this out in the paper [1] as a factor we can't rule out -- see "Bias from issue completion order (C.2.4)". We have no evidence this occurred, but we also don't have evidence it didn't.
2. "I mean, rather than boring us with these robustness checks, METR could just release a CSV with three columns (developer ID, task condition, time)." Seconded :) We're planning on open-sourcing pretty much this data (and some core analysis code) later this week here: https://github.com/METR/Measuring-Early-2025-AI-on-Exp-OSS-D... - star if you want to dig in when it comes out.
3. As I said in my comment on the post, the takeaway at the end of the post is that "What we can glean from this study is that even expert developers aren’t great at predicting how long tasks will take. And despite the new coding tools being incredibly useful, people are certainly far too optimistic about the dramatic gains in productivity they will bring." I think this is a reasonable takeaway from the study overall. As we say in the "We do not provide evidence that:" section of the paper (Page 17), we don't provide evidence across all developers (or even most developers) -- and ofc, this is just a point-in-time measurement that could totally be different by now (from tooling and model improvements in the past month alone).
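For anyone planning to dig into the data once it's released, here's a minimal sketch of what a first-pass analysis of the described three-column CSV might look like. The column names and values are hypothetical (the release isn't out yet), and real analysis would need the paired structure of the data, but grouping completion times by condition is the obvious starting point:

```python
from collections import defaultdict

def mean_time_by_condition(rows):
    """Group (developer_id, condition, time) records by condition and
    return the mean completion time per condition."""
    totals = defaultdict(lambda: [0.0, 0])  # condition -> [sum, count]
    for row in rows:
        totals[row["condition"]][0] += float(row["time_minutes"])
        totals[row["condition"]][1] += 1
    return {cond: s / n for cond, (s, n) in totals.items()}

# Made-up example rows (NOT study data), mimicking the proposed CSV schema:
rows = [
    {"developer_id": "d1", "condition": "ai_allowed", "time_minutes": "90"},
    {"developer_id": "d1", "condition": "ai_disallowed", "time_minutes": "75"},
    {"developer_id": "d2", "condition": "ai_allowed", "time_minutes": "120"},
]
```

A serious re-analysis would of course want per-developer pairing and uncertainty estimates rather than raw condition means, but a schema this simple makes those robustness checks easy for anyone to run.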
Thanks again for linking, and to the original author for their detailed review. It's greatly appreciated!
[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf
Thanks for the response, you make some very good points. Sorry, I had missed your response on the original post. I don't know whether it wasn't there yet, or whether it's because their blog is configured to only show the first two comments by default. :/ Either way, my bad.
I think my bias as someone who spends too much time looking at social science papers is that the protocol allows for spillover effects that, to me, imply that the results must be interpreted much more cautiously than a lot of people are doing. (And then on top of that I'm trying to be hyper-cautious and skeptical when I see a paper whose conclusions align with my biases on this topic.)
Granted, that sort of thing is my complaint about basically every study on developer productivity when using LLMs that I've seen so far. So I appreciate how difficult this is to study in practice.