If this makes sense, how is the study able to produce a reasonable measure of how long an issue/task should have taken versus how long it actually took with AI, in order to conclude that using AI was slower?

Or is it comparing how long the dev thought it should take with AI vs how long it actually took, which now bakes in the dev's own guess of how AI impacts their productivity?

When it's hard to estimate how difficult an issue will be to complete, how does the study account for this? What percentage of the speed-up or slowdown would just be noise from estimates being difficult?

I do appreciate that this stuff is very hard to measure.

An easier way to think about it: suppose you timed how long each ticket in your backlog took to complete, recorded whether or not you were drunk while working on it, and selected each ticket at random from your backlog. The assumption (null hypothesis) is that being drunk has no effect on ticket completion time.

Using the magic of statistics, once you have completed enough tickets, you can determine whether the null hypothesis can be rejected (at a given level of statistical certainty) and, if it can, how large the difference is (with a margin of error).
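To make that concrete, here's a minimal sketch in Python of the kind of test involved. Everything in it is made up for illustration: the completion times are synthetic, the sample sizes are arbitrary, and Welch's t-test is just one reasonable choice, not necessarily what the study used.

```python
import numpy as np
from scipy import stats

# Hypothetical data: minutes to complete each ticket, split by condition.
# These numbers are invented for illustration only.
rng = np.random.default_rng(0)
sober = rng.normal(loc=120, scale=40, size=60)   # 60 tickets done sober
drunk = rng.normal(loc=150, scale=40, size=60)   # 60 tickets done drunk

# Null hypothesis: the condition has no effect on completion time.
# Welch's t-test asks how surprising the observed difference in means
# would be if the null hypothesis were true.
t_stat, p_value = stats.ttest_ind(drunk, sober, equal_var=False)

# Effect size: difference in mean completion time, with a rough 95%
# confidence interval from the standard error of the difference.
diff = drunk.mean() - sober.mean()
se = np.sqrt(drunk.var(ddof=1) / len(drunk) + sober.var(ddof=1) / len(sober))
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"p-value: {p_value:.4f}")
print(f"Estimated slowdown: {diff:.1f} min (95% CI {ci_low:.1f} to {ci_high:.1f})")
```

The point of randomizing which tickets land in which condition is that the difficulty of individual tickets averages out across enough samples, so you don't need a good per-ticket estimate of how long each one "should" have taken.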

That's not to say there couldn't be other causes for the difference (if there is one), but that's how science proceeds, generally.

The challenge with “controlled experiments” is that telling developers to “use AI for all of your tickets for a month” forces a specific tool onto problems that may not benefit from that tool.

Most corporate software problems don't need AI at all. They're really coordination/communication/administration problems hiding as technical problems.