An easier way to think about it: suppose you timed how long each ticket in your backlog took to complete, recorded whether you were drunk or not when you worked on it, and picked each ticket at random from the backlog. The assumption (the null hypothesis) is that being drunk has no effect on ticket completion time.

Using the magic of statistics, if you have completed enough tickets, we can test whether the data are consistent with the null hypothesis (at a chosen significance level) and, if they aren't, estimate how large the difference is (with a margin of error).
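
To make that concrete, here is a rough sketch in Python of one way to run such a test. It uses a permutation test for the "is there a difference" question and a percentile bootstrap for the margin of error; both are my choice for illustration, not the only option (a t-test would be the textbook alternative). The function names and the completion times are made up.

```python
import random
from statistics import mean


def permutation_test(sober, drunk, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in mean completion time.

    Under the null hypothesis the sober/drunk labels are interchangeable,
    so we shuffle them and count how often a difference at least as large
    as the observed one shows up by chance.
    """
    rng = random.Random(seed)
    observed = mean(drunk) - mean(sober)
    pooled = list(sober) + list(drunk)
    n_sober = len(sober)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = mean(pooled[n_sober:]) - mean(pooled[:n_sober])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


def bootstrap_ci(sober, drunk, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the difference in means."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        s = rng.choices(sober, k=len(sober))  # resample with replacement
        d = rng.choices(drunk, k=len(drunk))
        diffs.append(mean(d) - mean(s))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical completion times in hours, invented for this example.
sober_hours = [3.1, 2.8, 4.0, 3.5, 2.9, 3.7, 3.3, 4.2]
drunk_hours = [4.9, 5.6, 4.4, 6.1, 5.0, 5.8, 4.7, 5.3]

observed_diff, p_value = permutation_test(sober_hours, drunk_hours)
ci_low, ci_high = bootstrap_ci(sober_hours, drunk_hours)

print(f"difference in means: {observed_diff:.2f} h")
print(f"p-value under the null: {p_value:.4f}")
print(f"95% interval for the difference: {ci_low:.2f} to {ci_high:.2f} h")
```

A small p-value means the data would be surprising if drunkenness had no effect; the interval is the "how large, with a margin of error" part. With only a handful of tickets the interval gets wide, which is why "enough tickets" matters.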

That's not to say there couldn't be other causes for the difference (if there is one), but that's how science proceeds, generally.

The challenge with “controlled experiments” is that telling developers to “use AI for all of your tickets for a month” forces a specific tool onto problems that may not benefit from it.

Most corporate software problems don't need AI at all. They're really coordination/communication/administration problems disguised as technical problems.