I had a similar thought. What if the result, statistical and significance critiques aside, mostly means that when it comes to first-year tutoring of law students, the vibe, tone and overall presentation of arguments weigh a lot, maybe even more than the factual arguments themselves?

Under that framing I don't find it surprising at all that teachers prefer the more polished answers generated by AI, because if LLMs are good at one thing, it is being confident in whatever they generate and presenting it convincingly.