Those higher-level kinds of mode collapse are hard to quantify in an automated way. To fix them, you would need interventions upstream, at pre- and post-training.
This approach targets the kinds of mode collapse that we can meaningfully measure and fix after the fact, which limits it to these verbal tics. It doesn't fix the higher-level mode collapse in semantics & creativity that you're identifying, but I think fixing the verbal tics is still important and useful.
> but I think fixing the verbal tics is still important and useful.
I don't. I think they're useful for flagging the existence of mode-collapse and for providing convenient tracers for AI-written prose. Erasing only the verbal tics with the equivalent of 's/ - /; /g' (look ma! no more 4o em dashes!) is about the worst solution you could come up with, and if adopted it would lead to a kind of global gaslighting. It's the equivalent of a vaccine for COVID which only suppresses coughing but doesn't change R, or of fixing a compiler warning by disabling the check.
If you wanted to do useful research here, you'd be doing the opposite. You'd be figuring out how to make the verbal expressions even more sensitive to the underlying mode-collapse, both to help research into fixing it and to raise awareness. (This would be useful even on the released models, to more precisely quantify their overall mode-collapse, which I think is poorly captured by existing creative writing benchmarks, and which is one reason I've had a hard time believing things like Eqbench rankings.)
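To make "measure the tics instead of erasing them" concrete, here is a rough sketch of what a tracer count could look like: occurrences of a few surface patterns per 1,000 words. The pattern list and the name `tic_rates` are hypothetical, picked for illustration only. This is not a validated detector, and nothing here shows that these counts track the underlying mode-collapse; that link is exactly what would need research.

```python
import re

# Hypothetical tracer patterns, for illustration only (not a validated list).
TIC_PATTERNS = {
    "em_dash": re.compile("\u2014"),
    "delve": re.compile(r"\bdelv(?:e|es|ed|ing)\b", re.IGNORECASE),
    "tapestry": re.compile(r"\btapestr(?:y|ies)\b", re.IGNORECASE),
}


def tic_rates(text, patterns=TIC_PATTERNS):
    """Return occurrences of each pattern per 1,000 whitespace-separated words."""
    n_words = len(text.split())
    if n_words == 0:
        # Avoid division by zero on empty input.
        return {name: 0.0 for name in patterns}
    return {
        name: 1000.0 * len(pattern.findall(text)) / n_words
        for name, pattern in patterns.items()
    }
```

Run over a batch of samples from one model, the rates give a crude per-model profile that can be compared against a human-written baseline. Rewriting the text to zero out these counts would remove the signal without changing whatever produces it.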