This approach seems kind of backwards to me. Why try to detect everything except the thing you're trying to remove instead of either sampling a few uhs and ums and treating them as noise to be silenced (with a sharp crossfade to the noise floor that doesn't interrupt speech flow) or finetuning a model to detect them specifically for full automation?
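
For what it's worth, here is a rough sketch of the "crossfade to the noise floor" idea, not anyone's actual implementation. It assumes the start and end sample indices of the filler are already known (detecting them is the hard part and is not shown) and that a room-tone track of the same length is available. The names `crossfade_to_floor`, `floor`, and `fade` are made up for illustration.

```python
def crossfade_to_floor(samples, start, end, floor, fade=4):
    """Replace samples[start:end] with room tone, blending at both edges.

    samples: list of floats (the speech signal)
    floor:   list of floats, same length as samples (hypothetical room-tone track)
    fade:    number of samples used for the blend at each edge
    """
    out = list(samples)            # leave the caller's list untouched
    n = end - start
    fade = min(fade, n // 2)       # the two fades must not overlap
    for i in range(start, end):
        pos = i - start
        # w is the weight of the original signal:
        # close to 1 at the edges of the region, 0 in the middle.
        if pos < fade:
            w = 1.0 - (pos + 1) / (fade + 1)
        elif pos >= n - fade:
            w = 1.0 - (n - pos) / (fade + 1)
        else:
            w = 0.0
        out[i] = w * samples[i] + (1.0 - w) * floor[i]
    return out


# Toy usage: a constant "signal" with a silent floor, filler at samples 5..14.
signal = [1.0] * 20
room_tone = [0.0] * 20
cleaned = crossfade_to_floor(signal, 5, 15, room_tone, fade=2)
```

A linear ramp is the simplest choice here; an equal-power curve would probably sound smoother on real audio, and a real `fade` would be a few milliseconds' worth of samples rather than 2.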

> instead of either sampling a few uhs and ums and treating them as noise to be silenced

If you're not paying ttention, ctting out specific sounds can easily cause more trouble. I for one would be quite pset if I couldn't hear the pire's reasoning for calling a foul.