When I was on Google Docs, I watched the Google Forms team build a sophisticated ML model that attempted to detect when people were using it for nefarious purposes.

It underperformed banning the word "password" from a Google Form.

So that's what they went with.

I wonder if this is just an example of Goodhart's law. How did they measure performance of those models? I would imagine they tried measuring against known cases of forms misuse, aka those forms that contained 'password' field.