You are assuming that input-transformation-based defenses from the image domain transfer to the music recognition domain, when we know they don't automatically transfer even to the speech recognition domain.

But 'protection' of any one song isn't the entire point. It takes only a fraction of a percent of the corpus data to have persistent, long-term effects on the final model, or to increase costs and review requirements for those stealing the content.

Most training is unsupervised because of the cost of, and limited access to, quality human-labeled data. So it wouldn't take much, even from some obscure, limited-market, older genres that still have active fan bases, like noise rock, for the effects to start filtering into recommendation engines and impact user satisfaction.

Most of the speech protections just force attacks into the perceptible audio range. In lo-fi portions like those in trip hop, that would be undetectable without the false positive rate going way up. With bands like Arab On Radar, Shellac, or The Oxes, it wouldn't be detectable.

But it is also like WAFs/AV software/IDS. The fact that it can't help with future threats today is immaterial. Any win against these leeches has some value.

Obviously, any company intentionally applying even the methods in your linked paper to harvest protected images would be showing willful intent to circumvent copyright protections. I am guessing most companies will just toss any file they think has active protections, simply because of how sensitive training is.

Most musicians also know that copyright only protects the rich.