To add: many of the new audio models use diffusion methods that are essentially the same as those used for images - the audio generation can be thought of as image generation of a spectrogram of an audio file.
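As a rough sketch of that audio-as-image idea (assuming librosa is installed; "song.wav" is a hypothetical input file):

```python
import numpy as np
import librosa

y, sr = librosa.load("song.wav", sr=22050)               # waveform as 1-D samples
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))  # STFT magnitude
img = librosa.amplitude_to_db(S, ref=np.max)             # log-scaled 2-D "image"

# `img` is now a (freq bins x time frames) array that a diffusion
# model can treat exactly like a grayscale image.
print(img.shape)
```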

In early experiments, people literally took Stable Diffusion and fine-tuned it on labelled spectrograms of music snippets, used the fine-tuned model to generate new spectrogram images guided by text, and then turned those images back into audio by re-synthesizing the spectral image into a .wav.
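The standard trick for that last re-synthesis step is Griffin-Lim phase recovery, since the image only stores magnitude. A minimal sketch (the `generated_spectrogram.npy` filename is hypothetical, and Riffusion's actual pipeline differs in detail):

```python
import numpy as np
import librosa
import soundfile as sf

S_db = np.load("generated_spectrogram.npy")  # hypothetical model output, dB scale
S_mag = librosa.db_to_amplitude(S_db)        # back to linear magnitude

# Griffin-Lim iteratively estimates the phase the image never stored
y = librosa.griffinlim(S_mag, n_iter=60, hop_length=512)
sf.write("generated.wav", y, 22050)
```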

Riffusion was one of the first to experiment with this, 2 years ago now: https://github.com/riffusion/riffusion-hobby

The more advanced music generators out now, I believe, take more of a 'stems' approach with a larger processing pipeline to increase fidelity and add vocal-tracking capability, but the underlying idea is the same.

Any adversarial attack that hides information in the spectrogram to fool the model into categorizing a track as something it is not is no different from the adversarial attacks on images, for which mitigations have already been found.

Various forms of filtering out inaudible spectral information, coupled with methods that destroy and re-synthesize (or randomize) phase information, would likely break this poisoning attack.
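A minimal sketch of what that sanitization could look like (filenames, band edges, and filter order here are all hypothetical choices, not a vetted defense): band-limit the audio to the clearly audible range, discard the original phase entirely, and re-synthesize it with Griffin-Lim.

```python
import numpy as np
import librosa
import soundfile as sf
from scipy.signal import butter, sosfilt

def sanitize(path_in, path_out, sr=44100):
    y, _ = librosa.load(path_in, sr=sr)

    # 1. Band-pass filter: drop near-inaudible content outside ~30 Hz - 16 kHz
    sos = butter(8, [30, 16000], btype="bandpass", fs=sr, output="sos")
    y = sosfilt(sos, y)

    # 2. Keep only the STFT magnitude; the original phase is thrown away
    S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))

    # 3. Re-synthesize phase from scratch, destroying anything hidden in it
    y_clean = librosa.griffinlim(S, n_iter=60, hop_length=512)
    sf.write(path_out, y_clean, sr)

sanitize("poisoned.wav", "sanitized.wav")
```

The point of step 3 is that any perturbation encoded in phase structure simply cannot survive, because the phase the model sees is regenerated from magnitude alone.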