Playing with the background I tried to Isolate just the espresso machine and the train sounds in one of their demos and it seemed to fail. Maybe not the desired use case, but I thought it was odd that I could break it so easily on the sample material.
Footsteps worked pretty well when I tried that on the other hand. I wonder if lot of it has to do with how well the model understands what the english description of the sound should sound like...
i do think that’s the case. i tried a few different ways to write x and got meaningfully varied results