Agreed. Audio is strongly temporal, so there is almost certainly some form of positional encoding, one way or another.