Hacker News

make3 3 hours ago

I guarantee you there's positional information one way or another. They just don't mention it because positional embeddings are extremely cheap computationally, so not worth mentioning.

neosat 3 hours ago

Agreed. Audio is strongly temporal, so there is almost certainly some positional encoding one way or another.

mchinen 3 hours ago

Ah yeah, thinking about it further, it's probably just using some positional embedding based on sequence position, added in the LLM layers. For vision it needs the patch location as well.
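
For illustration only, here is a rough sketch of what "a positional embedding based on sequence position" could look like. It assumes the fixed sinusoidal scheme from the original Transformer paper; the model being discussed may well use something else (learned embeddings, rotary embeddings, etc.). All function names are made up for this example.

```python
import math

def sinusoidal_position(pos, dim):
    """Fixed sinusoidal embedding for a single position (assumed scheme)."""
    vec = []
    for i in range(0, dim, 2):
        # i is the even channel index, so the exponent is i / dim
        angle = pos / (10000 ** (i / dim))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec[:dim]  # trim the extra value when dim is odd

def add_positions(frames):
    """1-D case (e.g. audio frames): add a position vector to each embedding."""
    dim = len(frames[0])
    return [
        [x + p for x, p in zip(frame, sinusoidal_position(t, dim))]
        for t, frame in enumerate(frames)
    ]

def patch_position(row, col, dim):
    """2-D case (image patches): half the channels encode row, half column."""
    half = dim // 2
    return sinusoidal_position(row, half) + sinusoidal_position(col, dim - half)
```

The extra work is one elementwise add per token, which is negligible next to attention, and that fits the point upthread about it being cheap enough to go unmentioned.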