Look long enough at the literature on any machine learning task and someone invariably gets the idea to turn the data into an image and do machine learning on that. Sometimes it works out (turning binaries into images and doing malware detection with a CNN surprisingly works), but usually it doesn't. Just like in this example, the images usually end up as a kludge to paper over some deficiency in the prevalent input encoding.
I can certainly believe that images bring certain advantages over text for LLMs: the image representation does contain useful information that we as humans use (like better information hierarchies encoded in text size, boldness, color, saturation and position, not just n levels of markdown headings), letter shapes are already optimized for this kind of encoding, and continuous tokens seem to bring some advantages over discrete ones. But none of these advantages need the roundtrip via images; they merely point to how crude the state of the art of text tokenization is.
A great example of this is converting music into an image, training a model to generate new images, and converting those back into music. It was surprisingly successful. I think this approach is still used by the current music generators.
The current music generators use next token prediction, like LLMs, not image denoising.
[0] https://arxiv.org/abs/2503.08638 (grep for "audio token")
[1] https://arxiv.org/abs/2306.05284
You are talking about piano roll notation, I think. While it's 2d data, it's not quite the same as actual image data. E.g., 2d conv and pooling operations are useless for music. The patterns and dependencies are too subtle to be captured by spatial filters.
I am talking about spectrograms (a Fourier transform into the frequency domain, plotted over time), which turn a song into a 2D image. That image is then used to train something like Stable Diffusion (some actually used Stable Diffusion itself) to generate new ones, which are then converted back into audio. Riffusion used this approach.
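In case it helps, here's a minimal sketch of that round trip in plain NumPy/SciPy, with a synthetic tone standing in for a real track; the phase-reconstruction caveat at the end is my addition, not something Riffusion documents here:

```python
import numpy as np
from scipy.signal import stft, istft

# Toy "song": 2 seconds of a 440 Hz tone at 22.05 kHz (stand-in for a real waveform).
fs = 22050
t = np.arange(2 * fs) / fs
audio = np.sin(2 * np.pi * 440 * t)

# STFT: Fourier transforms over short windows, laid out over time -> a 2D array.
freqs, frames, Z = stft(audio, fs=fs, nperseg=1024)

# The "image" a diffusion model would see is just the magnitude, e.g. on a dB scale.
spectrogram = 20 * np.log10(np.abs(Z) + 1e-8)

# Exact inversion needs the complex STFT (magnitude *and* phase).
_, reconstructed = istft(Z, fs=fs, nperseg=1024)

# A generated image only gives you magnitudes, so a pipeline like Riffusion has to
# estimate the phase (e.g. with Griffin-Lim) before going back to audio -- one reason
# the spectrogram round trip is lossy.
```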
If you think about it, a music sheet is just a graph of a Fourier transform. It shows, at any point in time, what frequency is present (the pitch of the note) and for how long (the duration of the note).
It is no such thing. Nobody maps overtones onto sheet music, durations are only nominal, you need to macroexpand all the flats/sharps, volume is conveyed by vibe-words, it carries 500+ years of historical compost, and so on. Sheet music to FFT is like wine tasting to a healthy meal.
A spectrogram is lossy and not a one-to-one mapping of the waveform. Riffusion is, afaik, limited to five-second clips. For these, structure and coherence over time aren't important, and the data is strongly spatially correlated: e.g., adjacent to a blue pixel is another blue pixel. To the best of my knowledge, no models synthesize whole songs from spectrograms.
How does Spotify “think” about songs when it is using its algos to find stuff I like?
Does it really need to think about the song contents? It can just cluster you with other people that listen to similar music and then propose music they listen to that you haven't heard.
That's one method they use, but "just cluster" is doing a lot of heavy lifting in that sentence. It's why Erik Bernhardsson came up with the Approximate Nearest Neighbors Oh Yeah algorithm (ANNOY for short); a minimal usage sketch follows the links below.
> We use it at Spotify for music recommendations. After running matrix factorization algorithms, every user/item can be represented as a vector in f-dimensional space. This library helps us search for similar users/items. We have many millions of tracks in a high-dimensional space, so memory usage is a prime concern.
[0] https://erikbern.com/2013/04/12/annoy.html
[1] https://github.com/spotify/annoy?tab=readme-ov-file
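If you're curious what that looks like in practice, here's a minimal sketch using Annoy's documented API, with random vectors standing in for the factorized track embeddings the quote mentions:

```python
import random
from annoy import AnnoyIndex

f = 40                               # dimensionality of the track vectors
index = AnnoyIndex(f, "angular")     # angular ~ cosine distance

# Pretend these are item vectors produced by matrix factorization.
for item_id in range(100_000):
    index.add_item(item_id, [random.gauss(0, 1) for _ in range(f)])

index.build(10)                      # 10 trees: more trees -> better recall, more memory
similar_tracks = index.get_nns_by_item(0, 10)   # the 10 items "nearest" to item 0
```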
Yeah, I am obviously not a data scientist.
I guess what I was getting at is that they don't technically even need to know genres to recommend songs. In practice, though, they probably have to know them anyway for playlists, but I assume they can have the song owners provide that when the songs are uploaded, and artists specify it when they create their profile.
This article [0] investigates some of the feature extraction they do, so it's not just collaborative filtering.
[0]: https://www.music-tomorrow.com/blog/how-spotify-recommendati...
I've seen this approach applied to spectrograms. Convolutions do make enough sense there.
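As a toy illustration (PyTorch, made-up shapes and class count, not any particular published model): the 2D filters see local time-frequency patches, which is where harmonics and onsets actually live, so convolutions aren't obviously wrong here the way they are for piano-roll data.

```python
import torch
import torch.nn as nn

# Hypothetical toy classifier over a (1, freq_bins, time_frames) spectrogram "image".
model = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # filters over local time-frequency patches
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(32, 10),                            # e.g. 10 made-up genre classes
)

spec = torch.randn(8, 1, 128, 431)   # batch of 8 fake mel-spectrograms (128 bins x 431 frames)
logits = model(spec)                 # -> shape (8, 10)
```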
Doesn't this more or less boil down to OCR scans of books having more privileged information than a plaintext file? In which case a roundtrip won't add anything?
[0] https://web.archive.org/web/20140402025221/http://m.nautil.u...
This reminds me of how trajectory prediction networks for autonomous driving used to use a CNN to encode scene context (from map and object-detection rasters) until VectorNet showed up.
Exactly. The example the article gives of reducing resolution as a form of compression highlights the limitations of the visual-only proposal. Blurring text is a poor form of compression, preserving at most information about paragraph sizes. Summarizing early paragraphs (as context compression does in coding agents) would be much more efficient.
Another great example of this working is Google's genomic variant-calling model, DeepVariant. It uses the "alignment pile-up" images that humans also use to debug genomic alignments, with some additional channels as extra feature engineering for the CNN.
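Roughly the idea, as a hand-wavy NumPy sketch of a multi-channel pile-up tensor; the specific channels here are illustrative, not DeepVariant's exact encoding:

```python
import numpy as np

# A pile-up "image": rows = reads overlapping a candidate variant, cols = reference positions.
n_reads, window = 100, 221

# Each channel encodes one property per (read, position) cell (random data for illustration).
base_identity = np.random.randint(0, 5, size=(n_reads, window))    # A/C/G/T/gap
base_quality  = np.random.randint(0, 60, size=(n_reads, window))   # Phred-scaled quality
strand        = np.random.randint(0, 2, size=(n_reads, window))    # forward/reverse
matches_ref   = np.random.randint(0, 2, size=(n_reads, window))    # agrees with reference?

# Stack into a multi-channel image a CNN can consume like any other HxWxC input.
pileup = np.stack(
    [base_identity, base_quality, strand, matches_ref], axis=-1
).astype(np.float32)
print(pileup.shape)   # (100, 221, 4)
```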