Self-supplied metadata in music catalogs is notoriously shit. The degree to which most rights owners don't give a damn is telling.
Spotify's own metadata is not particularly sophisticated: "Valence", "Energy", "Danceability", etc. You can see from a mile away that these are names assigned to PCA axes, and they correspond pretty poorly to musical concepts because whatever was analyzed isn't nicely linearly separable.
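To make the PCA point concrete, here is a rough sketch on purely synthetic data. The descriptor names (`tempo`, `loudness`, and so on) are made up for illustration, and nothing here is meant to reflect how Spotify actually derives its features. It only shows that each principal axis is a linear blend of several inputs, which is why slapping a single musical label on one is a stretch.

```python
import numpy as np

def pca_axes(features, n_components=2):
    """Return (principal axes, explained-variance shares) for a samples-by-features array."""
    centered = features - features.mean(axis=0)
    # SVD of the centered data: the rows of vt are the principal axes.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    variance = singular_values ** 2
    ratio = variance / variance.sum()
    return vt[:n_components], ratio[:n_components]

# Hypothetical low-level descriptors; the data is synthetic, not real audio analysis.
names = ["tempo", "loudness", "spectral_centroid", "onset_rate"]
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))      # two hidden factors
mixing = rng.normal(size=(2, 4))        # how the factors blend into descriptors
features = latent @ mixing + 0.1 * rng.normal(size=(500, 4))

components, ratio = pca_axes(features)
for axis, share in zip(components, ratio):
    loadings = ", ".join(f"{n}={w:+.2f}" for n, w in zip(names, axis))
    print(f"{share:.0%} of variance: {loadings}")
```

Each printed axis mixes all four descriptors with different weights, so any one-word name for it is an interpretation added after the fact.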
Especially since they scraped Spotify's popularity rating as well.
I can't think of many situations where the popularity rating would be particularly valuable, considering it favours recent plays and the cutoff date is already almost half a year old.
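As a toy illustration of why a recency-weighted score goes stale quickly: the sketch below assumes a simple exponential decay with a 30-day half-life. Both the formula and the half-life are assumptions for the sake of the example; Spotify does not publish how its popularity number is computed.

```python
import math

def recency_weighted_score(play_ages_days, half_life_days=30.0):
    """Sum of plays, each discounted by how many days ago it happened.

    Hypothetical scoring rule: a play loses half its weight every half_life_days.
    """
    decay = math.log(2) / half_life_days
    return sum(math.exp(-decay * age) for age in play_ages_days)

# 1000 plays from about half a year ago vs. 100 plays from last week.
old_hit = recency_weighted_score([180] * 1000)
new_track = recency_weighted_score([5] * 100)
print(f"old hit: {old_hit:.1f}, new track: {new_track:.1f}")
```

Under these assumptions the newer track scores well above the older one despite having a tenth of the plays, so a snapshot of the score mostly tells you what was being played around the cutoff date.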
It helps as a training signal for an algorithm that figures out which music is popular.
If those are all the issues with the dataset, it is probably far and away the best dataset any researcher has ever used.