Does anyone have a clue how far we are from having "LLMs for animals"? Even if we don't understand what the LLM is saying to a dolphin or a monkey, does it change much from feeding millions of texts to a model without ever explaining language to it as a prerequisite?
A predictive/generative model of animal "vocalizations" would be almost trivial to do with current speech or music generation models. And those could be conditioned with contextual information easily.
Wouldn't we need several hundred gigabytes of ingestible/structured contextual info for animal vocalizations in order to train a model with any accuracy? Even if we had it, seems to me the model would be able to tell us what sounds probably “should” follow those of a given recording, but not what they mean.
We could train a transformer that could predict the next token, whether it's the next sound from one animal or a sound from another animal replying to it. However, we wouldn't understand the majority of what it means, except for the most obvious sounds that we could derive from context and observation of behavior. This wouldn't result in a ChatGPT-like interface, as it is impossible for us to translate most of these sounds into a meaningful conversation with animals.
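To make the next-token idea concrete: a toy sketch (not anyone's actual system) of learning transition statistics over discretized vocalization tokens. A transformer at scale optimizes essentially this objective; here it's a bigram counter, and the token names are invented.

```python
from collections import Counter, defaultdict

# Hypothetical interleaved call sequence from two animals
sequence = ["chirp", "trill", "chirp", "trill", "click", "chirp", "trill"]

# Count bigram transitions between consecutive tokens
transitions = defaultdict(Counter)
for prev, nxt in zip(sequence, sequence[1:]):
    transitions[prev][nxt] += 1

def predict_next(token):
    """Return the most likely next token, or None if the token is unseen."""
    counts = transitions.get(token)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

print(predict_next("chirp"))  # prints "trill" for this toy sequence
```

The point stands either way: the model can tell you that "trill" tends to follow "chirp", but nothing about what either call means.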
Why not label a fine-tuning dataset with human descriptions based on video recordings? We explain in human language what they're doing, then tune the model on those pairs. It wouldn't need to be a very large dataset, but it would let a model translate directly from bird calls into human language.
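The labeling idea above might look something like this in practice: pair each recorded call with a human description derived from the video, in the (input, description) shape typical of fine-tuning datasets. Filenames and descriptions here are invented placeholders.

```python
import json

# Hypothetical labeled pairs: audio clip + human description from video
records = [
    {"audio": "clip_001.wav",
     "description": "two magpies calling while a hawk circles overhead"},
    {"audio": "clip_002.wav",
     "description": "parent bird calling chicks back to the nest at dusk"},
]

# Serialize as JSONL, a common fine-tuning input format
jsonl = "\n".join(json.dumps(r) for r in records)
print(jsonl)
```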
What if they just sit and talk? What would the description of that be? What if only part of the communication is relevant? What if it's not relevant at all because they were reacting to atmospheric changes? Or to electromagnetic signals that can't be observed on video? Or smell? Or sound outside the range of human hearing? What if the decision based on the communication is deferred? Etc., etc.
As I mentioned before, only the most obvious examples of behaviors and context can be translated into anything meaningful.
But then it's not a translation of the bird tweets, but more like a predictive mapping from tweets to behaviors.
Reminds me of Wittgenstein: "If a lion could talk, we could not understand him."
Something like this? https://search.acousticobservatory.org/
Generative models yes, since there are terabytes of audio available. High quality contextual info is much harder to obtain. It’s like saying that we could easily build a model for X if we had training data available.
With LLMs we can leverage human insight to e.g. caption or describe images (which was what made CLIP and successors possible). With animals we often have no idea beyond a location. There is work to include kinematic data with audio to try and associate movement with vocalisation but it’s early days.
https://cloud.google.com/blog/transform/can-generative-ai-he...
It's "almost trivial" and "easily" done; I only wonder why we aren't speaking to animals already.
Oh wait. Because the devil's in the details, the ones SW dev hubris glosses over ;) ;)
To clarify: I didn't mean a model that would "translate" animal sounds to some representation of language or meaning. I meant a model that would capture statistical regularities in animal sounds and perhaps be able to link these to contextual information (e.g. time of day, other animals around, season etc).
By almost trivial I mean it wouldn't require much new technology. Something like WaveNet or VQ-VAE could be applied almost out of the box.
Data availability may be a significant problem, but there are some huge animal sound datasets. E.g. https://blog.google/intl/en-au/company-news/technology/a2o-s...
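A minimal sketch of the "almost out of the box" pipeline: quantize audio frames against a codebook (the VQ step in a VQ-VAE-style model), yielding the discrete tokens a sequence model would then learn regularities over. Real systems learn the codebook from data; here it is random, and the "audio" features are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 8))    # 100 frames of 8-dim audio features
codebook = rng.normal(size=(16, 8))   # 16 code vectors (learned in practice)

# Assign each frame to its nearest code vector -> discrete token IDs
dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
tokens = dists.argmin(axis=1)

print(tokens.shape)  # one token ID per frame
```

Contextual conditioning (time of day, season, nearby species) would then just be extra tokens prepended to these sequences.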
Someone already mentioned Aza Raskin, but the organisation you should look up is Earth Species Project. It’s a fairly open question and fairly philosophical - do the semantics of language transcend species? Certainly there is evidence that “concepts” are somewhat language agnostic in LLM embedding spaces.
https://www.earthspecies.org/about-us#team
Captivating watch from Aza Raskin on the subject:
https://youtu.be/3tUXbbbMhvk
I had the pleasure of hanging out with him at Stochastic Labs in 2018 while he was working on this, and I was working on 3D fractal stuff there. Pretty fun place, and was my first time living in the US.
At the time it seemed a bit wild / long shot, but now he just looks like a pioneer.
Presumably anyone with a multimodal transformer already pretrained on Human data could be further pretrained on animal vocalizations. I don't know whether any of the large model owners are doing this.