To clarify: I didn't mean a model that would "translate" animal sounds to some representation of language or meaning. I meant a model that would capture statistical regularities in animal sounds and perhaps be able to link these to contextual information (e.g. time of day, other animals around, season, etc.).

By "almost trivial" I mean it wouldn't require much new technology. Something like WaveNet or VQ-VAE could be applied almost out of the box.
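
To make that a bit more concrete, here is a rough sketch of the kind of thing I have in mind. It is not a VQ-VAE, only the nearest-codebook lookup at its core, followed by a count of how often each discrete sound token shows up in each context. The codebook, the 2-D "frames", and the day/night labels are all made up for illustration; in a real setup the encoder and codebook would be learned from audio, and the function names here are my own.

```python
import numpy as np

def quantize(frames, codebook):
    """Map each feature frame to the index of its nearest codebook vector."""
    # Squared Euclidean distance from every frame to every code,
    # giving an array of shape (n_frames, n_codes).
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def token_context_counts(tokens, contexts, n_codes, n_contexts):
    """Count how often each token co-occurs with each context label."""
    counts = np.zeros((n_codes, n_contexts), dtype=int)
    # np.add.at accumulates correctly even when index pairs repeat.
    np.add.at(counts, (tokens, contexts), 1)
    return counts

# Toy data (assumed, not from any real dataset).
codebook = np.array([[0.0, 0.0], [1.0, 1.0]])          # 2 hypothetical codes
frames = np.array([[0.1, -0.1], [0.9, 1.2],
                   [1.1, 0.8], [0.0, 0.2]])            # 4 fake feature frames
contexts = np.array([0, 1, 1, 0])                      # 0 = day, 1 = night

tokens = quantize(frames, codebook)
counts = token_context_counts(tokens, contexts, n_codes=2, n_contexts=2)
print(tokens)
print(counts)
```

A table like `counts` is about the simplest possible way to link sound tokens to context; anything fancier (conditioning the model on context, for instance) would build on the same discrete tokens.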

Data availability may be a significant problem, but there are some huge animal sound datasets, e.g. https://blog.google/intl/en-au/company-news/technology/a2o-s...