I'm sure the model is capable, but I find it funny that the sample image containing three bears gets detected as two elephants.

It’s an accurate representation of the model’s capabilities, in my experience.