Natural language wasn't solved by brute force until we got to trillion-parameter models trained on essentially the whole internet, plus every book and article ever published.
I don't know of anyone spending tens of billions on this problem the way Microsoft did for OpenAI. First you'd have to build a dataset of trillions of token equivalents for motion, and what that even looks like is largely guesswork. Then you'd need a supercomputer to scale the current state-of-the-art motion model to 100 times the size of today's biggest one. Then you'd have to pretrain and finetune the result.
If, after all that, dexterity still isn't solved, all we can say is that we need more data and bigger models.
People seriously don't grasp just how big "big data" for AI is, or what a moonshot GPT-3 and GPT-4 were.
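For a sense of scale, here's a quick back-of-envelope sketch using the standard C ≈ 6·N·D approximation for training compute (FLOPs ≈ 6 × parameters × training tokens). The GPT-3 numbers are the published ones; the motion-model numbers are pure guesses on my part:

```python
# Back-of-envelope training-compute estimate via C ≈ 6 * N * D.
# GPT-3 figures are from Brown et al. (2020); the motion-model
# figures are illustrative assumptions, not measurements.

def train_flops(params: float, tokens: float) -> float:
    """Approximate training compute in FLOPs: 6 * parameters * tokens."""
    return 6 * params * tokens

# GPT-3: ~175B parameters trained on ~300B tokens.
gpt3 = train_flops(175e9, 300e9)   # ~3.15e23 FLOPs

# Hypothetical motion model: 100x today's largest motion model
# (assume ~1B params today -> 100B) on trillions of "motion
# tokens" (assume 5T). Both numbers are guesses; that's the point.
motion = train_flops(100e9, 5e12)  # ~3.0e24 FLOPs

print(f"GPT-3:        {gpt3:.2e} FLOPs")
print(f"Motion model: {motion:.2e} FLOPs ({motion / gpt3:.0f}x GPT-3)")
```

Even with fairly conservative guesses, the hypothetical motion run comes out roughly 10x GPT-3's training compute, before anyone has even settled on what a "motion token" looks like.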
Tesla's approach is "start with motion-capture data, move on to first-person-view video demonstrations, then move on to any video demonstrations at all, i.e. feed YouTube into the system and hope it learns something from that" (roughly the curriculum I sketch at the bottom of this comment).
And that led to a car that kills people.
Meanwhile, the companies that actually have cars moving around safely all use a very diverse mix of hand-engineered and ML-trained components. Completely in the face of the Bitter Lesson.
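For what it's worth, the staged curriculum I attributed to Tesla above boils down to sequential fine-tuning over increasingly noisy data sources. A minimal sketch of that idea; the stage names, datasets, and train() placeholder are all my own illustration, not anything Tesla has published:

```python
# Hypothetical sketch of a staged training curriculum:
# sequential fine-tuning on increasingly noisy data sources.
# All names here are illustrative placeholders.

from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    dataset: str   # placeholder for a real data loader
    epochs: int

CURRICULUM = [
    Stage("mocap",     "motion_capture_demos", epochs=10),  # clean, labeled
    Stage("fpv_video", "first_person_demos",   epochs=5),   # noisier
    Stage("web_video", "youtube_scrape",       epochs=1),   # noisiest, biggest
]

def train(model, dataset: str, epochs: int) -> None:
    """Placeholder for an actual fine-tuning loop."""
    print(f"fine-tuning on {dataset} for {epochs} epoch(s)")

model = object()  # stand-in for a real policy network
for stage in CURRICULUM:
    train(model, stage.dataset, stage.epochs)
```

The whole bet is that the later, noisier stages transfer useful structure despite having no action labels at all, which is exactly the part that remains unproven.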