Two things: 1) we have abundant training data for humanoid embodiments (watch humans do things), and 2) the world is already designed for humans.

#1 is the main reason. There is basically unlimited data of humans doing things with their bodies, and it's also the easiest data to collect (just tell a human to do something).