I thought about this and I think it boils down to how the model is trained.
Tesla trains its models from actual drivers, purely based on vision (input) and actuators (output): brake, steering, accelerator.
The human's output is based on what they see, which is also what the camera sees. So it's a 1:1 match.
If Waymo were to do that, it would muddle the training set. The lidar input may override the camera input, even though the human driver never had the lidar data.
I always struggled with Musk's claim that lidar makes things ambiguous. It didn't make any sense to me why having a secondary fallback sensor would mess things up. But if you put it in the context of training data, it absolutely makes sense.
This is an interesting viewpoint, but isn't it also solvable?
Just because the human in the scenario only took vision as input, why does that matter to the training data and the model? The actions are the same.
To put it another way, what about all the cultural context the human had, or the sounds, smells, past experiences at the same intersection, etc? Even Tesla can't record this, but I'm not sure that matters.
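A rough sketch of what I mean, in Python. Everything here (the `Frame` record, `make_example`, the field names) is made up for illustration and is not how Tesla or Waymo actually structure their data. The only point is that the label, i.e. what the driver did, comes out identical whether or not lidar is part of the logged input.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Frame:
    """One logged timestep. All fields are hypothetical stand-ins."""
    camera: List[float]            # stand-in for image features
    lidar: Optional[List[float]]   # stand-in for lidar returns, may be absent
    steering: float                # what the human driver actually did
    brake: float
    accel: float


def make_example(frame: Frame, use_lidar: bool) -> Tuple[List[float], Tuple[float, float, float]]:
    """Build an (input, label) pair for imitation-style training."""
    inputs = list(frame.camera)
    if use_lidar and frame.lidar is not None:
        inputs += frame.lidar
    # The label depends only on the driver's actions, not on the sensor set.
    label = (frame.steering, frame.brake, frame.accel)
    return inputs, label


frame = Frame(camera=[0.2, 0.7], lidar=[12.5], steering=-0.1, brake=0.0, accel=0.3)
x_cam, y_cam = make_example(frame, use_lidar=False)
x_both, y_both = make_example(frame, use_lidar=True)
```

Here `y_cam` and `y_both` are the same tuple; only the input side changes. Whether a model trained on the richer input learns something the human could not have acted on is a separate question, and this sketch doesn't settle it.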
The biggest issue with using both camera and lidar is how to properly resolve conflicting returns from different sensor types.
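For what it's worth, here is a minimal sketch of one textbook way to handle that, assuming each sensor gives a distance estimate plus a variance: inverse-variance weighting with a disagreement gate. The function name `fuse_range`, the 3-sigma gate, and the "trust the lower-variance sensor on conflict" policy are my own assumptions for illustration, not anything a particular company is known to use.

```python
from typing import Tuple


def fuse_range(cam_m: float, cam_var: float,
               lidar_m: float, lidar_var: float,
               gate_sigma: float = 3.0) -> Tuple[float, float, bool]:
    """Fuse two distance estimates (meters) with known variances.

    Returns (estimate, variance, conflict_flag).
    """
    diff = abs(cam_m - lidar_m)
    combined_sigma = (cam_var + lidar_var) ** 0.5
    # Readings further apart than the gate are treated as conflicting.
    if diff > gate_sigma * combined_sigma:
        # Assumed policy: keep the more confident sensor and flag the conflict.
        if cam_var <= lidar_var:
            return cam_m, cam_var, True
        return lidar_m, lidar_var, True
    # Otherwise combine, weighting each reading by the inverse of its variance.
    w_cam = 1.0 / cam_var
    w_lidar = 1.0 / lidar_var
    estimate = (w_cam * cam_m + w_lidar * lidar_m) / (w_cam + w_lidar)
    variance = 1.0 / (w_cam + w_lidar)
    return estimate, variance, False


agree = fuse_range(20.0, 4.0, 21.0, 1.0)      # readings roughly agree
disagree = fuse_range(20.0, 4.0, 40.0, 1.0)   # readings far apart
```

The hard part in practice is the conflict branch: picking which sensor to believe is exactly the judgment call being argued about here, and a fixed rule like this one will be wrong in some scenes.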