> such that they can use the right set of sensors in the right environmental conditions

Because this part is really hard, and that's why Tesla abandoned the fusion approach. You cannot possibly foresee all the conditions in which LIDAR or any active sensor will malfunction, return wrong data, or return data that's only slightly off that ONE specific time. And even if it doesn't, you need to trust it not to return noise. And when it does return noise, how do you classify it as noise?

Cameras are passive sensors - they get whatever light comes in and turn it into an image. If the camera is capturing shapes that make sense to the neural nets, it's working. If the frame is all black, all white, all red, or has no recognizable shapes, the camera is not working: exclude it from the currently used set of sensors or weigh it less when making decisions, because it's returning no signal (and yes, neural nets have their own set of problems).
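
To make the "exclude it or weigh it less" idea concrete, here's a rough sketch. It is not how Tesla or anyone else actually does it; `camera_weight`, `fuse`, and the `min_contrast` threshold are names and numbers I made up for illustration, and a real system would judge frame health with far more than pixel contrast.

```python
from statistics import pstdev

def camera_weight(pixels, min_contrast=5.0):
    """Trust weight in [0, 1] for a grayscale frame (flat list of 0-255 values).

    Assumption: a frame with almost no contrast (all black, all white,
    one solid color) carries no usable signal.
    """
    if not pixels:
        return 0.0
    contrast = pstdev(pixels)  # population standard deviation of intensities
    return min(1.0, contrast / min_contrast)

def fuse(estimates):
    """Weighted average of (value, weight) pairs; None if nothing is trusted."""
    total = sum(weight for _, weight in estimates)
    if total == 0:
        return None
    return sum(value * weight for value, weight in estimates) / total

# A blown-out frame gets weight 0, so its distance estimate is ignored.
blank_frame = [255] * 100
good_frame = [0, 255] * 50
distance = fuse([
    (12.0, camera_weight(good_frame)),   # healthy camera
    (40.0, camera_weight(blank_frame)),  # saturated camera
])
```

The point is only that a passive sensor tends to fail in a way you can detect from its own output, which makes the weighting step cheap.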

EDIT: cameras also provide more continuous context: if 1 pixel is off, say clearly bright red in a mostly-green scene where no poles can be identified, the neural net will average it out and discard it as noise. If 1 point says "object" in LIDAR, do you trust it to be correct? Perhaps the ray just hit a bird or a fly, but you only see a point; it's a lossy summary of the information you need.

But why can't you apply all that same logic and processing to LIDAR as well? Maybe we're not there yet, but what about in 5-10 years when we are?

There is noise on LIDAR returns too. No one considers a single LIDAR point to be a collision hazard.
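
As a toy illustration of why a lone point gets ignored (my own sketch, not any production pipeline): keep a point only if it has enough neighbors nearby. `drop_isolated_points`, `radius`, and `min_neighbors` are hypothetical names and values, and the brute-force O(n²) loop is only there for readability.

```python
from math import dist

def drop_isolated_points(points, radius=0.5, min_neighbors=2):
    """Keep only points with at least `min_neighbors` other points within `radius`.

    `points` is a list of (x, y, z) tuples in meters (assumed units).
    """
    kept = []
    for i, p in enumerate(points):
        neighbors = sum(
            1
            for j, q in enumerate(points)
            if i != j and dist(p, q) <= radius
        )
        if neighbors >= min_neighbors:
            kept.append(p)
    return kept

# Three returns off the same surface survive; the lone "bird" point is dropped.
cluster = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0)]
stray = (10.0, 10.0, 10.0)
filtered = drop_isolated_points(cluster + [stray])
```

Only what survives a filter like this would be passed on to anything that decides whether there is an obstacle.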