Based on that list, it seems to boil down to two things:

- cost (no longer a problem)

- too much code needed and it bloats the data pipelines. Does anyone have any actual evidence of this being the case? Like yes, code would be needed, but why is that innately a bad thing? "Bloated data pipelines" feels like another hand-wave when I think if you do it right it's fine. As proven by Waymo.

Really curious if any Tesla engineers feel like this is still the best way forward or if it's just a matter of having to listen to the big guy, Musk.

I’ve always felt that relying on vision only would be a detriment because even humans with good vision get into circumstances where they get hurt because of temporary vision hindrances. Think heavy snow, heavy rain, heavy fog, or even just cresting a hill at a certain time of day when the sun flashes you.

Just for the record though, Musk isn't blindly anti-LIDAR. He has said (and I think this is an objective fact) that all existing roads and driving are based on vision (which is what all humans do). So that should technically be sufficient. SpaceX uses LIDAR for their docking systems.

I would argue that yes, we do use vision but we get that "lidar depth" from our stereo vision. And that used to be why I thought cameras weren't enough.
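The "lidar depth" from stereo vision can be sketched with the standard disparity relation for a calibrated camera pair. A minimal illustration (the focal length, baseline, and disparity values below are made up for the example, not taken from any real rig):

```python
# Depth from stereo disparity: Z = f * B / d
#   f: focal length in pixels, B: baseline in meters, d: disparity in pixels
# All numbers here are illustrative assumptions, not real camera specs.

def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Metric depth of a point seen by a calibrated stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# 1000 px focal length, 30 cm baseline, 10 px disparity -> 30 m away.
print(depth_from_disparity(1000.0, 0.3, 10.0))  # 30.0

# Depth resolution collapses with range: a single pixel of disparity error
# near d = 10 px shifts the estimate by over 3 m.
print(depth_from_disparity(1000.0, 0.3, 9.0))
```

The second call is the catch: with a human-eye-scale baseline, disparity shrinks so fast with distance that stereo depth becomes unreliable well within driving-relevant ranges.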

But then look at all the work with gaussian splatting (where you can take multiple 2d samples and build a 3d world out of it). So you could probably get 80% there with just that.

The ethos of many Musk companies (you'll hear this from many engineers that work there) is simplify, simplify, simplify. If something isn't needed, take it out. Question everything that might be needed.

To me, LIDAR is just one of those things in that general pattern of "if it isn't absolutely needed, take it out" – and the fact that FSD works so well without it proves that it isn't required. It's probably a nice to have, but maybe not required.

Humans aren't using only fixed vision for driving. This is such a tiresome thing to see repeated in every discussion about self driving.

You're listening to the road and car sounds around you. You're feeling vibration from the road. You're feeling feedback through the steering wheel. You're using a combination of monocular and binocular depth perception - plus, your eyes are not fixed-focal-length "cameras". You're moving your head to change the perspective you see the road at. Your inner ear is telling you about your acceleration and orientation.

And also, even with the suite of sensors that humans have, their vision perception is frequently inadequate and leads to crashes. If vision was good enough, "SMIDSY" wouldn't be such an infamous acronym in vehicle injury cases.

For those of us not aware of Australian cycling jargon, "SMIDSY" means "Sorry, Mate, I Didn't See You".

The issue is clearly attention, not vision, when it comes to humans. If we could actually process 100% of the visual information in our field of view, then accidents would probably go down a shitload.

Attention is perhaps the limiting factor, but being able to look in two directions at once would help, and would help greatly if we had more attention capacity. E.g. anytime you change lanes you have to alternate between looking behind, beside, and in front and that greatly reduces reaction time should something unexpected happen in the direction you aren't currently looking...

Humans have both issues. There are many human failures which are distinctly a vision issue and not attention related, e.g. misestimation of depth/speed, obscured or obstructed vision, optical focus issues, insufficient contrast or exposure, etc.

But how many of those crashes not caused by inattention could have been avoided with less idiocy and more defensive driving? I mean, yes, we can’t see as well in fog, but that’s why you should slow down.

Again, I'm still not saying that humans don't make bad decisions. I'm saying that, unequivocally, they also get into accidents while paying attention and being careful, as a result of misinterpretation or failure of their senses. These accidents are also common, for example:

* someone parking carefully misjudges depth perception and bumps an object

* person driving at night whose eyes fail to perceive a poorly lit feature of the road, markings, or obstacles

* person driving who is suddenly blinded by a bright object (the sun, bright lights at night)

* person pulling out in traffic who misinterprets their depth perception and therefore misjudges the speed of approaching traffic

* people can only focus their eyes at one distance at a time, and it takes time to refocus at a different distance. It is neither unsafe nor unexpected for humans to check their instruments while driving -- but the human eye can take hundreds of milliseconds to refocus under normal circumstances. Look down, focus, look back up, and focus again, as quickly as you can, and at highway speeds you will have travelled quite a long distance.

These types of failures can happen not as a result of poor decision-making, but of poor perception.
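The refocus-delay point is easy to put numbers on. A back-of-the-envelope sketch (the half-second total refocus time is an illustrative assumption, not a measured value):

```python
# Distance covered while the eyes refocus twice (instruments -> road).
# The 0.5 s total refocus time is an assumption for illustration only.

def distance_travelled(speed_kmh: float, seconds: float) -> float:
    """Meters covered at a constant speed during the given time."""
    return speed_kmh / 3.6 * seconds

# At 120 km/h, half a second of refocusing covers roughly 17 m --
# several car lengths with effectively no forward perception.
print(round(distance_travelled(120.0, 0.5), 1))  # 16.7
```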

In theory, a computer should be able to do the same. It could do sensor fusion with even more sense modalities than we have. It could have an array of cameras and potentially out-do our stereo vision, or perhaps even use some lightfield magic to (virtually) analyze the same scene with multiple optical paths.

However, there is also a lot of interaction between our perceptual system and cognition. Just for depth perception, we're doing a lot of temporal analysis. We track moving objects and infer distance from assumptions about scale and object permanence. We don't just repeatedly make depth maps from 2D imagery.

The brute-force approach is something like training vision-language models (VLMs). E.g. you could train on lots of movies and be able to predict "what happens next" in the image domain.

But, compared to LLMs, there is a bigger gap between the model and the application domain with VLMs. It may seem like LLMs are being applied to lots of domains, but most are just tiny variations on the same task of "writing what comes next", which is exactly what they were trained on. Unfortunately, driving is not "painting what comes next" in the same way as all these LLM writing hacks. There is still a big gap between that predictive layer, planning, and executing. Our giant corpus of movies does not really provide the ready-made training data to go after those bigger problems.

Putting your point another way, in order to replicate an average human driver’s competence you would need to make several strong advancements in the state of the art in computer vision _and_ digital optics.

In India (among other places), honking is essential to reducing crashes.

We often greatly underestimate and undervalue the role of our ears relative to vision. As my film-director friend says, 80% of the impact of a movie is in the sound.

The day a Waymo can functionally navigate the streets of Mumbai is when we really have achieved L5.

Most of what you said has nothing to do with lidar vs. cameras.

Beyond 20 meters, motion-based depth cues are more accurate than stereoscopic vision anyway. What is lidar helping to solve here?

Waymo claims its system, which uses a combination of LIDAR & vision, resolves objects up to 500 meters away

https://waymo.com/blog/2024/08/meet-the-6th-generation-waymo...

This company claims their LIDAR works conservatively at 250 m, and at up to 750 m depending on reflectivity.

https://www.cepton.com/driving-lidar/reading-lidar-specs-par...

> So that should technically be sufficient

Sufficient to build something close to human performance. But self driving cars will be held to a much higher standard by society. A standard only achievable by having sensors like LiDAR.

If a self-driving car had the exact vision of humans, it would still be better, because it has better reaction times. Never mind the fact that humans can't actually process all the visual information in our field of view, because we don't have the broad attention to do that. It's very obvious that you can get superhuman performance with just cameras.

Whether that's worth completely throwing away LiDAR is a different question, but your argument is just obviously false.

This reminds me of the time I was distantly following a Waymo car at speed on 101 in Mountain View during rush hour. The Waymo brake lights came on first followed a second or two later by the rest of the traffic.

Better reaction times only matter if the decisions are the same / better in every case. Clearly we are not there on that aspect of it yet.

Deciding to crash faster, or "tell the human to take over" really fast, is NOT better.

Even if they weren’t going to be held to a higher standard for widespread acceptance, tens of thousands of people a year in the US die due to humans driving badly. Why would we not try to do better than that?

Because that's an acceptable loss and better costs more!

LIDAR also struggles in heavy rain, snow, fog, and dust. Check how Waymo handles such conditions.

It's not only failing to detect; it's also causing false positives.

Sufficient if all else were equal. But the human brain and artificial neural networks are clearly not equal. This is setting aside the whole question of whether we hope to equal human performance or exceed it.

Teslas have at least 3 forward-facing cameras, giving them plenty of depth-vision data.

They also have several cameras all around providing constant 360° vision.

To do gaussian splatting anywhere near real time, you need good depth data to initialize the gaussian positions. This can of course come from monocular depth, but then you are back to monocular depth vs. lidar.

Mentioning gaussian splatting as a reason we don't need lidar depth is a great example of Musk-esque technobabble: seemingly correct at the surface level, but nonsense to any practitioner. One of the biggest problems of all SfM techniques is that the results are scale-ambiguous, so they do not in fact recover the crucial real-world depth measurement you get from lidar.
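The scale ambiguity is easy to demonstrate with a toy pinhole camera model: scale the entire scene and the camera baseline by the same factor, and every image projection is unchanged, so no amount of multi-view geometry alone can tell you the metric scale. A minimal numpy sketch (the scene points, baseline, and focal length are arbitrary illustrative values):

```python
import numpy as np

# Scale ambiguity in structure-from-motion: multiplying the whole scene AND
# the camera baseline by the same factor s leaves every projection identical,
# so monocular/multi-view geometry cannot recover metric depth by itself.

def project(points, cam_t, f=1000.0):
    """Pinhole projection of 3D points for a camera translated by cam_t."""
    p = points - cam_t                # world -> camera coordinates
    return f * p[:, :2] / p[:, 2:3]  # perspective divide

rng = np.random.default_rng(0)
points = rng.uniform([-2, -2, 5], [2, 2, 20], size=(10, 3))  # arbitrary scene
baseline = np.array([0.3, 0.0, 0.0])                          # second camera pose

for s in (1.0, 7.0):  # scale the entire reconstruction by s
    u0 = project(points * s, np.zeros(3))
    u1 = project(points * s, baseline * s)
    # u0 and u1 are identical for every s: the images cannot distinguish a
    # small, close scene from a large, far one.
    print(s, u0[0], u1[0])
```

Lidar breaks this ambiguity because each return is a direct time-of-flight measurement in meters; camera geometry can only pin down scale via an external reference (a known baseline, IMU/odometry, or learned priors).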

Now you might say "use a depth model to estimate metric depth", and I think if you spend 5 minutes thinking about why a magic math box that pretends to recover real depth from a single 2D image is a very, very sketchy proposition when you need it to be correct for emergency braking (versus some TikTok bokeh filter), you will see that also doesn't get you far.

This is not really true if you have multiple cameras with a known baseline, or well-known motion characteristics like you get from an accelerometer + wheel-speed sensing.

Why is this getting downvoted? It's good faith and probably more accurate than not.

> and the fact that FSD works so well without it proves that it isn't required

The reports that Tesla submits on Austin Robotaxis include several of them hitting fixed objects. This is the same behavior that has been reported for prior versions of their software, of Teslas not seeing objects, including in the incident for which they had a $250M verdict against them reaffirmed this past week. That this is occurring in an extensively mapped environment and with a safety driver on board leads me to the opposite conclusion from the one you have reached.

If Waymo has proven their model works, why is the silly automaker doing several orders of magnitude more autonomous miles?

They aren't. Tesla has logged some 800k total miles with their robotaxi vehicles, including miles with safety drivers. Waymo has logged 200M driverless miles. That's 0.4% of the mileage, with the most generous possible framing.

My understanding is that there's more data processing required with cameras because you need to estimate distance from stereoscopic vision. And as it happens, the required chips for that have shot up in price because of the AI boom.

But I think costs were just part of the reason why Elon decided against Lidar. Apparently, they interfere with each other once the market saturates and you have many such cars on the same streets at the same time. Haven't heard yet how the Lidar proponents are planning to address that.

How does Waymo handle it now? There are many videos of Waymo depots with dozens of cars not running into each other.

Lidar critics like to pretend that anti-collision is not a well-studied branch of Computer Science and telecoms. Wifi, Ethernet and cellphones all work well simultaneously, despite participants all sharing the same physical medium.
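To make the analogy concrete, here is the classic Ethernet-style binary exponential backoff that lets many senders share one medium. This is a simplified simulation of the networking technique only, not how lidar interference mitigation actually works (real lidars use things like pulse coding and time or wavelength separation), and the slot counts are illustrative:

```python
import random

# Simplified CSMA/CD-style binary exponential backoff: after the k-th
# collision, each sender waits a random number of slots in [0, 2^k - 1].
# Two senders that keep colliding will, with high probability, quickly
# pick different slots and stop interfering with each other.

def backoff_slots(collision_count: int, rng: random.Random) -> int:
    """Pick a random wait in [0, 2^k - 1] slots, capped at k = 10."""
    k = min(collision_count, 10)
    return rng.randrange(2 ** k)

def rounds_until_separation(rng: random.Random) -> int:
    """Count collision rounds until two contending senders pick different slots."""
    collisions = 0
    while True:
        collisions += 1
        if backoff_slots(collisions, rng) != backoff_slots(collisions, rng):
            return collisions

rng = random.Random(42)
trials = [rounds_until_separation(rng) for _ in range(1000)]
print(max(trials))  # contention almost always resolves within a few rounds
```

The doubling window is the whole trick: each round of conflict makes the chance of another conflict half as likely, so even a crowded medium settles fast. The same family of ideas (randomization plus per-sender coding) is what makes shared-spectrum systems workable.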