This is cool, but they are still not going about it the right way.

It's much easier to build everything into the compressed latent space of physical objects and how they move, and operate from there.

Everyone jumped on the end-to-end bandwagon, which locks you into vision being the input to your driving model, which in turn means you need things like Genie to generate vision data, which is wasteful.

This is cool, but they are still not going about it the right way.

This is legit hilarious to read from some random HN account.

I posted this before, but I'll post again - this is one of the few things I feel confident enough to say that most people in the space are doing wrong. You can save my post and reference it when we actually get full self driving (i.e. you can take a nap in the backseat while your car drives you), because it's going to be implemented pretty much like this:

Humans don't drive well because we map vision directly to actions. We drive well (and in general, manipulate physical objects well) because we can run simulations inside our heads to predict what the outcome will be. We aren't burdened by an inability to recognize certain things - when something is in the road, no matter what it is, we automatically predict that we would likely collide with it, because we understand the concept of 3d space and moving within it, and we take appropriate action. Sure, there is some level of direct mapping, since many people can drive while "spaced out", but attentive driving mostly involves the above.

The self driving system that can actually self drive needs to do the same. When you have this, you no longer need to do things like simulate driving conditions in a computationally expensive sim, and you aren't going to be concerned with training the model on edge cases. All you need to do is ensure that your sensor processing results in a 3d representation of the driving conditions; the model will then be able to do what humans do - explore a latent space of things it can do, predict the outcomes, and choose the best one.

You want proof? It exists in the form of MuZero, and it worked amazingly well. Driving can easily be reformulated as a game that the engine plays in a simulator that doesn't involve vision, learning both the available moves and the optimal policy.
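To make the analogy concrete, here is a minimal sketch of that idea: rollouts in a tiny hand-coded "driving game", scored and searched to pick the best action. Everything here (the 1-D state, the stop-line penalty, the random-rollout search standing in for full MCTS) is a made-up illustration, not MuZero itself:

```python
import random

# Toy 1-D "driving game": state = (position, speed); actions adjust speed.
# Random rollouts stand in for MuZero's learned-model MCTS (illustrative only).
ACTIONS = [-1.0, 0.0, 1.0]  # brake, coast, accelerate

def step(state, action):
    pos, speed = state
    speed = max(0.0, speed + action)
    return (pos + speed, speed)

def reward(state):
    pos, speed = state
    # Crossing the "stop line" at pos >= 10 while still moving is a crash.
    return -100.0 if pos >= 10 and speed > 0 else speed  # otherwise reward progress

def rollout_value(state, depth=5):
    # Value estimate: play `depth` random moves and sum the rewards.
    total = 0.0
    for _ in range(depth):
        state = step(state, random.choice(ACTIONS))
        total += reward(state)
    return total

def plan(state, n_rollouts=200):
    # One-step lookahead plus random rollouts: a crude stand-in for MCTS.
    def value(action):
        nxt = step(state, action)
        return reward(nxt) + sum(rollout_value(nxt) for _ in range(n_rollouts)) / n_rollouts
    return max(ACTIONS, key=value)
```

Approaching the stop line too fast, the planner discovers braking on its own, purely from simulated outcomes - no vision anywhere in the loop.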

The reason everyone is doing end to end today is that they are basically trying to catch up to Tesla, and from a business perspective, nobody is willing to put up money and pay smart enough people to research this - especially because there is also a legal bridge to cross when it comes to proving that the system can self drive while you're napping. But nevertheless, if you ever want self driving, this is the right approach.

Meanwhile Google, which came up with MuZero, is now doing more advanced robotics work than anyone out there.

when we actually get full self driving (i.e you can take a nap in the backseat while your car drives you)

What on earth? We already have this, it’s called Waymo. And the idea that they’re trying to catch up to Tesla is laughable.

The article is about using the world model to generate simulations, not for controlling the vehicle.

They form the control policy directly from vision data, which is why they need a massive model to generate simulated vision data.

System architecture aside, how else would you test end to end behavior other than starting from sensor inputs?

My other comment in the thread explains it.

Basically, driving policy needs to be an MCTS-style search over a space that represents physical objects.

If I were to build a self driving system here is how I would do it:

* Define a 3d representation of the physical space around the car and how it evolves - basically a very compressed simulator that takes initial conditions as input and then predicts the evolution of the scene. The big difference here is that you would be hand-coding this sim (i.e. not training it), because you would be defining rules for things like collisions. You can also conveniently integrate your car's controls into this sim, with motion based on the tire behavior that results when you turn the steering wheel.

* Build probabilistic behavior models for other objects (i.e. cars/pedestrians) from real world driving data. Given a span of driving, these essentially represent the probability of what a human pedestrian or a human driver would do.

* On the sensor side, you would train models to take lidar/camera data and create the initial conditions for the sim. Things like big trucks would map to big trucks with a lot of mass and inertia, obstructions on the road would essentially be "walls" that you cannot hit, and traffic control objects would be "soft" boundaries.

* On the driver side, you would train something like MuZero to essentially play the driving game within the sim, building the prediction model at training time and running MCTS at inference time to choose the optimal policy. Scoring would be based on things like obeying traffic control signals, not hitting things, minimizing traffic disturbance, following the GPS route, and so on.
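The steps above can be wired together in a toy sketch: a hand-coded kinematic sim for the ego car, hard collision "walls", a soft route-following score, and random-shooting rollouts standing in for MCTS. All constants here (wheelbase, step size, action sets, penalty weights) are invented for illustration:

```python
import math
import random

DT = 0.1  # simulation step, seconds

def ego_step(x, y, heading, speed, steer, accel):
    # Kinematic bicycle model with a 2.5 m wheelbase (hand-coded, not learned).
    speed = max(0.0, speed + accel * DT)
    heading += speed * math.tan(steer) / 2.5 * DT
    return x + speed * math.cos(heading) * DT, y + speed * math.sin(heading) * DT, heading, speed

def collides(ego_xy, obstacles, radius=2.0):
    # Hard "wall" rule: obstacles are anonymous discs you must not touch.
    return any(math.dist(ego_xy, ob) < radius for ob in obstacles)

def score_rollout(state, actions, obstacles, goal):
    x, y, heading, speed = state
    total = 0.0
    for steer, accel in actions:
        x, y, heading, speed = ego_step(x, y, heading, speed, steer, accel)
        if collides((x, y), obstacles):
            return -1e6                       # never hit things
        total -= math.dist((x, y), goal)      # soft score: follow the route
    return total

def choose_plan(state, obstacles, goal, n_plans=300, horizon=10):
    # Random-shooting planner standing in for MCTS: sample action
    # sequences, simulate each one, keep the best-scoring plan.
    steers, accels = [-0.3, 0.0, 0.3], [-2.0, 0.0, 2.0]
    best, best_score = None, float("-inf")
    for _ in range(n_plans):
        candidate = [(random.choice(steers), random.choice(accels)) for _ in range(horizon)]
        score = score_rollout(state, candidate, obstacles, goal)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```

Note that nothing in the planner knows what an obstacle is - it only knows positions, sizes, and a scoring rule, which is exactly the point.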

And this is how you would get superhuman driving. Just as a neural net that learns to play a particular game finds really unique strategies, you would see similar things here. For example, it would be able to avoid situations where you would get rear-ended: it would predict the collision, see that the emergency lane is open, and create a control plan to move the car out of the way. And from a product perspective, you can imagine how advantageous this would be in terms of development and improvement.

And to answer your question, you wouldn't really even need end to end tests to find bugs - you would just need to make sure your sensor model is accurate, which can be done simply by driving the car and letting it observe the world. It's much simpler to do than comparable systems because you don't care about what an object is; you just care whether it's part of the terrain, and if it isn't, you only care about its size in terms of taking up space, and its trajectory.
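A hypothetical sketch of that sensor-side idea - raw points get reduced to anonymous obstacles with just a position, a size, and a trajectory, never a class label (the clustering and frame-matching here are deliberately naive):

```python
import math

def cluster(points, gap=1.5):
    # Greedy one-pass clustering of 2-D points by x-distance (toy approach).
    points = sorted(points)
    clusters, current = [], [points[0]]
    for p in points[1:]:
        if p[0] - current[-1][0] <= gap:
            current.append(p)
        else:
            clusters.append(current)
            current = [p]
    clusters.append(current)
    return clusters

def to_obstacle(pts):
    # Size and position only - we never ask what the object *is*.
    cx = sum(p[0] for p in pts) / len(pts)
    cy = sum(p[1] for p in pts) / len(pts)
    radius = max(math.dist((cx, cy), p) for p in pts)
    return {"center": (cx, cy), "radius": radius}

def with_velocity(prev, curr, dt=0.1):
    # Match obstacles across two frames by nearest center; the displacement
    # over dt gives the trajectory the sim needs as an initial condition.
    out = []
    for ob in curr:
        nearest = min(prev, key=lambda p: math.dist(p["center"], ob["center"]))
        vx = (ob["center"][0] - nearest["center"][0]) / dt
        vy = (ob["center"][1] - nearest["center"][1]) / dt
        out.append({**ob, "velocity": (vx, vy)})
    return out
```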