Adding a depth channel rarely yields a massive performance gain, likely due to data scarcity and the fact that modern VLAs are good at guessing distance directly from RGB. I have used multiple RGB-D cameras, but it is hard to get stable images without jitter. Depth can still be useful for high-level reasoning. PI also uses bounding-box or segmentation data from PI-05 for that.

PI smartly combined discretized tokens with flow-matching for efficient training, and it works well in most cases. Still, an end-effector (EEF) representation may be better for teleop with devices like a SpaceMouse, a VR controller, or a Vive Tracker. PI-07 also supports EEF, but I am not sure how much data is needed to fine-tune PI-05 for that.
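For anyone unfamiliar with the "discretized tokens" part, here is a minimal sketch of the general idea using plain uniform binning. This is an assumption for illustration only, not PI's actual tokenizer; the function names, the bin count, and the action range are all made up.

```python
import numpy as np

def discretize_actions(actions, low, high, n_bins=256):
    """Map continuous actions in [low, high] to integer tokens in [0, n_bins - 1].

    Uniform binning is an illustrative assumption, not PI's tokenizer.
    """
    actions = np.clip(np.asarray(actions, dtype=np.float64), low, high)
    scaled = (actions - low) / (high - low)  # now in [0, 1]
    tokens = (scaled * n_bins).astype(np.int64)
    # The upper bound itself would land in bin n_bins, so fold it into the last bin.
    return np.minimum(tokens, n_bins - 1)

def undiscretize_actions(tokens, low, high, n_bins=256):
    """Map tokens back to the center of their bin."""
    return low + (np.asarray(tokens) + 0.5) / n_bins * (high - low)

# Hypothetical normalized action values in [-1, 1].
actions = np.array([-1.0, 0.0, 1.0])
tokens = discretize_actions(actions, low=-1.0, high=1.0)
recovered = undiscretize_actions(tokens, low=-1.0, high=1.0)
```

The round trip loses at most half a bin width per dimension, which is the precision gap a continuous head such as flow-matching avoids.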

I'd suggest starting with the default pi05 model. Data strategy is probably more important than model improvements, since VLA performance depends heavily on the data/action distribution, and the data is easy to modify. After that, you can add high-level reasoning like PI-05. I visited a Chinese VLA company that already adopted the PI-05 approach, and it works quite well in practice.

This all makes a lot of sense, thanks for sharing!

For depth I agree on the VLA route, but for ACT / DP-style imitation learning from scratch it seems more feasible (since you're not fighting a pretrained model that was not trained on this modality). It might also increase robustness, since you naturally end up with an input that's invariant to colors / textures. The plan is to try both paths: training from scratch (and then ablating RGB vs RGB-D) and VLA + fine-tuning.
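For the RGB vs RGB-D ablation, a rough sketch of how the input could be built so that the only difference between the two runs is the extra channel. The function name and the 2 m clipping range are assumptions, not taken from any particular codebase.

```python
import numpy as np

def make_observation(rgb, depth=None, max_depth_m=2.0):
    """Return an HxWx3 (RGB) or HxWx4 (RGB-D) float32 array scaled to [0, 1].

    rgb:   HxWx3 uint8 image.
    depth: optional HxW depth map in meters; invalid pixels may be NaN or inf.
    max_depth_m: assumed workspace range used for clipping and normalization.
    """
    rgb = rgb.astype(np.float32) / 255.0
    if depth is None:
        return rgb  # RGB-only arm of the ablation
    depth = depth.astype(np.float32)
    # Replace sensor dropouts so they do not propagate NaNs into training.
    depth = np.nan_to_num(depth, nan=0.0, posinf=max_depth_m, neginf=0.0)
    depth = np.clip(depth, 0.0, max_depth_m) / max_depth_m
    return np.concatenate([rgb, depth[..., None]], axis=-1)

# Synthetic frame, just to show the shapes.
rgb = np.zeros((4, 4, 3), dtype=np.uint8)
depth = np.full((4, 4), 1.0)
depth[0, 0] = np.nan  # simulated dropout pixel
obs_rgb = make_observation(rgb)
obs_rgbd = make_observation(rgb, depth)
```

Mapping invalid pixels to 0 is one choice among several; passing a separate validity mask as a fifth channel would be another.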

If you're using depth, you're better off starting with a diffusion policy (DP). We benchmarked ACT, DP, pi0, and pi05 on the same task, and ACT underperformed in most cases.

There is already plenty of research on multimodal diffusion policies. While DP typically doesn't require pre-training, you can increase your data size by running a depth estimation model on open data.
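Roughly, the depth-estimation idea could look like the sketch below. The `estimate_depth` callable is a placeholder for whatever monocular depth model you pick, and `dummy_estimator` is only there so the snippet runs; neither is a real library API.

```python
import numpy as np

def add_pseudo_depth(rgb_frames, estimate_depth):
    """Attach an estimated depth map to each RGB-only frame.

    estimate_depth: hypothetical callable, HxWx3 image -> HxW depth map.
    """
    samples = []
    for rgb in rgb_frames:
        depth = np.asarray(estimate_depth(rgb), dtype=np.float32)
        if depth.shape != rgb.shape[:2]:
            raise ValueError("depth map must match the image height and width")
        # Keep a flag so estimated depth can be separated from sensor depth later.
        samples.append({"rgb": rgb, "depth": depth, "depth_is_estimated": True})
    return samples

def dummy_estimator(rgb):
    """Stand-in that returns mean brightness; NOT a real depth model."""
    return rgb.mean(axis=-1)

frames = [np.zeros((2, 3, 3), dtype=np.uint8)]
augmented = add_pseudo_depth(frames, dummy_estimator)
```

Monocular estimators often predict relative rather than metric depth, so the estimated maps may need rescaling before they are mixed with real sensor depth.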