This all makes a lot of sense, thanks for sharing!

For depth, I agree with you on the VLA route, but with ACT / DP-style imitation learning from scratch it seems more feasible, since you're not fighting a pretrained model that was never trained on this modality. It might also increase robustness, since you naturally end up with an input that's invariant to colors and textures. The plan is to try both paths: training from scratch (and then ablating RGB vs RGB-D), and VLA + fine-tuning.
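For the RGB vs RGB-D ablation, here is a rough sketch of how the observation could be built so that depth is a single switch. The function name, the `max_depth_m` clipping range, and the assumption that depth arrives in meters are all mine, not from any particular framework:

```python
import numpy as np

def make_observation(rgb, depth, use_depth=True, max_depth_m=2.0):
    """Build a channels-first policy input from one camera frame.

    Assumed inputs (hypothetical, adjust to your sensor):
      rgb:   (H, W, 3) uint8
      depth: (H, W) float, in meters, with NaN/inf for invalid pixels
    """
    rgb_f = rgb.astype(np.float32) / 255.0
    if not use_depth:
        # RGB-only arm of the ablation: shape (3, H, W)
        return np.transpose(rgb_f, (2, 0, 1))
    # Replace invalid readings with 0, then clip and scale to [0, 1]
    d = np.nan_to_num(depth.astype(np.float32), nan=0.0, posinf=0.0, neginf=0.0)
    d = np.clip(d, 0.0, max_depth_m) / max_depth_m
    rgbd = np.concatenate([rgb_f, d[..., None]], axis=-1)
    # RGB-D arm of the ablation: shape (4, H, W)
    return np.transpose(rgbd, (2, 0, 1))
```

Keeping everything else identical and flipping only `use_depth` should make the two runs directly comparable; the first conv layer of the encoder just needs 3 or 4 input channels accordingly.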

If you're using depth, you're better off starting with a diffusion policy (DP). We benchmarked ACT, DP, pi0, and pi05 on the same task, and ACT underperformed in most cases.

There is already plenty of research on multimodal diffusion policies. While DP typically doesn't require pre-training, you can increase your dataset size by running a depth estimation model on open data.
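As a sketch of what that augmentation step might look like: take RGB-only frames from an open dataset and attach estimated depth to each one. `estimate_depth` is a stand-in for whatever monocular depth model you pick, and `dummy_estimator` is a placeholder so the snippet runs, not a real model. All names here are hypothetical:

```python
import numpy as np

def add_pseudo_depth(rgb_frames, estimate_depth):
    """Attach an estimated depth map to each RGB frame.

    rgb_frames:     iterable of (H, W, 3) uint8 arrays
    estimate_depth: callable mapping one RGB frame to an (H, W) depth map
                    (hypothetical interface; wrap your model to match it)
    """
    out = []
    for rgb in rgb_frames:
        depth = np.asarray(estimate_depth(rgb), dtype=np.float32)
        if depth.shape != rgb.shape[:2]:
            raise ValueError(f"depth shape {depth.shape} != image shape {rgb.shape[:2]}")
        # Flag the sample so estimated depth can be told apart from sensor depth
        out.append({"rgb": rgb, "depth": depth, "depth_is_estimated": True})
    return out

def dummy_estimator(rgb):
    # Placeholder only: uses brightness as fake depth so the example runs
    return rgb.mean(axis=-1) / 255.0

frames = [np.full((2, 3, 3), 255, dtype=np.uint8)]
augmented = add_pseudo_depth(frames, dummy_estimator)
```

One caveat: monocular estimators often output relative rather than metric depth, so the scale may not match your real sensor, and you would probably want to normalize both sources the same way before mixing them.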