If you're using depth, you're better off starting with a diffusion policy (DP). We benchmarked ACT, DP, pi0, and pi05 on the same task, and ACT underperformed in most cases.
There is already plenty of research on multimodal diffusion policies. While DP typically doesn't require pre-training, you can grow your dataset by combining a depth estimation model with open data, i.e. estimating depth for open datasets that only ship RGB.
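
Here is a rough sketch of what the "depth estimation model + open data" idea could look like in practice. Everything in it is an assumption for illustration: the episode layout (dicts with `rgb`, `depth`, `action` arrays), the helper names, and especially `stub_depth_estimator`, which is a hypothetical placeholder that just averages colour channels so the code runs without model weights. You would swap in a real monocular depth model there.

```python
import numpy as np

def stub_depth_estimator(rgb):
    """Hypothetical stand-in for a monocular depth model.

    Takes one (H, W, 3) uint8 frame and returns an (H, W) float32 map.
    It only averages the colour channels; it does NOT estimate real depth.
    """
    return rgb.astype(np.float32).mean(axis=-1)

def normalize_depth(depth, eps=1e-6):
    """Rescale a depth map to [0, 1] per frame."""
    lo, hi = float(depth.min()), float(depth.max())
    return (depth - lo) / max(hi - lo, eps)

def add_pseudo_depth(episode, estimator=stub_depth_estimator):
    """Return a copy of an RGB-only episode with a per-frame 'depth' array added."""
    depth = np.stack([normalize_depth(estimator(frame)) for frame in episode["rgb"]])
    return {**episode, "depth": depth.astype(np.float32), "depth_is_estimated": True}

def build_training_set(own_episodes, open_rgb_episodes, estimator=stub_depth_estimator):
    """Combine episodes with measured depth and open episodes with estimated depth."""
    own = [{**ep, "depth_is_estimated": False} for ep in own_episodes]
    extra = [add_pseudo_depth(ep, estimator) for ep in open_rgb_episodes]
    return own + extra

# Toy data: one episode recorded with a depth sensor, one RGB-only "open" episode.
rng = np.random.default_rng(0)
own_episode = {
    "rgb": rng.integers(0, 256, size=(4, 8, 8, 3), dtype=np.uint8),
    "depth": rng.random((4, 8, 8), dtype=np.float32),
    "action": rng.random((4, 7), dtype=np.float32),
}
open_episode = {
    "rgb": rng.integers(0, 256, size=(5, 8, 8, 3), dtype=np.uint8),
    "action": rng.random((5, 7), dtype=np.float32),
}
dataset = build_training_set([own_episode], [open_episode])
```

The `depth_is_estimated` flag is there because sensor depth and estimated depth can differ in scale and noise, so you may want to filter or weight the two sources differently during training.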