AR has a lot of limitations, and that's what they were trying to sidestep: take a frame of the environment, composite the virtual frame and the real one with various blending ops at your disposal, and finally present the user with a properly composited image that works anywhere.
Now, if you stack multiple lenses, with one applying a grayscale mask to the real frame just before the lens onto which you project the virtual frame, you can do some limited blending ops. But it's quite difficult to deal with parallax if the lenses aren't glued together, and even then refraction remains a problem.
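A rough way to see what blend ops each approach allows, as a toy NumPy sketch (single-pixel RGB frames, idealized optics; the names and values here are just illustrative assumptions):

```python
import numpy as np

# Toy 1-pixel RGB "frames" in [0, 1]: a dim real scene and a virtual overlay.
real = np.array([0.2, 0.2, 0.2])     # camera passthrough frame
virtual = np.array([0.9, 0.1, 0.1])  # rendered virtual frame
alpha = 1.0                          # virtual pixel fully covers this spot

# Video passthrough (the VR-headset route): any blend op is possible in
# software, including ones that DARKEN the real scene, e.g. classic "over".
passthrough = virtual * alpha + real * (1 - alpha)

# Optical see-through WITHOUT an occlusion mask: the display can only add
# light on top of the world, so virtual content can never darken the scene.
additive = np.clip(real + virtual, 0, 1)

# Optical see-through WITH a grayscale masking layer (the multi-lens idea):
# first attenuate the real light per pixel, then add the projected frame.
mask = 1 - alpha                     # block the real scene where virtual covers
masked_optical = np.clip(real * mask + virtual, 0, 1)
```

With full coverage, passthrough and the masked optical path both reproduce the intended color, while the purely additive path washes out: the real scene leaks through and the dark channels come out too bright. That leak is exactly the limitation the masking layer tries to fix, and what passthrough sidesteps entirely.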
Apple took a very conservative approach, and made it to market. Now look at the competition: sure, they have concepts with good compositing, but the headsets actually on the market right now can't produce the imagery the Apple Vision is capable of.
Maybe I'm wrong, I haven't looked at the market in some months, but I've always thought of the Apple Vision as a very pragmatic design: circumventing AR's limitations by being a VR headset.