The NPU is generally pretty weak and isn't pipelined into the GPU's logic (which is already quite large on-die). It feels like the past 10 years have taught us that if you're going to create tensor-specific hardware, it makes the most sense to put it in your GPU rather than in a dark-silicon coprocessor.