NPU/tensor cores are actually very useful for prompt pre-processing (prefill), or really any ML inference task that isn't strictly bandwidth limited. On bandwidth-bound work they tend to lose, because you end up wasting a lot of bandwidth padding/dequantizing data into a format the NPU can natively consume, whereas a GPU can do that conversion on the fly in registers/local memory. The main issue is the limited support in current ML/AI inference frameworks.
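
To make the dequantize-in-registers point concrete, here's a minimal CUDA sketch (hypothetical code, not from any real framework; the kernel name and the per-row int8-plus-float-scale quantization scheme are just illustrative assumptions). The int8 weights are read straight from global memory and expanded to fp32 in registers, so weight traffic stays at 1 byte per element; an NPU that only accepts a fixed native layout would instead need a separate pass that writes the dequantized (and tile-padded) tensor back through DRAM first.

    // Sketch, assuming per-row int8 quantization with one float scale per row.
    #include <cstdio>
    #include <cstdint>
    #include <cuda_runtime.h>

    // Quantized matvec: each thread owns one output row, reads int8 weights
    // from global memory, and dequantizes them in registers. The fp32 copy
    // of the weight matrix never exists in DRAM.
    __global__ void matvec_q8(const int8_t* W, const float* scale,
                              const float* x, float* y, int rows, int cols) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= rows) return;
        float acc = 0.0f;
        float s = scale[row];                        // per-row dequant scale
        for (int c = 0; c < cols; ++c)
            acc += (float)W[row * cols + c] * s * x[c];  // dequantize in a register
        y[row] = acc;
    }

    int main() {
        const int rows = 4, cols = 8;
        int8_t hW[rows * cols];
        float hs[rows], hx[cols], hy[rows];
        for (int i = 0; i < rows * cols; ++i) hW[i] = (int8_t)(i % 7 - 3);
        for (int r = 0; r < rows; ++r) hs[r] = 0.05f;
        for (int c = 0; c < cols; ++c) hx[c] = 1.0f;

        int8_t *dW; float *ds, *dx, *dy;
        cudaMalloc(&dW, sizeof hW);  cudaMalloc(&ds, sizeof hs);
        cudaMalloc(&dx, sizeof hx);  cudaMalloc(&dy, sizeof hy);
        cudaMemcpy(dW, hW, sizeof hW, cudaMemcpyHostToDevice);
        cudaMemcpy(ds, hs, sizeof hs, cudaMemcpyHostToDevice);
        cudaMemcpy(dx, hx, sizeof hx, cudaMemcpyHostToDevice);

        matvec_q8<<<1, 32>>>(dW, ds, dx, dy, rows, cols);
        cudaMemcpy(hy, dy, sizeof hy, cudaMemcpyDeviceToHost);
        for (int r = 0; r < rows; ++r) printf("y[%d] = %f\n", r, hy[r]);

        cudaFree(dW); cudaFree(ds); cudaFree(dx); cudaFree(dy);
        return 0;
    }

The NPU equivalent of this would be dequantize-kernel -> fp16/fp32 buffer in DRAM -> padded matvec, which is where the extra bandwidth goes: at memory-bound batch sizes that extra round trip can cost more than the matvec itself.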