Is that using the NPU on that board? I know it's possible to use those too.

It is possible (there's a superb subreddit for it) but painful to convert a modern model, and it takes ages for new models to be supported. The NPU is energy efficient but no faster than the CPU for token generation (and has lousy software support).
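For context on the conversion pain: the usual path is Rockchip's rknn-toolkit2 Python API, where you export the model to ONNX first, then build and quantize it for the NPU. A minimal sketch, assuming rknn-toolkit2 is installed; model.onnx and dataset.txt are placeholder names for your ONNX export and calibration image list:

    from rknn.api import RKNN

    rknn = RKNN(verbose=True)

    # Target the RK3588 NPU; mean/std here assume raw 0-255 RGB input.
    rknn.config(mean_values=[[0, 0, 0]], std_values=[[255, 255, 255]],
                target_platform='rk3588')

    # Load an ONNX export of the model (model.onnx is a placeholder).
    rknn.load_onnx(model='model.onnx')

    # Quantize for the NPU using a list of calibration images.
    rknn.build(do_quantization=True, dataset='dataset.txt')

    # Write out the NPU-ready model and clean up.
    rknn.export_rknn('model.rknn')
    rknn.release()

Anything with ops the toolkit doesn't know about fails at load_onnx or build, which is where modern architectures tend to get stuck.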

I’m mostly interested in the NPU to run a vision head in parallel with an LLM to speed up time to first token with VLMs (kinda want to turn them into privacy-safe vision devices for consumer use cases).

Since my comment, I remembered I had an RK3588 board, a Rock 5B, and tried llama.cpp CPU inference on it; performance was not amazing. But I also realized the Rock 5B has LPDDR4X, so don't get the cheapest RK3588 boards. My Orange Pi 5 is actually worse: it only has plain LPDDR4. Looking at the rest of Orange Pi's line-up, they don't actually have a board with both LPDDR5 and 32GB, only 16GB or LPDDR4(X).

Using llama-bench and Llama 2 7B Q4_0 like https://github.com/ggml-org/llama.cpp/discussions/10879, how does yours compare? Because I'm also comparing it with a few Ryzen 5 3000 series mini-PCs for less than $150, and those get 8 t/s on that list, which is what I've gotten myself.
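For anyone reproducing those numbers, the benchmark in that discussion is just a default llama-bench run against the Q4_0 GGUF (it reports pp512 prompt processing and tg128 generation in t/s); the model filename below is whatever your local copy is called:

    ./llama-bench -m llama-2-7b.Q4_0.gguf

    # On the RK3588 it can help to match threads to the four big A76 cores:
    ./llama-bench -m llama-2-7b.Q4_0.gguf -t 4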

With my Rock 5B and this bench, I get 3.65 t/s. On my Orange Pi 5 (not B) 8GB LPDDR4 (not X), I get 2.44 t/s.