You're most likely bottlenecked by memory bandwidth for a LLM.

The AMD AI MAX 395+ gives you 256GB/sec. The M4 gives you 120GB/s, and the M4 Pro gives you 273GB/s. The M4 Max: 410GB/s (14‑core CPU/32‑core GPU) or 546GB/s (16‑core CPU/40‑core GPU).

It’s both. If you’re using any real amount of context, you need compute too.

Yeah, memory bandwidth is often the limitation for floating point operations.