80 tps with a 5080 + 3090 combo is wild. I've been working with a 4090 and two Tenstorrent p150 cards, and manage only about 30 tps using all three for Qwen3.6 27B Q8. Guess I've got more optimization to do.

Would like to see the perf of their setup with and without MTP and n-gram speculative decoding though, as well as parallel decode performance (once llama.cpp MTP plays well with multiple slots).

Being in California, though, electricity alone makes this non-competitive with just paying a cloud provider.
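For anyone who wants to sanity-check that against their own bill, here's a rough back-of-the-envelope sketch. The function name is made up, and the 600 W draw and $0.40/kWh rate are placeholder assumptions, not measurements from anyone's rig; the 30 tps is just the figure quoted upthread. It only counts electricity, not hardware cost or idle draw.

```python
def electricity_cost_per_million_tokens(watts: float,
                                        tokens_per_sec: float,
                                        usd_per_kwh: float) -> float:
    """Electricity-only cost (USD) to generate one million tokens.

    Assumes constant power draw and constant throughput while generating.
    """
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    seconds = 1_000_000 / tokens_per_sec   # time to produce 1M tokens
    hours = seconds / 3600
    kwh = (watts / 1000) * hours           # energy used over that time
    return kwh * usd_per_kwh


if __name__ == "__main__":
    # Placeholder inputs: 600 W wall draw and $0.40/kWh are assumptions.
    cost = electricity_cost_per_million_tokens(
        watts=600, tokens_per_sec=30, usd_per_kwh=0.40
    )
    print(f"${cost:.2f} per 1M tokens (electricity only)")
```

Plug in your own wall-power reading and utility rate, then compare the result with whatever per-million-token price your cloud provider lists.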

I get 28 tps for Qwen3.6 27B on a Ryzen AI Max+ 395, with enough spare memory to run another two small models on the side. 60 tps for 35B. I'm surprised this is not more common.

That’s the cost of using a new hardware provider. A single RTX Pro 6000 Blackwell Max-Q will do better than that and be much more usable. I have 2 running DS4 Flash at 160 tok/s with max num seqs 4.

Very interesting though, these Tenstorrent chips. Might get one to experiment with.

How is the software compatibility with the Tenstorrent cards? Are you stuck using vendor-supplied runtimes/models?

It's surprising how little these things come up, given the price they go for.