Hacker News

overgard 16 hours ago

I can't speak for an RTX 5090, but on a desktop RTX 5080 with 16GB of VRAM I get about 6 tokens/sec after some tweaking (f16 -> q4_0). That's on the borderline of usable. Realistically you probably need either a 5090 with more VRAM or something like a Mac with a unified memory architecture.
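For a rough sense of why dropping from f16 to q4_0 matters on a 16GB card, here is a back-of-the-envelope sketch. The bits-per-weight figures assume llama.cpp-style block formats (32 weights plus one fp16 scale per block). The parameter count, the `overhead_gib` allowance, and both function names are made up for illustration, so plug in your own numbers.

```python
# Approximate bits per weight, assuming llama.cpp-style block quantization:
# q4_0 and q8_0 pack 32 weights plus one fp16 scale into each block.
BITS_PER_WEIGHT = {
    "f16": 16.0,
    "q8_0": 8.5,  # (32 * 8 + 16) / 32
    "q4_0": 4.5,  # (32 * 4 + 16) / 32
}


def weight_size_gib(n_params: float, fmt: str) -> float:
    """Rough size of the weights alone in GiB (no KV cache, no activations)."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 2**30


def fits_in_vram(n_params: float, fmt: str, vram_gib: float,
                 overhead_gib: float = 2.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in VRAM.

    overhead_gib is a guess standing in for KV cache and runtime buffers.
    """
    return weight_size_gib(n_params, fmt) + overhead_gib <= vram_gib


if __name__ == "__main__":
    n_params = 30e9  # hypothetical model size, not taken from the thread
    for fmt in BITS_PER_WEIGHT:
        size = weight_size_gib(n_params, fmt)
        verdict = "fits" if fits_in_vram(n_params, fmt, 16.0) else "spills to CPU"
        print(f"{fmt:>5}: {size:6.1f} GiB of weights, {verdict} on a 16 GiB card")
```

Anything that doesn't fit gets partly offloaded to system RAM, which is usually where the tokens/sec number falls off.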

datadrivenangel 15 hours ago

My M5 Pro is getting ~11 tokens per second via OMLX for an 8-bit quant.

angoragoats 14 hours ago

A Mac is not going to be all that much faster than a 5080 on any model, other than the ones you currently can’t run at all because they don’t fit in your combined GPU+CPU memory.

You’re much better off adding a second GPU if you’ve already got a PC you’re using.