I'm surprised no one else has mentioned low power mode.
With no speculative decoding, using high power mode, I get 80 t/s on 35B A3B - and the laptop gets hot and the fans spin up. On low power mode I get 38 t/s - no fans, cool to warm laptop.
If you currently don't use speculative decoding and you start using it, it can nearly offset the difference between high and low power, and it's a night-and-day experience.
I almost always keep my laptop on low power mode.
Awesome idea! Will try it out. Wish there was a way to enable low power on a per-app basis. Scrolling and reading on low power mode is really annoying.
> Wish there was a way to enable low power on a per-app basis.
You can control the low power mode setting from the command line: `sudo pmset -a lowpowermode 1`.
It should be pretty straightforward to hook this up to Hammerspoon[1] using hs.application.frontmostApplication() to apply the setting based on whatever foreground application you choose.
That said, thinking out loud, the need for sudo might make this slightly more complex. I suppose an always-on background admin agent might be needed to bypass the password prompts (or add pmset to the sudoers file, if you prefer).
[1]: https://www.hammerspoon.org/
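For what it's worth, here is a rough sketch of the decision logic, written in Python rather than Hammerspoon's Lua just to show the shape of it. Everything named here is an assumption for illustration: the `FULL_POWER_APPS` set, the function names, and the choice to default to low power. It also assumes a sudoers rule already exists so that `sudo -n` (which fails rather than prompting) can run `pmset`. In Hammerspoon, the app name would come from the frontmost application instead of being passed in by hand.

```python
import subprocess

# Hypothetical set of apps that should get full power while frontmost
# (e.g. for smooth scrolling). Everything else stays on low power.
FULL_POWER_APPS = {"Safari", "Firefox"}


def desired_mode(frontmost_app):
    """Return 0 (low power off) for listed apps, 1 (low power on) otherwise."""
    return 0 if frontmost_app in FULL_POWER_APPS else 1


def pmset_command(mode):
    """Build the pmset invocation from the thread, run non-interactively.

    `sudo -n` makes sudo fail instead of asking for a password, so this
    only works if pmset has been allowed in the sudoers file (assumed).
    """
    return ["sudo", "-n", "pmset", "-a", "lowpowermode", str(mode)]


def apply_for(frontmost_app, dry_run=True):
    """Return the command for the given app; execute it only if dry_run is False."""
    cmd = pmset_command(desired_mode(frontmost_app))
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Dry run: just show what would be executed for two example apps.
    for app in ("Safari", "Terminal"):
        print(app, "->", " ".join(apply_for(app)))
```

The dry-run default is deliberate: you can check the mapping before anything touches the power settings, then call `apply_for(name, dry_run=False)` from whatever watches the foreground app.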
Unfortunately doesn't cover scrolling HN while the agent toils away.
Can you mention what inference stack you're using? I've tried MTP several times with that model and it always seems to significantly cut my token generation speed from ~60 tokens/sec to ~40 (M3 Max).
Will give this a try later. I enjoy working with A3B Coder, but the heat coming out of my 32GB M5 is a lot. This might be the trick - thanks!
It's a less efficient use of the GPU and uses more electricity overall, no?
Oh no, 0.6 kWh a day!
Yes, this is a tradeoff that forgoes the efficiency of race-to-idle.