I'm kind of interested in a setup where one buys local hardware specifically to run a crap ton of small-to-medium LLMs locally 24/7 at high throughput. These models might now be smart enough to make all kinds of autonomous agent workflows viable at a cheap price, given a good queue-prioritization system for queries to fully utilize the hardware.
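A minimal sketch of what that queue prioritization could look like, assuming the usual convention that lower numbers mean higher priority; `submit`/`drain` and the stubbed "model output" are all illustrative, and in a real setup the drain loop would call a local inference server instead:

```python
import heapq
import itertools

_counter = itertools.count()  # tie-breaker so equal priorities stay FIFO

def submit(queue, prompt, priority):
    """Enqueue a job; lower priority number = served sooner."""
    heapq.heappush(queue, (priority, next(_counter), prompt))

def drain(queue):
    """Run queued prompts highest-priority first, keeping the hardware busy."""
    results = []
    while queue:
        priority, _, prompt = heapq.heappop(queue)
        results.append(f"[p{priority}] {prompt}")  # stand-in for model output
    return results

jobs = []
submit(jobs, "summarize inbox", priority=2)
submit(jobs, "fix failing test", priority=0)   # interactive, jumps the queue
submit(jobs, "nightly refactor", priority=5)   # background batch work
print(drain(jobs))
```

Interactive queries preempt batch work, and anything idle-time (overnight agents) sits at the bottom so the GPU/NPU never starves but latency-sensitive jobs still feel snappy.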
Adding to my own comment now that I've read the announcement in a little more detail: I find the assertion that the model's coding performance surpasses their own flagship 397B model from last generation fairly convincing.
These sound like significant, genuine gains unless one of the following is true, both of which seem really unlikely:
1. They somehow managed to benchmaxx every coding benchmark way harder than their own last generation.
2. They held back the coding performance of their last-generation 397B model on purpose to make this 3.6 Qwen model look good (basically a tinfoil-hat theory, as it would require 4D chess and deliberate self-sabotage).
So, it's pretty safe to say that we actually have a competent agentic coding model we can leave running on a prosumer laptop overnight to create real software for almost zero token cost.
I would love to have a shit load of small (27B dense / 35B MoE) agents running locally, looking at and ingesting every bit of data about me, my life, and what I get up to, to see what sort of correlations they find. Give a coding agent access to a data lake of events and let it build up its own analytics tooling to extract and draw out information from that data, then present it to me as daily/weekly/monthly summaries.
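A toy version of that event-lake-to-summary loop, just to make the shape concrete; the schema and `daily_summary` helper are made up, and the `Counter` is standing in for whatever analytics an agent would actually generate:

```python
from collections import Counter
from datetime import date

# Toy "data lake": in the real setup an agent would build its own
# queries and tooling over a much larger event store.
events = [
    {"day": date(2025, 1, 6), "kind": "commit"},
    {"day": date(2025, 1, 6), "kind": "email"},
    {"day": date(2025, 1, 7), "kind": "commit"},
]

def daily_summary(events, day):
    """Stand-in for agent-generated analytics over one day's events."""
    counts = Counter(e["kind"] for e in events if e["day"] == day)
    return ", ".join(f"{n}x {k}" for k, n in sorted(counts.items()))

print(daily_summary(events, date(2025, 1, 6)))  # → "1x commit, 1x email"
```

The interesting part is the feedback loop: the agent writes and iterates on functions like `daily_summary` itself, rather than you hand-coding every report.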
That's definitely doable. I'm planning something similar, except more webscraping / newsfeed / monitoring oriented.
I've got 3x SBCs that can run the Gemma 4 26B MoE on the NPU. Around 4 W extra power, 3 tokens a second, so they can hammer away at tasks 24/7 without moving the needle on the electricity bill.
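Back-of-the-envelope math on those figures (3 SBCs, 3 tok/s each, 4 W extra each; the electricity price is my assumption, plug in your own):

```python
sbcs = 3
tokens_per_sec = 3      # per SBC, from the NPU figure above
extra_watts = 4         # per SBC
price_per_kwh = 0.30    # assumed electricity price, USD

tokens_per_day = sbcs * tokens_per_sec * 86_400
kwh_per_day = sbcs * extra_watts * 24 / 1000
cost_per_day = kwh_per_day * price_per_kwh

print(f"{tokens_per_day:,} tokens/day")                    # 777,600
print(f"{kwh_per_day:.3f} kWh/day = ${cost_per_day:.3f}")  # 0.288 kWh = $0.086
```

Roughly three-quarters of a million tokens a day for under a dime of electricity, which really is noise on a power bill.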
I wonder if some investment firms are already doing this internally at a large scale. (Probably.)
They are - I’ve seen it.
They just use APIs, though. There's very little interest within those firms in doing the model engineering and inference in-house.
This was along my line of thinking at one point as well, though I'm now more interested in having it experiment autonomously on my software projects overnight.
Buy any Strix Halo box and have fun with your 128GB of VRAM.
I wonder whether it would be much more cost-effective, in terms of token throughput per hardware-plus-power cost, to get actual GPUs instead, given that the model is only 27B.
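One way to frame that comparison is lifetime tokens per dollar over some horizon. Every number below is a hypothetical placeholder (hardware prices, wattages, and throughputs vary a lot); the point is the formula, not the figures:

```python
def tokens_per_dollar(hw_cost, watts, tok_per_sec, days, price_per_kwh=0.30):
    """Lifetime tokens divided by hardware + electricity cost.
    All inputs are assumptions to plug in, not measured figures."""
    tokens = tok_per_sec * 86_400 * days
    energy_cost = watts * 24 * days / 1000 * price_per_kwh
    return tokens / (hw_cost + energy_cost)

# Hypothetical two-year comparison -- swap in real prices and benchmarks:
strix = tokens_per_dollar(hw_cost=2000, watts=120, tok_per_sec=30, days=730)
gpu   = tokens_per_dollar(hw_cost=1800, watts=350, tok_per_sec=90, days=730)
print(f"Strix Halo: {strix:,.0f} tok/$   discrete GPU: {gpu:,.0f} tok/$")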
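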