I'm currently working through research and testing for an article on Ars about the Spark and what things one might do with it, and I've kind of stumbled into a two-LLM agentic setup with Qwen3.6-35B-A3B (via nvidia/Qwen3.6-35B-A3B-NVFP4) as the planning agent and the FP8 version of Qwen3-Coder-30B-A3B-Instruct (Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8) as the coding agent that the planner delegates tasks down to. I'm sticking with vLLM as the inference engine, and I've got it wired together into a 2-agent loop with Opencode.
The Qwen3.6-35B-A3B planner hums along at 50-55 tokens/s, and the Qwen3-Coder-30B-A3B-Instruct coder does 30-35. With both agents up and ready to work, RAM consumption sits at about 112 of 128GB.
It's pretty okay. I'm faffing around with having it disassemble old MS-DOS games from the 1980s, which is a task that lends itself well to the setup. It's not the fastest thing in the world, but with the planner's context window at 256k tokens and the coding agent at 128k, they chew through pretty long task lists handing things back and forth without complaint. The only real issue is that even with really tightly scoped prompts, the coding agent tends to hallucinate like it's on LSD. But the planning agent appears to be quite good at spotting the hallucinations and re-parceling work back to the coder.
It's neat. I'm going to be sad when I have to return the review unit in a couple of months.
edit - I also have been fiddling with Deepseek v4 Flash via Antirez's setup (https://github.com/antirez/ds4), and it's pretty fantastic (and fantastically easy to get running). It's pretty pokey on the Spark, though, at 14-ish tokens/sec. And unless you have a second Spark, it's going to be the only model you run at one time, as it eats alllll the rams.
Long time Ars reader, looking forward to your article (and have a few DOS games to reverse in mind already)!
Is this with a Ghidra MCP or some other technique? And why two models - did you try using Qwen3.6-35B-A3B for everything? (Or 27B or a bigger model since you have the RAM for it)
I haven't paired it with Ghidra MCP; because the games are relatively tiny (I'm starting with one of my personal favorites, Karl Buiter's Sentinel Worlds I: Future Magic, which is like <700KB all in), I made a first baseline pass with Fable a couple of days ago while it was still working and it created a bunch of tiny python tools with Capstone. Qwen picked those right up and has had equal success with them. I might try adding Ghidra into the mix, but it seems overkill at the moment.
I went with a pair of models primarily just to see if I could make it work. It's been fine, but I'm going to rip out the smaller coder model today and try it with just the bigger thinking Qwen model wearing both planner & coder hats in the same loop, just with only the bigger model running.
I'm learning a lot, and primarily what I'm learning is that I'm not a developer and this stuff gets real complex real fast, especially in chasing down all the details needed to make sure I'm taking advantage of the spark hardware!