I wonder if there is a way for small local LLMs to complement each other so that the sum total yields a much more performant LLM.
Sort of like how ants in a colony produce a working "society" that no individual ant could muster.
Perhaps some radical MoE where you download _exactly_ the components you need, as you need them. Currently, MoE routing usually happens on a per-token, per-layer basis, so you need all the weights locally. But Apple, for example, made one that pre-selects all experts based on the prompt embedding. That might be scaled up further, e.g. by predicting exactly which experts will be needed before fetching anything.
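To make the difference concrete, here is a rough toy sketch (not Apple's actual method, which isn't described here) contrasting per-token routing with choosing experts once from a prompt-level embedding and fetching only those. Everything in it is an assumption for illustration: a single layer, random router keys, a mean-pooled prompt embedding, and a hypothetical `ExpertCache` whose `fetch_fn` stands in for a download.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k_experts(embedding, router_keys, k):
    """Indices of the k experts whose router key scores highest for this embedding."""
    scores = [dot(embedding, key) for key in router_keys]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def experts_touched_per_token(token_embeddings, router_keys, k):
    """Per-token routing: every token picks its own top-k, so the union can be large."""
    touched = set()
    for emb in token_embeddings:
        touched.update(top_k_experts(emb, router_keys, k))
    return touched

def mean_embedding(token_embeddings):
    """Hypothetical prompt embedding: the mean of the token embeddings."""
    n = len(token_embeddings)
    return [sum(col) / n for col in zip(*token_embeddings)]

class ExpertCache:
    """Loads expert weights lazily; fetch_fn is a stand-in for a real download."""
    def __init__(self, fetch_fn):
        self.fetch_fn = fetch_fn
        self.loaded = {}

    def get(self, idx):
        if idx not in self.loaded:
            self.loaded[idx] = self.fetch_fn(idx)
        return self.loaded[idx]

rng = random.Random(0)
DIM, NUM_EXPERTS, K = 8, 16, 2
router_keys = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(32)]

# Per-token routing: which experts would have to be available locally?
per_token = experts_touched_per_token(tokens, router_keys, K)

# Prompt-level pre-selection: decide once, then fetch only those experts.
prompt_level = top_k_experts(mean_embedding(tokens), router_keys, K)
cache = ExpertCache(fetch_fn=lambda idx: f"weights-for-expert-{idx}")
for idx in prompt_level:
    cache.get(idx)

print(len(per_token), "experts touched with per-token routing")
print(len(cache.loaded), "experts fetched with prompt-level pre-selection")
```

With per-token routing the set of touched experts is only known after seeing every token (and, in a real model, every layer), which is why all weights have to be resident. Deciding once per prompt caps the download at `K` experts, at the cost of a coarser routing decision.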
Perhaps something similar to speculative decoding.
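For anyone unfamiliar, a minimal sketch of the greedy variant of speculative decoding: a cheap draft model proposes a few tokens, and the expensive target model checks them, keeping the longest matching prefix. The two "models" below (`target_next`, `draft_next`) are made-up toy functions, not real LLMs, and the verification loop stands in for what would be one batched forward pass.

```python
def target_next(seq):
    """Toy 'large' model: next token is a deterministic function of the context."""
    return (sum(seq) * 31 + len(seq)) % 10

def draft_next(seq):
    """Toy 'small' model: matches the target unless the context length is a multiple of 4."""
    guess = target_next(seq)
    return (guess + 1) % 10 if len(seq) % 4 == 0 else guess

def greedy_decode(prompt, n_new):
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq

def speculative_decode(prompt, n_new, k=4):
    """Draft k tokens, verify with the target, keep the agreed prefix plus one target token."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(seq) < goal:
        # 1) The draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2) The target checks every drafted position (counted as one pass,
        #    since a real model would score all positions in a single batch).
        target_passes += 1
        accepted = []
        for tok in draft:
            expected = target_next(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: take the target's token
                break
            accepted.append(tok)
        else:
            # All k drafts accepted: the same pass also yields one extra token.
            accepted.append(target_next(seq + accepted))
        seq.extend(accepted)
    return seq[:goal], target_passes

prompt = [1, 2, 3]
spec_out, passes = speculative_decode(prompt, n_new=20)
print("matches plain greedy decoding:", spec_out == greedy_decode(prompt, 20))
print("target passes:", passes, "vs 20 for plain decoding")
```

The output is identical to what the target would have produced on its own; the small model only changes how many expensive passes are needed. That is the sense in which a small local model could "complement" a bigger one.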
Speculating Experts Accelerates Inference for Mixture-of-Experts: https://arxiv.org/abs/2603.19289