While it is not at all practical to train an LLM with tens or hundreds of billions of parameters on hobbyist hardware, what if there are other architectures that perform just as well but are easier for 1000 volunteers to train?

I've always wondered whether 1000 1M-parameter models, each fine-tuned for a specific task and fronted by a small router, could perform as well as a 100B model.
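To make the idea concrete, here is a minimal sketch of what "independent experts plus a small router" could look like. Everything in it is an assumption for illustration: the `Router` class, the dot-product scoring, and the toy lambda "experts" are hypothetical stand-ins, not how any existing MoE model or library does it. It also spells out the arithmetic: 1000 experts of 1M parameters each add up to 1B parameters, so the comparison with a 100B model is against something 100x larger.

```python
import math

# Parameter budget from the comment above (illustrative arithmetic only).
NUM_EXPERTS = 1000
PARAMS_PER_EXPERT = 1_000_000
TOTAL_EXPERT_PARAMS = NUM_EXPERTS * PARAMS_PER_EXPERT  # 1B total, versus a 100B dense model


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class Router:
    """Hypothetical router: one key vector per expert, scored by dot product."""

    def __init__(self, keys):
        self.keys = keys

    def route(self, x, top_k=1):
        """Return (expert_index, weight) pairs for the top_k experts."""
        probs = softmax([dot(k, x) for k in self.keys])
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        chosen = ranked[:top_k]
        norm = sum(probs[i] for i in chosen)  # renormalise over the chosen experts
        return [(i, probs[i] / norm) for i in chosen]


def run(router, experts, x, top_k=1):
    """Only the selected experts are evaluated; outputs are mixed by router weight."""
    return sum(w * experts[i](x) for i, w in router.route(x, top_k))


# Toy stand-ins for separately trained models (assumption: each maps a vector to a float).
experts = [lambda x: sum(x), lambda x: max(x), lambda x: min(x)]
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
router = Router(keys)

result = run(router, experts, [0.0, 5.0, 0.0])  # second key scores highest
```

The point of the sketch is that each expert is just a callable, so in principle each one could be trained by a different volunteer and only the small router would need to see all of them. Whether a router trained that way routes well is exactly the open question.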

And I know this is roughly how MoE works, but current MoE models still require training the model as a whole, and the big players don’t have an incentive to change that.

But the open-source community does…

It is practical, albeit not as efficient: https://arxiv.org/abs/2603.08163 . But organizing enough people with decent-enough GPUs is the challenge.