someone on twitter was exploring and linked to some related papers where you can for example trim experts on a MoE model if you're 100% sure they're never active for your specific task
what the bigger wide net bigs you is generalization
someone on twitter was exploring and linked to some related papers where you can for example trim experts on a MoE model if you're 100% sure they're never active for your specific task
what the bigger wide net bigs you is generalization