DeepSeek's MLA paper was published in 2024: https://arxiv.org/abs/2405.04434
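For anyone who hasn't read it, the core trick is simple enough to sketch in a few lines: instead of caching full per-head keys and values, you cache one small latent per token and up-project K/V from it at attention time. Here's a toy numpy version (all dimensions and weight names below are made up for illustration, and I'm leaving out the decoupled RoPE keys the paper adds on top):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_heads, d_head = 256, 8, 32   # toy sizes, not DeepSeek's
d_latent = 16                            # KV compression dim, << n_heads * d_head

# Random stand-ins for learned projection weights.
W_q   = rng.normal(size=(d_model, n_heads * d_head)) / np.sqrt(d_model)
W_dkv = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_model)           # down-proj
W_uk  = rng.normal(size=(d_latent, n_heads * d_head)) / np.sqrt(d_latent) # up-proj K
W_uv  = rng.normal(size=(d_latent, n_heads * d_head)) / np.sqrt(d_latent) # up-proj V

def mla_step(h_t, latent_cache):
    """One decode step. Only the d_latent-wide latent is cached per token,
    instead of n_heads * d_head keys *and* values."""
    latent_cache.append(h_t @ W_dkv)                 # (d_latent,)
    C = np.stack(latent_cache)                       # (t, d_latent)
    q = (h_t @ W_q).reshape(n_heads, d_head)
    K = (C @ W_uk).reshape(-1, n_heads, d_head)      # reconstruct keys on the fly
    V = (C @ W_uv).reshape(-1, n_heads, d_head)      # reconstruct values on the fly
    out = np.empty((n_heads, d_head))
    for h in range(n_heads):
        scores = K[:, h] @ q[h] / np.sqrt(d_head)
        w = np.exp(scores - scores.max()); w /= w.sum()
        out[h] = w @ V[:, h]
    return out.reshape(-1), latent_cache

cache = []
for _ in range(5):                                   # tiny decode loop
    y, cache = mla_step(rng.normal(size=d_model), cache)
print(len(cache), y.shape)   # cache holds 5 latents of size d_latent each
```

The per-token cache shrinks from 2 * n_heads * d_head floats to d_latent, which is the whole point.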
DeepSeek's Sparse Attention paper was published in February 2025: https://arxiv.org/abs/2502.11089
DeepSeek 3.2 Exp (combining MLA and DSA) was released in September 2025.
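The sparse attention part is conceptually just as compact: score the prefix with something cheap, keep the top-k tokens, and run ordinary attention over only those. A toy sketch of the selection step (the scoring projection here is my stand-in, not the actual learned indexer, and I'm reusing K as V to keep it short; the real blockwise/hardware-aligned machinery is more involved):

```python
import numpy as np

rng = np.random.default_rng(1)
d, t, k = 64, 512, 32          # toy sizes: model dim, context length, tokens kept

H = rng.normal(size=(t, d))    # hidden states of the prefix
q = rng.normal(size=d)         # current query's hidden state

# Cheap indexer: one low-dim dot-product score per past token.
# (A stand-in for the learned indexer; it only shows the top-k selection step.)
W_idx = rng.normal(size=(d, 8)) / np.sqrt(d)
scores = (H @ W_idx) @ (q @ W_idx)           # (t,) relevance scores
keep = np.argpartition(scores, -k)[-k:]      # indices of the k best tokens

# Full softmax attention only over the selected tokens: O(k) instead of O(t).
K = H[keep]
attn = np.exp(K @ q / np.sqrt(d))
attn /= attn.sum()
ctx = attn @ K                               # toy: V == K here
print(keep.shape, ctx.shape)                 # (32,) (64,)
```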
You also had several other Chinese hybrid models, like Qwen3 Next and Minimax M1.
That's still behind the times. Even the ancient dinosaur IBM had released a Mamba model [1] before this paper was even put out.
> Granite-4.0-Tiny-Base-Preview is a 7B-parameter hybrid mixture-of-experts (MoE) language model featuring a 128k token context window. The architecture leverages Mamba-2, superimposed with a softmax attention for enhanced expressiveness, with no positional encoding for better length generalization. Release Date: May 2nd, 2025
I mean, good for them for shipping, I guess. But seriously, I expect any postgrad student to be able to train a similar model with some rented GPUs. They literally teach MLA to undergrads in the basic LLM class at Stanford [2], so this isn't exactly some obscure, never-heard-of concept.
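And to be concrete about what "a similar model" looks like architecturally: a stack that is mostly cheap recurrent (SSM) layers with an occasional softmax attention layer, and no positional encodings anywhere. A toy sketch of that layer layout (the recurrence is a cartoon of Mamba-2, not the real selective scan, and the attention ratio is my guess rather than Granite's actual config):

```python
import numpy as np

rng = np.random.default_rng(2)
d, t = 64, 128                         # toy model width and sequence length
n_layers, attn_every = 12, 6           # the mostly-SSM ratio is a guess here

def ssm_layer(x, a, B, C):
    """Diagonal linear recurrence -- a cartoon of a Mamba-2 block,
    not the real selective scan: h_t = a * h_{t-1} + B x_t, y_t = C h_t."""
    h = np.zeros(d)
    out = np.empty_like(x)
    for i in range(len(x)):
        h = a * h + x[i] @ B
        out[i] = h @ C
    return x + out                      # residual connection

def attn_layer(x, Wq, Wk, Wv):
    """Single-head causal softmax attention, deliberately with no
    positional encoding (NoPE), as the Granite card describes."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    s = Q @ K.T / np.sqrt(d)
    s += np.triu(np.full((len(x), len(x)), -np.inf), k=1)  # causal mask
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return x + w @ V                    # residual connection

x = rng.normal(size=(t, d))
for layer in range(n_layers):
    if (layer + 1) % attn_every == 0:   # occasional full-attention layer...
        x = attn_layer(x, *(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3)))
    else:                               # ...the rest are cheap SSM blocks
        x = ssm_layer(x, 0.9, rng.normal(size=(d, d)) / np.sqrt(d),
                      rng.normal(size=(d, d)) / np.sqrt(d))
print(x.shape)                          # (128, 64)
```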
[1] https://huggingface.co/ibm-granite/granite-4.0-tiny-base-pre...
[2] https://youtu.be/Q5baLehv5So?t=6075