DeepSeek's MLA paper was published in 2024: https://arxiv.org/abs/2405.04434
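For anyone who hasn't read it, the core trick is simple enough to sketch in a few lines: instead of caching full per-head keys and values, you cache one small latent per token and up-project K/V from it at attention time. Here's a toy numpy version (all dimensions and weight names below are made up for illustration, and I'm leaving out the decoupled RoPE keys the paper adds on top):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_heads, d_head = 256, 8, 32   # toy sizes, not DeepSeek's
d_latent = 16                            # KV compression dim, << n_heads * d_head

# Random stand-ins for learned projection weights.
W_q   = rng.normal(size=(d_model, n_heads * d_head)) / np.sqrt(d_model)
W_dkv = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_model)           # down-proj
W_uk  = rng.normal(size=(d_latent, n_heads * d_head)) / np.sqrt(d_latent) # up-proj K
W_uv  = rng.normal(size=(d_latent, n_heads * d_head)) / np.sqrt(d_latent) # up-proj V

def mla_step(h_t, latent_cache):
    """One decode step. Only the d_latent-wide latent is cached per token,
    instead of n_heads * d_head keys *and* values."""
    latent_cache.append(h_t @ W_dkv)                 # (d_latent,)
    C = np.stack(latent_cache)                       # (t, d_latent)
    q = (h_t @ W_q).reshape(n_heads, d_head)
    K = (C @ W_uk).reshape(-1, n_heads, d_head)      # reconstruct keys on the fly
    V = (C @ W_uv).reshape(-1, n_heads, d_head)      # reconstruct values on the fly
    out = np.empty((n_heads, d_head))
    for h in range(n_heads):
        scores = K[:, h] @ q[h] / np.sqrt(d_head)
        w = np.exp(scores - scores.max()); w /= w.sum()
        out[h] = w @ V[:, h]
    return out.reshape(-1), latent_cache

cache = []
for _ in range(5):                                   # tiny decode loop
    y, cache = mla_step(rng.normal(size=d_model), cache)
print(len(cache), y.shape)   # cache holds 5 latents of size d_latent each
```

The per-token cache shrinks from 2 * n_heads * d_head floats to d_latent, which is the whole point.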
DeepSeek's Sparse Attention paper was published in February 2025: https://arxiv.org/abs/2502.11089
DeepSeek 3.2 Exp (combining MLA and DSA) was released in September 2025.
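The sparse attention part is conceptually just as compact: score the prefix with something cheap, keep the top-k tokens, and run ordinary attention over only those. A toy sketch of the selection step (the scoring projection here is my stand-in, not the actual learned indexer, and I'm reusing K as V to keep it short; the real blockwise/hardware-aligned machinery is more involved):

```python
import numpy as np

rng = np.random.default_rng(1)
d, t, k = 64, 512, 32          # toy sizes: model dim, context length, tokens kept

H = rng.normal(size=(t, d))    # hidden states of the prefix
q = rng.normal(size=d)         # current query's hidden state

# Cheap indexer: one low-dim dot-product score per past token.
# (A stand-in for the learned indexer; it only shows the top-k selection step.)
W_idx = rng.normal(size=(d, 8)) / np.sqrt(d)
scores = (H @ W_idx) @ (q @ W_idx)           # (t,) relevance scores
keep = np.argpartition(scores, -k)[-k:]      # indices of the k best tokens

# Full softmax attention only over the selected tokens: O(k) instead of O(t).
K = H[keep]
attn = np.exp(K @ q / np.sqrt(d))
attn /= attn.sum()
ctx = attn @ K                               # toy: V == K here
print(keep.shape, ctx.shape)                 # (32,) (64,)
```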
You also had several other Chinese hybrid models, like Qwen3 Next and Minimax M1.
That's still behind the times. Even the ancient dinosaur IBM had released a Mamba model [1] before this paper was even put out.
> Granite-4.0-Tiny-Base-Preview is a 7B-parameter hybrid mixture-of-experts (MoE) language model featuring a 128k token context window. The architecture leverages Mamba-2, superimposed with a softmax attention for enhanced expressiveness, with no positional encoding for better length generalization. Release Date: May 2nd, 2025
I mean, good for them for shipping, I guess. But seriously, I expect any postgrad student to be able to train a similar model with some rented GPUs. They literally teach MLA to undergrads in the basic LLM class at Stanford [2], so this isn't exactly some obscure, never-heard-of concept.
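And to be concrete about what "a similar model" looks like architecturally: a stack that is mostly cheap recurrent (SSM) layers with an occasional softmax attention layer, and no positional encodings anywhere. A toy sketch of that layer layout (the recurrence is a cartoon of Mamba-2, not the real selective scan, and the attention ratio is my guess rather than Granite's actual config):

```python
import numpy as np

rng = np.random.default_rng(2)
d, t = 64, 128                         # toy model width and sequence length
n_layers, attn_every = 12, 6           # the mostly-SSM ratio is a guess here

def ssm_layer(x, a, B, C):
    """Diagonal linear recurrence -- a cartoon of a Mamba-2 block,
    not the real selective scan: h_t = a * h_{t-1} + B x_t, y_t = C h_t."""
    h = np.zeros(d)
    out = np.empty_like(x)
    for i in range(len(x)):
        h = a * h + x[i] @ B
        out[i] = h @ C
    return x + out                      # residual connection

def attn_layer(x, Wq, Wk, Wv):
    """Single-head causal softmax attention, deliberately with no
    positional encoding (NoPE), as the Granite card describes."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    s = Q @ K.T / np.sqrt(d)
    s += np.triu(np.full((len(x), len(x)), -np.inf), k=1)  # causal mask
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return x + w @ V                    # residual connection

x = rng.normal(size=(t, d))
for layer in range(n_layers):
    if (layer + 1) % attn_every == 0:   # occasional full-attention layer...
        x = attn_layer(x, *(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3)))
    else:                               # ...the rest are cheap SSM blocks
        x = ssm_layer(x, 0.9, rng.normal(size=(d, d)) / np.sqrt(d),
                      rng.normal(size=(d, d)) / np.sqrt(d))
print(x.shape)                          # (128, 64)
```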
[1] https://huggingface.co/ibm-granite/granite-4.0-tiny-base-pre...
[2] https://youtu.be/Q5baLehv5So?t=6075