I wonder if this is what a “Minimally Viable LLM” looks like. I often wonder how much of an LLM you need before you can just give it a bigger context window and feed it any dynamic knowledge, like a PDF or markdown file, to cover what's outside its training data. I feel like LLMs don’t need more data; they just need to be refined.
You might be interested in this model. It's densely trained on math, which lets it punch way above its weight: https://github.com/WeiboAI/VibeThinker