Call me back when you can run these models on 16GB of RAM and any recent i5/i7. Until then, there’s no point in using these toy models.
It's so funny, these "toy models" would have been the wet dreams of researchers not 5 years ago.
Progress marches without mercy.
Yeah, people don't realize these "toy models" now completely destroy gpt-4o on most tasks, and no one called gpt-4o a toy model back in the day... It was OpenAI's flagship model from 2024 to 2025.
Tbh in 2024 most were calling these models useless for programming and a scam. It wasn't until this year things really changed. My experience with Qwen 3.6 is it can do things, and it's super impressive it can do things, but it's not any more productive than doing it myself.
Hello, it's the internet calling, today is that day.
https://github.com/ikawrakow/ik_llama.cpp
Edit: it's gonna be slow if you're not using any VRAM. But it's possible. Software isn't going to speed that up anytime soon, it's just a hardware bandwidth limit.
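To put a rough number on that bandwidth limit, here's a back-of-the-envelope sketch. It assumes decoding is purely memory-bound, so every generated token has to read all the active weights once. The inputs are illustrative assumptions, not measurements: 51.2 GB/s is the theoretical peak of dual-channel DDR4-3200, and the 3B active parameters and 4.5 bits per weight are hypothetical figures for a quantized MoE model.

```python
def decode_tokens_per_second(bandwidth_gb_s, active_params_billion, bits_per_weight):
    """Upper bound on tokens/s when decoding is limited by memory bandwidth.

    Assumes each token reads every active weight exactly once and ignores
    KV-cache reads, compute time, and cache effects.
    """
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


if __name__ == "__main__":
    # Assumed: dual-channel DDR4-3200 peak (3200 MT/s * 8 bytes * 2 channels).
    ram_bandwidth = 51.2
    # Hypothetical MoE with 3B active parameters at ~4.5 bits per weight.
    bound = decode_tokens_per_second(ram_bandwidth, 3.0, 4.5)
    print(f"CPU-only upper bound: {bound:.1f} tokens/s")
```

Real throughput lands below this bound, and a dense model of the same total size would be far slower because all of its weights are active on every token.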
You need it to run in about 8 GB so you have extra space for the context window.
They (35B MoE) can be run on 32GB with 8GB VRAM. I don't think these will be on 16GB for a while.
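The arithmetic behind those numbers, as a rough sketch. It only counts the weights, so the context window (KV cache), OS, and runtime overhead come on top. The 4.5 bits per weight is an assumed figure for a typical 4-bit-class quantization, not something stated in the thread.

```python
def weights_size_gb(total_params_billion, bits_per_weight):
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return total_params_billion * bits_per_weight / 8


def max_bits_per_weight(budget_gb, total_params_billion):
    """Largest average bits per weight that fits the weights in budget_gb."""
    return budget_gb * 8 / total_params_billion


if __name__ == "__main__":
    params = 35  # the 35B MoE mentioned above
    # Assumed ~4.5 bits per weight for a 4-bit-class quantization.
    print(f"Weights at 4.5 bpw: {weights_size_gb(params, 4.5):.1f} GB")
    # What the quantization would need to be to hit the budgets discussed.
    for budget in (8, 16):
        bpw = max_bits_per_weight(budget, params)
        print(f"{budget} GB budget -> at most {bpw:.2f} bits per weight")
```

At around 20 GB of weights, the model fits in 32GB RAM plus 8GB VRAM with room for context, while squeezing it into an 8 GB budget would need under 2 bits per weight.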
I have 32GB of RAM with 16GB VRAM and I haven't had a lot of luck running larger models like this. Are you able to expand on that?
Use llama.cpp with CUDA.
The problem may be that it's a 7800 XT, which handles memory contention by freezing.