In CS algorithms, we have space vs time tradeoffs.

In LLMs, we will have bigger weights vs test-time compute tradeoffs. A smaller model can get "there" but it will take longer.
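A minimal sketch of the classic CS tradeoff being referenced: caching subresults spends memory to save time, while recomputing spends time to save memory (memoized vs. naive Fibonacci as the illustrative example):

```python
from functools import lru_cache

def fib_slow(n):
    # Minimal space: recompute every subproblem, exponential time.
    return n if n < 2 else fib_slow(n - 1) + fib_slow(n - 2)

@lru_cache(maxsize=None)
def fib_fast(n):
    # O(n) space: cache every subresult, linear time.
    return n if n < 2 else fib_fast(n - 1) + fib_fast(n - 2)
```

Both compute the same answer; they just sit at different points on the space/time curve. The comment's claim is that model weights vs. test-time compute behaves analogously.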

I have spent the last 2.5 years living like a monk to maintain an app across all paid LLM providers and llama.cpp.

I wish this was true.

It isn't.

"In algorithms, we have space vs time tradeoffs, therefore a small LLM can get there with more time" is the same sort of "not even wrong" thinking we HNers smile about when we catch ourselves applying SWE reasoning to subjects that aren't CS.

What you're suggesting amounts to "monkeys on typewriters will eventually write the complete works of Shakespeare." Neither in practice nor in theory is this a technical claim, something observable, or something that has been stood up even once as a one-off misleading demo.

If "not even wrong" is more wrong than wrong, then is "not even right" more right than right?

To answer you directly: given more time, a smaller SOTA reasoning model with a table of facts can rederive relationships that a bigger model encoded implicitly.

> In LLMs, we will have bigger weights vs test-time compute tradeoffs. A smaller model can get "there" but it will take longer.

Assuming both are SOTA, a smaller model can't produce the same results as a larger model even with infinite time. Larger models inherently have more room for training more information into the model.

No amount of test-retry cycling can overcome all of those limits. The smaller models will just go in circles.

I even get the larger hosted models stuck chasing their own tail and going in circles all the time.

It's true that to train more information into the model you need more trainable parameters, but when people ask for small models, they usually mean models that run at acceptable speeds on their hardware. Techniques like mixture-of-experts allow increasing the number of trainable parameters without requiring more FLOPs, so they're large in one sense but small in another.
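A toy sketch of that MoE point (illustrative sizes and plain NumPy, not any real model's architecture): every expert's weights are trainable, but only the top-k experts run for a given token, so total parameters scale with the number of experts while per-token FLOPs scale only with k.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 16, 8, 2  # illustrative sizes only

# All experts are trainable parameters...
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
router = rng.standard_normal((d, n_experts))

def moe_forward(x):
    scores = x @ router
    chosen = np.argsort(scores)[-top_k:]  # ...but only top_k run per token
    w = np.exp(scores[chosen])
    w /= w.sum()
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, chosen))

y = moe_forward(rng.standard_normal(d))
# Parameter count grows with n_experts; per-token compute grows with top_k.
```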

And you don't necessarily need to train all information into the model, you can also use tool calls to inject it into the context. A small model that can make lots of tool calls and process the resulting large context could obtain the same answer that a larger model would pull directly out of its weights.
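A hypothetical sketch of that tool-call idea (the fact store, tool name, and routing logic are all made up for illustration): the small model only needs to know *how* to ask for a fact, not the fact itself.

```python
# Stand-in for a database, search index, or the web.
FACT_STORE = {"capital_of_france": "Paris"}

def lookup(key):
    """A 'tool' the small model calls instead of memorizing the fact."""
    return FACT_STORE.get(key, "unknown")

def small_model_answer(question):
    # Toy routing: the model recognizes what to ask for, then calls the tool.
    if "capital of France" in question:
        return lookup("capital_of_france")
    return "unknown"
```

A larger model might answer "Paris" straight out of its weights; the small model gets the same answer at the cost of an extra round trip.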

> No amount of test-retry cycle can overcome all of those limits. The smaller models will just go in circles.

That's speculative at this point. In the context of agents with external memory, this isn't so clear.

Almost all training data is on the internet. As long as the small model has enough agentic browsing ability, given enough time it will retrieve the data from the internet.

[deleted]

It doesn't work like that. An analogy would be giving a 5-year-old a task that requires an 18-year-old's understanding of the world. It doesn't matter whether you give that child 5 minutes or 10 hours, they won't be capable of solving it.

I think the question of what can be achieved with a small model comes down to what needs knowledge vs what needs experience. A small model can use tools like RAG if it is just missing knowledge, but it seems hard to avoid training/parameters where experience is needed - knowing how to perceive then act.

There is obviously also some amount (maybe a lot) of core knowledge and capability needed even to be able to ask the right questions and utilize the answers.

Small models handle simple, low-context tasks correctly most of the time. But on more complex tasks they fail: they lack the training capacity and have too few parameters to integrate the necessary relationships.

What if you give them 13 years?

Nothing will change. They will go out of context and collapse into loops.

I mean the 5yo child, not the LLM

Then they're not a 5-year-old anymore.

but in 13 years, will they be capable?

No. They will go out of context and collapse into loops.

Actually it depends on the task. For many tasks, a smaller model can handle it, and it gets there faster!