At a certain rate we will be able to move towards continuous / real-time inference systems. The discrete, turn-based solutions are quite confining in how they must be trained. Continuous and real-time would fundamentally alter the domain.
From an information theory perspective we are still in dial-up territory with regard to the actual information rate. 750 tokens per second would be a really bad dial-up connection. Imagine 10 million tokens per second.
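To put a rough number on the dial-up comparison, here is a back-of-the-envelope sketch. It assumes each token carries at most log2(vocabulary size) bits, and the 128,000-entry vocabulary is an illustrative assumption, not something stated in the thread.

```python
import math

def token_stream_bits_per_second(tokens_per_second: float, vocab_size: int) -> float:
    """Upper bound on information rate: a token carries at most log2(vocab_size) bits."""
    return tokens_per_second * math.log2(vocab_size)

VOCAB_SIZE = 128_000  # assumed vocabulary size (hypothetical, for illustration)

slow = token_stream_bits_per_second(750, VOCAB_SIZE)
fast = token_stream_bits_per_second(10_000_000, VOCAB_SIZE)

print(f"750 tokens/s        <= {slow / 1e3:.1f} kbit/s")
print(f"10,000,000 tokens/s <= {fast / 1e6:.1f} Mbit/s")
```

Under that assumption, 750 tokens per second tops out below 13 kbit/s, which is slower than a 14.4k modem, while 10 million tokens per second would be in broadband territory.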
We still have the problem that autoregressive decoders are memory-bound.
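A minimal sketch of what "memory-bound" implies, assuming every generated token requires reading all active weights from memory once and ignoring KV-cache reads, batching, and multi-GPU sharding. All three numbers below are illustrative assumptions, not measurements.

```python
def decode_tokens_per_second_bound(active_params: float,
                                   bytes_per_param: float,
                                   mem_bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode speed when weight reads dominate."""
    bytes_read_per_token = active_params * bytes_per_param
    return mem_bandwidth_bytes_per_s / bytes_read_per_token

# Illustrative assumptions only (not from the thread):
ACTIVE_PARAMS = 17e9    # parameters touched per generated token
BYTES_PER_PARAM = 1.0   # e.g. 8-bit weights
MEM_BANDWIDTH = 8e12    # 8 TB/s of memory bandwidth

bound = decode_tokens_per_second_bound(ACTIVE_PARAMS, BYTES_PER_PARAM, MEM_BANDWIDTH)
print(f"memory-bound ceiling: ~{bound:.0f} tokens/s per stream")
```

The point of the sketch is only that the ceiling scales with bandwidth divided by bytes read per token, so more compute alone does not raise it.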
The new Blackwell hardware combined with TensorRT-LLM and speculative decoding can consistently hit the 1,000 TPS/user barrier, compared with roughly 250 TPS/user (out of 10k+ TPS on the server).
Is there something I missed? This looks more like a 14.4 to 56 modem story on a 64kbps backing channel. I have no doubt that there are still massive gains to be found, but they seem to come from using existing constraints more efficiently, not from Fios arriving.
I don’t have the budget to work at foundation-model scale, but with a draft model 10x–20x faster than the target and a 60-80% acceptance rate I can see how they could promise 750 TPS (with a lot of other hard work). I would appreciate pointers on where I should look to figure out what I am missing.
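For the draft-model arithmetic, here is a sketch using the commonly cited speculative decoding analysis. It assumes each drafted token is accepted independently with probability `alpha`, that the draft proposes `gamma` tokens per round, and that `cost_ratio` is draft cost divided by target cost per forward pass. The function names are hypothetical, and real systems add overheads this ignores.

```python
def expected_tokens_per_target_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model pass with gamma drafted tokens."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Idealized wall-clock speedup over plain autoregressive decoding."""
    return expected_tokens_per_target_pass(alpha, gamma) / (gamma * cost_ratio + 1)

def best_gamma(alpha: float, cost_ratio: float, max_gamma: int = 16) -> int:
    """Draft length in 1..max_gamma that maximizes the idealized speedup."""
    return max(range(1, max_gamma + 1), key=lambda g: speedup(alpha, g, cost_ratio))

# Draft 10x-20x faster than the target, acceptance rate 60-80%.
for alpha in (0.6, 0.8):
    for cost_ratio in (0.1, 0.05):
        g = best_gamma(alpha, cost_ratio)
        print(f"alpha={alpha} cost_ratio={cost_ratio} "
              f"best gamma={g} speedup={speedup(alpha, g, cost_ratio):.2f}x")
```

Under these idealized assumptions the speedups come out as low single-digit multiples, which is the same order as going from ~250 to ~750 TPS.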
agree, from my POV the constraints are still there but we've optimized now. still haven't solved the core problems.
1000TPS - what model size?
Maverick 400B is what Nvidia used for their claim of 1k+ TPS on Blackwell GPUs.
Is there anyone exploring or writing about this in public? I've felt for a while that the turn-based model was not quite right, but also felt too stupid and ill-informed to have much of an opinion about what else it could be.
Thinking Machines, the startup founded by former OpenAI CTO Mira Murati. The interaction model demos in their videos IMO break the awkward turn-based barrier. Returning responses quickly reaches a threshold where it starts to feel like a natural conversation. Their approach to solving this problem is rather clever.
I have an active 'sleep' mode, where when the user is AFK the LLM goes into a loop with a sleep 10 between turns, and determines (via tool use) if something should be done. That's still a 'turn' in a way, but it's all the LLM just sort of sitting around like a human would, pondering what to do next.
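A minimal sketch of that kind of idle loop, assuming two hypothetical callbacks: `user_is_afk` (presence check) and `decide_action` (stand-in for one LLM turn with tool use that returns an action or `None`). The sleep function is injectable so the demo does not actually wait.

```python
import time
from typing import Callable, List, Optional

def idle_loop(user_is_afk: Callable[[], bool],
              decide_action: Callable[[], Optional[str]],
              sleep_seconds: float = 10.0,
              sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """While the user is away, take one 'pondering' turn, then sleep, and repeat."""
    actions: List[str] = []
    while user_is_afk():
        action = decide_action()  # hypothetical: LLM decides whether anything needs doing
        if action is not None:
            actions.append(action)
        sleep(sleep_seconds)      # the "sleep 10" between turns
    return actions

# Demo with canned answers and a no-op sleep.
afk_answers = iter([True, True, True, False])
decisions = iter([None, "check build status", None])
done = idle_loop(lambda: next(afk_answers), lambda: next(decisions), sleep=lambda s: None)
print(done)
```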
But I could imagine, after each space (e.g., each word), having a 27b model on a nice rig, with thinking off, do a quick look at the sentence and determine whether it should interrupt and start a real turn with thinking on. Which is kind of non-turn-based in a way. If you're typing fast, it might hit that run every 3 or 4 words, but that's sort of how a human might be when a person is talking to them. That is, waiting for enough info to interrupt, if needed.
There might be a way to process chunks of a sentence using commas as break points, e.g. for comma-delimited phrases in sentences, so the whole sentence doesn't need to be re-processed for each "should I break in" assessment at a word break.
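A rough sketch of that word-boundary gate with comma chunking. `should_interrupt` is a hypothetical stand-in for the small, thinking-off model; it receives the already-finished comma-delimited phrases separately from the phrase in progress, so a real implementation could cache work on the finished ones.

```python
from typing import Callable, Iterable, List, Optional

Gate = Callable[[List[str], str], bool]

def watch_typing(chars: Iterable[str], should_interrupt: Gate) -> Optional[str]:
    """Return the text typed so far when the gate fires, or None if it never does."""
    finished: List[str] = []  # comma-delimited phrases already closed off
    current: List[str] = []   # characters of the phrase in progress
    for ch in chars:
        current.append(ch)
        if ch == ",":
            finished.append("".join(current).strip())
            current = []
        elif ch == " ":
            phrase = "".join(current).strip()
            # Only the in-progress phrase is new input for this check.
            if phrase and should_interrupt(finished, phrase):
                return " ".join(finished + [phrase])
    return None

# Toy gate (assumption): break in as soon as the word "fire" appears.
def toy_gate(done: List[str], phrase: str) -> bool:
    return "fire" in phrase.split()

typed = "the build is fine, but there is a fire in the server room "
print(watch_typing(typed, toy_gate))
```

In the toy run the gate fires right after "fire" is typed, before the sentence is finished.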
Could be fascinating. Could actually do some of this right now.
I don't think this is what the parent poster was thinking, but the idea even at this level seems fun.
Yeah, I've played with some similar stuff on my 9070xt. But ultimately all the ceremony on top is cloaking that it's still just two or more models taking turns prompting each other to give the illusion of continuous thought. It's still one thought at a time, with every thought starting from scratch with a big chunk of prior context.
The idea of true continuous thought and memory-generation is very interesting, though I can't even begin to conceive of how it would work.
Or if it's even correct? Maybe our brains are secretly actually turn based too?
I think they're definitely attention based. They're just immensely faster than LLMs, because a lot of processing is in silicon in a sense. Think of a ball flying towards you: you don't have to think. The data (speed, direction) is handed to your conscious mind, which literally knows how to snag the ball out of the air.
But we have multiple things vying for attention, and some are immediate. Being on the phone talking to someone with great attention, and then touching a burning surface -- you immediately pull your hand back (lizard brain) before even being aware you're doing it. The same with peripheral vision and something surprising coming at you from the side. It snags your attention.
So maybe we are turn-ish based, but just multiple parallel processes each with their own turn? Neurons have their own 'trigger', and I think the brain has layers of triggers, each aggregating and filtering up to the top which then triggers.
I think doing this all with an LLM is silly, some of it should be innate, such as peripheral vision. Data handed to the main thread when triggers occur. I wouldn't want an LLM to handle "walking" fully either.
Some octopuses have a sub-brain in each tentacle, each thinking and feeling; there are serious questions as to what their mind is like. I feel the initial LLM-powered androids may have to be a bit like this.
I agree with you however I think even then you're still giving our brains too much credit. The speed definitely comes from that processing being "in silicon".
Your ball-throwing example, however, will be handled by really small and really fast "fine tuned agents" dedicated to catching that ball. Eyes to motor neuron system. There are illusion-of-free-will experiments that demonstrate your brain only rationalises and explains whatever activity took place after the fact (its explanation may even be entirely wrong).
That would be interesting.
Do you feel most of the speed upgrade will come from the software or hardware side?
And more importantly those 10 million tokens/s should cost fractions of a penny. Tokens need to be dirt cheap so I hope they build out massive solar+battery powered data centers asap.
No, anything but wasteful, weak, expensive, environmentally harmful solar. Nuclear is the only path forward for superior energy production, at least until we figure out fusion.
How is solar any of those things?
Your comment made me think of another kind of real time: real-time, dynamic code/APIs.
Imagine a world where there is no code, just things mildly handshaking and then creating data APIs on the fly. Where communication is fuzzy and locked in on an individual basis. No years of RFCs, no RFCs at all, just... data.
Just data, man.
An API arbitration aberratically assigned at authorized access, abridged and annotated, analytically assuring absolute assurance.
Why remove the code and binary artifacts, though? Don't you want to verify that the business logic is accurate and the processing is deterministic?
In some circumstances there is no substitute for something that you know will produce the same answer for a given input, consistently. And that's before even considering the watts per response.
The AI is the business logic, and the processing, and all of it. The context window is effectively infinite, with layered context window depth and speed.
Think of short and long term memory, or think of RAM vs SWAP. Dip into swap to pull needed data into RAM context. SWAP can be anything storage related, including a symbolic database or a best-encoded set of priorities.
If a person knows 100 knots, but hasn't tied one in 23 years, they might have to think a bit before they get full use of their long term memory... and tie that knot. I don't see an issue with layered speed context, that is, GPU ram, slower RAM, DB storage, all in the same format.
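The RAM vs SWAP idea could be sketched as a two-tier store. This is a toy under stated assumptions: `TieredMemory`, its `recall` method, and the dict-backed slow tier are all hypothetical names for illustration, and a real system would hold context in whatever shared format the tiers agree on.

```python
from collections import OrderedDict
from typing import Dict, Tuple

class TieredMemory:
    """Small fast tier in front of a large slow tier, with LRU demotion."""

    def __init__(self, fast_capacity: int, slow_store: Dict[str, str]) -> None:
        self.fast: "OrderedDict[str, str]" = OrderedDict()
        self.fast_capacity = fast_capacity
        self.slow = slow_store  # stand-in for slower RAM, disk, or a database

    def recall(self, key: str) -> Tuple[str, str]:
        """Return (value, tier). Slow-tier hits are promoted into the fast tier."""
        if key in self.fast:
            self.fast.move_to_end(key)  # mark as recently used
            return self.fast[key], "fast"
        value = self.slow[key]          # raises KeyError if never learned
        self.fast[key] = value
        if len(self.fast) > self.fast_capacity:
            old_key, old_value = self.fast.popitem(last=False)  # evict least recent
            self.slow[old_key] = old_value
        return value, "slow"

knots = {"bowline": "loop knot", "clove hitch": "hitch", "sheet bend": "bend"}
memory = TieredMemory(fast_capacity=2, slow_store=knots)
print(memory.recall("bowline"))  # first recall has to dip into the slow tier
print(memory.recall("bowline"))  # second recall is served from the fast tier
```

The first recall of a long-unused knot is slow and later recalls are fast, which is the "think a bit before you tie it" behaviour described above.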
Imagine a world where a 'factory' is just high-tech 3d printing, with a dozen different methods (eg, plastic, laser+metal, etc), and getting specs for everything possible is, well, an immense amount of work. Imagine having a billion item catalog of things to print, and, imagine new requests for new things to print.
And the request doesn't come from an expert, but from some dude who sketched something on the back of a cardboard box.
The LLM can pull from long term storage for how those things were done before, how similar things were done before, and just get to work.
Regardless, the connection was what I was talking about before. Data transfer. Do you need http? json once established? What? Imagine instead that's all in the wind?
And it's so fast, so capable, that dynamic is easy.
Feels like the universe did that and life spat out. There's going to be a structure.
It's very easy to see how world changing this technology will be. In a few years these AIs are going to be negotiating how they communicate with each other. Humans won't necessarily be included in that negotiation unless we have some kind of specific reason to. So many communication layers are going to be opaque to humans. We just have to trust our AIs are communicating efficiently and safely.
It will be fun running into this scenario where it's run without democratic control, and is proprietary and for-profit.
I'm pretty sure the LLM will get fed up and start writing an RPC
Also > An API arbitration aberratically assigned at authorized access, abridged and annotated, analytically assuring absolute assurance
Cool that you wrote all the words starting with "a" but I don't understand what you mean
What this made me think of is life before computers, where people mildly handshake, create agreements on the fly. "Where communication is fuzzy and locked in on an individual basis."
TBH, to me, this imagined future looks a lot like it'd have all the problems we already have.
I made this https://github.com/alehlopeh/hallu
Neat. Not precisely what I was thinking, but 100% definitely very cool and the same mental scope. It's like we wear different shoes, but go to the same cobbler.
I can imagine shoe-horning* this so the agent saves prior builds of every successfully delivered or deployed item. In my example, perhaps if someone orders new design $x, it's shipped, and review is 4+ stars, it gets added as 'successful builds'.
* have to keep with the shoe theme, even though shoe-horning is not really necessary
Wow. Sci-fi stuff!
I’ve thought about this before. No flaky config files, no updating endpoints, no status monitors. Just fuzzy everything that works almost all of the time.
Ahh yes slop at the speed of light, how useful!
AI is improving and seems to be reaching the point of not being slop (I am talking about flagship models).
If you’re still calling it slop at this point you have an axe to grind.
Do you use LLMs for anything but code?
I do, a lot of things, it’s extremely useful.