Hi, I'm the developer of zerostack! No, the memory footprint is not because of the context window size: in my benchmarks, with a 128k context loaded, it jumped from 8MB (without any chat/context loaded) to 11MB.
The reasons the memory footprint of zerostack is so small are:
- Rust, and not JS/Python, so no interpreters/VMs on top
- Load-as-needed, so we only allocate things like LLM connectors when needed
- `smallvec` used for most of the array usage of the tool (up to N items are stored on the stack; see the small-buffer sketch after this list)
- `compactstring` used for most of the string usage of the tool (up to N chars are stored on the stack)
- `opt-level=z` to force LLVM to optimize for binary size rather than performance (even though we still beat opencode in both TTFT and tool-use time)
- heavy usage of [LTO](https://en.wikipedia.org/wiki/Interprocedural_optimization#W...)
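For the curious, here is a minimal sketch of the small-buffer idea behind those two bullets. It assumes the `smallvec` and `compact_str` crates; the inline capacities shown are illustrative, not zerostack's actual values:

```rust
use compact_str::CompactString;
use smallvec::SmallVec;

fn main() {
    // Up to 8 items live inline on the stack; pushing a 9th would spill to the heap.
    let mut args: SmallVec<[u32; 8]> = SmallVec::new();
    args.extend(1..=8);
    assert!(!args.spilled()); // still no heap allocation

    // Short strings (up to 24 bytes on 64-bit targets) are also stored inline.
    let name = CompactString::new("tool:read_file");
    assert!(!name.is_heap_allocated());
}
```

The `opt-level = "z"` and `lto` settings go under `[profile.release]` in Cargo.toml.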
The context window has nothing to do with RAM usage, and even if it did, a million tokens of context is maybe 5 MB.
'A million tokens of context' is literally terabytes of KV cache VRAM on very expensive Nvidia silicon, on the model side.
On the agent side, yes, the context window does relate to RAM, because the 'entire conversational history' is generally kept in memory. So ballpark 1M 'words' across a bunch of strings. It's not all that much.
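Rough arithmetic on that ballpark (the ~4 bytes per token is an assumed average for English text and code, not a measured figure):

```rust
// Agent-side cost of keeping the whole history as plain strings.
// Assumes ~4 UTF-8 bytes per token on average; real figures vary by content.
fn main() {
    let tokens: u64 = 1_000_000;
    let bytes_per_token: u64 = 4;
    let megabytes = tokens * bytes_per_token / 1_000_000;
    println!("~{} MB of raw text for 1M tokens", megabytes); // ~4 MB, before per-String overhead
}
```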
Claude Code is not inefficient because 'it's not Rust'; it's just probably not very efficiently designed.
Rust does not really bestow magical properties that make memory usage more efficient.
A bit more efficient, maybe, but it's not going to change this situation.
'Doing it in Rust' might yield amazing returns just because the very nature of the activity is 'optimization'.
Rust "denialism" is as annoying as Rust evangelism.
Of course any seemingly idiomatic Rust is going to run circles around TS transpiled into JIT-compiled JS.
It has nothing to do with local RAM usage. But a million tokens of LLM context is decidedly not 5 MB.
The rough estimate is 2 * L * H_kv * D * bytes per element
Where:
* L = number of layers
* H_kv = # of KV heads
* D = head dimension
* factor of 2 = keys + values
The dominant factor here is typically 2 * H_kv * D since it’s usually at least 2048 bytes. Per token.
For Llama 3 8B you're looking at 128 GiB if your context is really 1M (not that that particular model supports a context that big). DeepSeek4 uses something called sparse attention, so the above calculus is improved: 1M of context would use 5-10 GiB.
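As a sanity check, here is the same formula plugged with assumed Llama-3-8B-like shape parameters (32 layers, 8 KV heads, head dimension 128, fp16); exact numbers vary per model:

```rust
// KV cache size per token: 2 * L * H_kv * D * bytes_per_element (keys + values).
// The parameters below are assumed Llama-3-8B-like values, not measured ones.
fn main() {
    let layers: u64 = 32;        // L
    let kv_heads: u64 = 8;       // H_kv (grouped-query attention)
    let head_dim: u64 = 128;     // D
    let bytes_per_elem: u64 = 2; // fp16

    let per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem;
    let tokens: u64 = 1 << 20; // ~1M tokens
    let total_gib = (per_token * tokens) as f64 / (1u64 << 30) as f64;

    println!("{} bytes per token, ~{:.0} GiB for ~1M tokens", per_token, total_gib);
    // 131072 bytes per token, ~128 GiB for ~1M tokens
}
```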
But regardless of the details, you’re off by several orders of magnitude.
Pretty sure we're talking about the output text, not the tensors.
The context window is not on your system. It's on the server with the model. There may be some local prompt caching, of some sort, but you're not locally hosting the context unless you're also locally hosting the model.
Chat history is kept locally; generally you have to send the 'whole history' to the model 'each turn'.
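A minimal sketch of that pattern, assuming an OpenAI-style chat-completions schema (`post_chat` is a hypothetical helper, not a real client library call):

```rust
// The full `messages` array (the whole conversation so far) is re-sent every turn.
use serde_json::{json, Value};

// Hypothetical HTTP helper: a real agent would POST {"model": ..., "messages": [...]}
// to the provider and return the assistant message from the response.
fn post_chat(_messages: &[Value]) -> Value {
    json!({"role": "assistant", "content": "..."})
}

fn main() {
    let mut messages = vec![json!({"role": "system", "content": "You are a coding agent."})];

    for user_turn in ["list the repo files", "now read Cargo.toml"] {
        messages.push(json!({"role": "user", "content": user_turn}));
        let reply = post_chat(&messages); // the entire history goes over the wire each time
        messages.push(reply);             // and the local copy keeps growing
    }
}
```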
That's just the plain text (or whatever files); it's not the context the model is directly working with on the server, which is tokenized, embedded into vectors, and has attention run against those vectors. The local history is generally quite small, the context generally quite a bit larger. A text conversation of a few hundred kilobytes in plain text will be gigabytes in context.
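Rough arithmetic connecting the two, reusing the assumed Llama-3-8B-like per-token figure from the example above (the ~4 bytes of UTF-8 per token is also an assumption):

```rust
// A few hundred KB of plain conversation text vs. its KV cache footprint.
fn main() {
    let plain_text_bytes: u64 = 400 * 1024;  // ~400 KB of chat history
    let tokens = plain_text_bytes / 4;       // ~4 bytes per token (assumed average)
    let kv_bytes_per_token: u64 = 131_072;   // 2 * 32 * 8 * 128 * 2, from the example above
    let kv_gib = (tokens * kv_bytes_per_token) as f64 / (1u64 << 30) as f64;
    println!("~{} tokens -> ~{:.1} GiB of KV cache", tokens, kv_gib);
    // ~102400 tokens -> ~12.5 GiB
}
```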
Only "generally"? I'm curious what API has moved away from this protocol, which seems more adapted to conversations with humans than to agentic loops.
So with the standard API you pass it all along each turn, but I think there are some odd OpenAI APIs that are different.