Hacker News

kgeist a day ago [ - ]

I administer a simple AI server in the office, which just uses a single RTX 5090 but is able to serve ~80 people throughout the day. I'm impressed by Qwen3.6-27b's capabilities in agentic coding/tasks so far. Devs say it's not much different from Sonnet 4.6 on many tasks (sometimes it even outperformed it), 40-60 tok/sec, up to 260k context. The server cost about $10k with all the bells and whistles.

I spent a lot of time researching/adding/benchmarking many custom modifications to the software stack and its settings to make the server optimally handle the load with just 1 RTX 5090 without losing quality, but it's still not enough, and the wait times in the queue are getting longer. We're at the limits of the hardware, and I'm out of tricks.

The experiment was kind of a success, and the CTO agrees we should scale it. With our own infra, we could run agents 24/7 on everything. Currently, a lot of use cases for the cloud providers are completely blocked by PII/trade secret concerns (our infosec department doesn't buy the "zero retention" promise), plus you don't have to think about billing/budgets/etc. anymore.

Now I can't decide how to scale it. On one hand, I'd like to run larger models. And we have the budget to buy, say, 8xH200. But in many benchmarks, the larger models that do fit in 8xH200 comfortably and can serve many parallel requests with acceptable speed/quality don't seem to outperform Qwen3.6 that much in agentic coding/tasks to justify the price.

So another option is just to buy a bunch of RTX 6000s and scale horizontally instead: run a copy of a midrange LLM like Qwen3.6 on each GPU. It's cheaper and easier to scale/replace, but then we'll run into problems running larger models in the future if we have to, because there's no NVLink support (say, if Alibaba & Co. stop releasing ~30b models and/or ~30b models start falling behind 400b+ models considerably).
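A minimal sketch of the horizontal option: several independent replicas of the same model, with each request sent to whichever replica currently has the fewest requests in flight. The replica names and the `LeastBusyRouter` class are hypothetical, made up for illustration; a real deployment would more likely put an existing reverse proxy or load balancer in front of the inference servers.

```python
class LeastBusyRouter:
    """Hypothetical router: picks the replica with the fewest in-flight requests."""

    def __init__(self, replicas):
        # Map each replica name to its current number of in-flight requests.
        self.in_flight = {name: 0 for name in replicas}

    def acquire(self):
        # Ties are broken by insertion order of the replicas.
        name = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[name] += 1
        return name

    def release(self, name):
        # Call this when the request on `name` has finished.
        self.in_flight[name] -= 1


if __name__ == "__main__":
    router = LeastBusyRouter(["gpu0", "gpu1", "gpu2"])
    first = router.acquire()
    second = router.acquire()
    router.release(first)
    print(first, second, router.acquire())
```

Because each GPU holds a full copy of the model, nothing here depends on GPU-to-GPU bandwidth, which is why the missing NVLink only matters once a single model no longer fits on one card.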

Does anyone here have experience running large models in a multi-GPU setup with several RTX 6000s in a high-concurrency regime and with large context lengths? (something like Deepseek 4 Flash, Minimax 2.7 etc.)

anon373839 19 hours ago [ - ]

> our infosec department doesn't buy the "zero retention" promise

They are wise to be skeptical! It is neither a promise nor zero data retention.

Look at Anthropic's Zero Data Retention policy -- and remember, this is the policy that applies to the exclusively eligible enterprise partners who can even qualify for a ZDR agreement with Anthropic:

> When ZDR is enabled, prompts and model responses generated during Claude Code sessions are processed in real time and not stored by Anthropic after the response is returned, *except where needed to comply with law or combat misuse*.

> Even with ZDR enabled, Anthropic may retain data where required by law or to address Usage Policy violations. If a session is flagged for a policy violation, *Anthropic may retain the associated inputs and outputs for up to 2 years*....

This means that Anthropic is actively inspecting all of your data with machine learning classifiers. When the usage is flagged for whatever reason as violating any aspect of Anthropic's Usage Policy, then they get to keep your data for 2 years, with no apparent limitation on what they can then use it for.

Crucially, you have ZERO guarantees about the sensitivity or specificity of these classifiers. For all anyone knows, Anthropic is silently flagging 75% of queries and retaining the data.

https://code.claude.com/docs/en/zero-data-retention

zenapollo 10 hours ago [ - ]

I wonder how AWS handles this in Bedrock. Do they use Anthropic's classifiers? Or their own? Or none? Would their data policing be different in Bedrock than in their other services?

random3 18 hours ago [ - ]

I think it’s a cost/opportunity tradeoff at best with any agreement, regardless. The rest of the contract may make it difficult to impossible to do anything about it, starting with basic arbitration clauses and ending in a ton of other provisions that can make any legal action futile. I doubt there’s much room to negotiate either.

Given that all labs need to diversify to become profitable, they’ll end up competing with their customers, and there’s nothing that exposes a business more than having AI offload every job function for every account, every mail, etc.

Assuming this won’t be an issue is naive at best.

CobaltFire a day ago [ - ]

I have a 5090 machine sitting idle that I'm considering turning into a machine for my own small team (3 devs).

Are you willing to share any lessons learned, etc. that I could make use of? We are evaluating paying for a SOTA sub or trying this, and the talk about Qwen3.6-27B makes me want to try deploying this machine.

gpt5 17 hours ago [ - ]

Sell the machine for $4K, use it to pay for Codex Pro for everyone for a year. Everyone will be significantly more productive and happy.

It's not even a real comparison if they are actually using them for coding.

If you are deploying always-running agents (e.g. monitoring logs and services) then sure, a Qwen local server is a good choice. But for coding, the cost in productivity of using a lower-performing model is way too high.

mixermachine 13 hours ago [ - ]

The 5h quota of Codex Pro on GPT 5.4 Medium lasts me for around an hour and a half, maybe 2 hours. And this is already the "savvy" setup. Enable GPT 5.5 High fast and you will be beached in 30 minutes with active development.

For continuous all-day work you definitely need a higher-tier sub.

I'm actually looking into deploying a GPU at my company because we cannot give out our code. Qwen 3.6 looks good.

gpt5 12 hours ago [ - ]

This might be true for the Plus account. For the "Pro" tier ($100-$200/month) the 5h limit is never a problem.

mixermachine 10 hours ago [ - ]

Right, I did swap that. Still, you then have to pay that 4k every year and give out the code. I also assume that prices will go up, as no AI company (except NVIDIA -> selling shovels) is currently making any money.

For some projects the giving-out-the-code part might be OK (I use Codex there too), but for the core app at the company I'm working at there is currently a strict no-AI policy. A local GPU solves this.

59nadir 15 hours ago [ - ]

Anyone who frivolously suggests throwing away possible independence in favor of dependence on a Silicon Valley company is either incredibly naïve or acting in bad faith.

nine_k 5 hours ago [ - ]

Not necessarily so. I can see how a bid to predict how things will be in 1 year in AI-based coding is likely a losing one. So the idea is to extract the maximum value now, and turn it into profits that would buy you whatever is adequate for the next steps. For comparison, the AI-based coding landscape a year ago, in May 2025, wasn't even close to what we have now, and half the key tools did not exist.

OTOH, as we see, the larger models demonstrate diminishing returns, smaller models demonstrate improvements, and hardware does not show any signs of becoming cheaper, so holding on to existing decent GPUs may, too, be a winning strategy in the longer term.

gpt5 12 hours ago [ - ]

I'll choose not to respond to your personal attack.

But in terms of actually running a dev team: you are free to use Qwen or another quantized local model that can run on an RTX 5090 for coding if it makes you feel more independent. However, you would struggle and spend many, many more hours achieving the same thing, with a lot more debugging time, long delays before it's done, and many more prompts.

It's just not the right approach. I use Qwen and other local models all the time, but for more clearly defined monitoring and classification tasks.

biddit 17 hours ago [ - ]

> Does anyone here have experience running large models in a multi-GPU setup with several RTX 6000s in a high-concurrency regime and with large context lengths? (something like Deepseek 4 Flash, Minimax 2.7 etc.)

Join the RTX6kPRO tribe!

- https://discord.gg/pYCvaQTf

- https://github.com/local-inference-lab/rtx6kpro

r0b05 15 hours ago [ - ]

How can a single 5090 serve 80 people? Something doesn't add up here.

kgeist 13 hours ago [ - ]

They don't use the server all at once. In the UI, users typically ask a question, get a response, and continue with their work. In the case of autonomous agentic loops, an agent simply waits its turn until the server is ready to accept the request. Agents don't hammer the server 24/7 every second either, because they either need to be triggered or are busy doing other work, such as compiling or running tests.

r0b05 12 hours ago [ - ]

It would be more interesting to know how many simultaneous users this setup can serve. Otherwise I could just say it serves 500 users, but not all of them at the same time, which doesn't communicate the right level of detail.
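One rough way to turn "80 people" into a simultaneous-load figure is Little's law: average requests in flight = arrival rate x average time per request. The numbers below (6 requests per user per hour, 45 seconds per request) are invented placeholders, not measurements from this server.

```python
def average_in_flight(users, requests_per_user_per_hour, seconds_per_request):
    """Little's law: mean concurrency = arrival rate * mean time in system."""
    arrival_rate_per_second = users * requests_per_user_per_hour / 3600.0
    return arrival_rate_per_second * seconds_per_request


if __name__ == "__main__":
    # Placeholder workload: 80 users, 6 requests/hour each, 45 s per request.
    concurrency = average_in_flight(80, 6, 45)
    print(f"average requests in flight: {concurrency:.1f}")
```

This is only an average; bursts (everyone kicking off agents after a standup, say) would push the momentary load well above it, which is where queue wait times come from.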

p1esk 5 hours ago [ - ]

Depends on TTFT and tokens per second you want.

jvidalv 13 hours ago [ - ]

I also call "bollocks" on this: there is no way this workflow is even 1/10 of what you can get with Codex/Claude Code.

A normal engineer may be running a couple of sessions with every session spawning sub agents left and right.

80 persons or even 10 having this workflow on this setup doesn't work, and this is the standard engineer workflow today.

zozbot234 12 hours ago [ - ]

Subagent swarms are actually great for the local inference scenario because they can share a whole lot of KV cache. You get to raise the compute intensity of decode (i.e. the aggregate tok/s) essentially for free.
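A back-of-envelope sketch of that sharing, assuming the inference server deduplicates identical prompt prefixes (prefix caching) and that every subagent starts from the same context. The function name and the token counts are hypothetical.

```python
def kv_tokens(num_agents, shared_prefix, unique_suffix, share_prefix):
    """Number of token positions that must be held in the KV cache."""
    if share_prefix:
        # The common prefix is stored once; only the suffixes are per-agent.
        return shared_prefix + num_agents * unique_suffix
    # Without sharing, every agent carries its own copy of the prefix.
    return num_agents * (shared_prefix + unique_suffix)


if __name__ == "__main__":
    # Placeholder swarm: 8 subagents, 20k-token shared context, 2k tokens each.
    without = kv_tokens(8, 20_000, 2_000, share_prefix=False)
    shared = kv_tokens(8, 20_000, 2_000, share_prefix=True)
    print(without, shared, round(without / shared, 2))
```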

mixermachine 13 hours ago [ - ]

With a parallelism of 16 you can still get around 25 to 30 tok/sec per user when all 16 channels are running. Not everyone will use the model at the same time, but it certainly will be tight, especially for agentic coding. For pure chat applications this should be quite fine.

zozbot234 12 hours ago [ - ]

The problem with wide parallelism with most models is that it blows up your KV cache. There are open models with KV caches lean enough to parallelize inference or even to offload the KV cache itself to disk without immediately running into wearout concerns, but they're quite exceptional.
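To make the blow-up concrete, here is a rough sizing sketch for a conventional full-attention transformer. The model dimensions (64 layers, 8 KV heads, head size 128, fp16 cache) and the 16 GiB cache budget are hypothetical placeholders, not the specs of any model mentioned in this thread.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """KV cache bytes for one token position; the factor 2 is keys plus values."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_concurrent_sequences(kv_budget_bytes, bytes_per_token, context_tokens):
    """How many full-length sequences fit in the cache budget at once."""
    return kv_budget_bytes // (bytes_per_token * context_tokens)


if __name__ == "__main__":
    per_token = kv_bytes_per_token(n_layers=64, n_kv_heads=8,
                                   head_dim=128, bytes_per_elem=2)
    budget = 16 * 1024**3  # hypothetical VRAM left over after the weights
    for context in (8_192, 32_768, 131_072):
        slots = max_concurrent_sequences(budget, per_token, context)
        print(f"{context:>7} tokens of context -> {slots} parallel sequences")
```

Under these assumptions each token costs 256 KiB of cache, so long contexts and many parallel slots compete directly for the same memory.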

hacker_homie 14 hours ago [ - ]

They are using it as an assistant, not running multiple fully automated agent loops?

throawayonthe 11 hours ago [ - ]

> don't seem to outperform Qwen3.6 that much in agentic coding/tasks

idk i imagine you'll hit less edges with a larger model just because.. more data

if you think of them as a kind of NN compression, it's ~obvious that the larger model can have more stuff encoded in it and hopefully accessible

i don't use LLMs much right now but using midrange models seems like an unnecessary compromise in most cases, especially since the big open models sound like they're rivaling opus and not just sonnet :p

iamtheworstdev 4 hours ago [ - ]

I thought NVLINK didn't matter anymore because of the latest PCI-E speeds. Am I wrong there?

zozbot234 a day ago [ - ]

Wouldn't that be a fairly ideal setup for layer parallelism? That doesn't need the high-performance communication of tensor parallelism, and the high-concurrency regime would make it easy to keep the pipeline full with microbatches. You'd also be able to scale out your KV cache storage since that naturally splits layer-wise.
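A small sketch of what that would look like, assuming equal-cost layers and a simple fill-and-drain schedule: the layers are cut into contiguous stages (one per GPU), and the idle "bubble" shrinks as more microbatches are kept in flight. The helper names and the 64-layer, 4-GPU figures are hypothetical.

```python
def split_layers(num_layers, num_stages):
    """Assign contiguous layer ranges to pipeline stages, as evenly as possible."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


def bubble_fraction(num_stages, num_microbatches):
    """Idle fraction of a fill-and-drain pipeline with equal-time stages."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


if __name__ == "__main__":
    for gpu, layers in enumerate(split_layers(64, 4)):
        # Each GPU stores weights and KV cache only for its own layers.
        print(f"gpu{gpu}: layers {layers.start}..{layers.stop - 1}")
    for microbatches in (1, 4, 16, 64):
        print(microbatches, round(bubble_fraction(4, microbatches), 3))
```

Only the activations at stage boundaries cross the PCIe bus, which is why this layout is less sensitive to interconnect speed than tensor parallelism.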

reissbaker 20 hours ago [ - ]

Qwen 3.6 27B is fine but it's not in the same ballpark as GLM-5.1 or Kimi K2.6.

If you truly want to scale up, you should get the 8xH200 with NVLink.

ramshanker 19 hours ago [ - ]

Thank you for the insight. This makes me feel confident that the L40S we are about to acquire, with 48GB VRAM for engineering applications, should be useful for agentic coding as well.

CamperBob2 a day ago [ - ]

For what it's worth, I've been seeing ~100 tps with 4-bit MiniMax 2.7 on two RTX 6000 boards, just running under llama-server without any optimization effort at all. I have no serious long-context experience with that setup, but at 30K context it's still above 90 tps.

If you are happy with Qwen 3.6 27B, I would personally switch the 5090 out for 2x RTX 6000s and keep running 27B. That will give you ~2x your current throughput with a lot more headroom for multiple users. More important, it would buy time to see how things develop over the next few months before you spend a whole lot of money.

nicman23 11 hours ago [ - ]

> 260k context

with a single 5090?

kgeist 9 hours ago [ - ]

Yep, Gated DeltaNet in Qwen3.6 requires much less VRAM for the KV cache than previous generations. Plus the KV cache is 8-bit.
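A rough sketch of why both factors help, under stated assumptions: in a hybrid model only the full-attention layers keep a per-token KV cache, while linear-attention layers such as Gated DeltaNet keep a fixed-size state that does not grow with context (ignored here). The 1-in-4 attention-layer fraction is a hypothetical placeholder, not a confirmed figure for Qwen3.6.

```python
def kv_cache_ratio(full_attention_fraction, kv_bits, baseline_bits=16):
    """Per-token KV cache size relative to an all-attention model at baseline_bits."""
    return full_attention_fraction * kv_bits / baseline_bits


if __name__ == "__main__":
    # Hypothetical hybrid: 1 in 4 layers uses full attention, 8-bit KV cache.
    ratio = kv_cache_ratio(full_attention_fraction=0.25, kv_bits=8)
    print(f"KV cache per token: {ratio:.3f}x of the fp16 full-attention baseline")
```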

nicman23 8 hours ago [ - ]

is it in llama.cpp?