With packages like this (lots of cores, multi-chip packaging, lots of memory channels), the architecture is increasingly a small cluster on a package rather than a monolithic CPU.
I wonder whether the next bottleneck becomes software scheduling rather than silicon - OS/runtimes weren’t really designed with hundreds of cores and complex interconnect topologies in mind.
Yes there are scheduling issues, Numa problems , etc caused by the cluster in a box form factor.
We had a massive performance issue a few years ago that we fixed by mapping our processes to the numa zones topology . The default design of our software would otherwise effectively route all memory accesses to the same numa zone and performance went down the drain.
Intel contributes to Linux, how is this a problem?
Wrong level of abstraction. NUMA is an additional layer. If the program (script, whatever) was written with a monolithic CPU in mind then the big picture logic won't account for the new details. The kernel can't magically add information it doesn't have (although it does try its best).
Given current trends I think we're eventually going to be forced to adopt new programming paradigms. At some point it will probably make sense to treat on-die HBM distinctly from local RAM and that's in addition to the increasing number of NUMA nodes.
Often the Linux scheduling improvements come a year or two after the chip. Also, Linux makes moment-by-moment scheduling and allocation decisions that are unaware of the big picture of workload requirements.
I don't think there are any fundamental bottlenecks here. There's more scheduling overhead when you have a hundred processes on a single core than if you have a hundred processes on one hundred cores.
The bottlenecks are pretty much hardware-related - thermal, power, memory and other I/O. Because of this, you presumably never get true "288 core" performance out of this - as in, it's not going to mine Bitcoin 288 as fast as a single core. Instead, you have less context-switching overhead with 288 tasks that need to do stuff intermittently, which is how most hardware ends up being used anyway.
Maybe no fundamental bottlenecks but it's easy to accidentally write software that doesn't scale as linearly as it should, e.g. if there's suddenly more lock contention than you were expecting, or in a more extreme case if you have something that's O(n^2) in time or space, where n is core count.
That's a great point. Linux has introduced io_uring, and I believe that gives us the native primitives to hide latency better?
But that's just one piece of the puzzle, I guess.
I think linux can handle upto 1024 cores just fine.
afaik the mainline limit is 4096 threads. HP sells server with 32 sockets x 60 cores/socket x 2 threads/core = 3840 threads, so we are pretty close to that limit.
I had no idea we had socket counts so high, do you know where I could find a picture of one?
It's bit cheating because it's cluster based system: https://www.hpe.com/psnow/doc/a50004268enw
So 4 sockets per chassis, up to 8 chassis in a complete system. Afaik OS sees it as single huge system, that is kinda their special sauce here.
Sounds like a HPE Compute Scale-up Server 3200, but again keep in mind that's something where there's probably a fabric between nodes one way or another.
https://xkcd.com/619/
> OS/runtimes weren’t really designed with hundreds of cores and complex interconnect topologies in mind.
I mean....
IMO Erlang/Elixir is a not-terrible benchmark for how things should work in that state... Hell while not a runtime I'd argue Akka/Pekko on JVM Akka.Net on the .NET side would be able to do some good with it...[0] Similar for Go and channels (at least hypothetically...)
[0] - Of course, you can write good scaling code on JVM or CLR without these, but they at least give some decent guardrails for getting a good bit of the Erlang 'progress guaranteed' sauce.
There definitely are bottlenecks. The one I always think of is the kernel's networking stack. There's no sense in using the kernel TCP stack when you have hundreds of independent workloads. That doesn't make any more sense than it would have made 20 years ago to have an external TCP appliance at the top of your rack. Userspace protocol stacks win.
Do the partitioned stacks of network namespaces share a single underlying global stack or are they fully independent instances? (And if not, could they be made so?)
Usually network namespaces are linked together with a single bridge so you can get lock contention there.
If you have a separate physical NIC for each namespace you probably won't have any contention.
I think you could get much of the way there by isolating a single NIC's receive queues, so the kernel doesn't decide to run off and service softirqs for random foreign tasks just because your task called tcp_sendmsg.
io_uring?
If anything, uring makes the problem much worse by reducing the cost of one process flooding kernel internals in a single syscall.
> I wonder whether the next bottleneck becomes software scheduling rather than silicon
Yep, the scheduling has been a problem for a while. There was an amazing article few years ago about how the Linux kernel was accidentally hardcoded to 8 cores, you can probably google and find it.
IMO the most interesting problem right now is the cache, you get a cache miss every time a task is moving core. Problem, with thousands of threads switching between hundreds of cores every few milliseconds, we're dangerously approaching the point where all the time is spent trashing and reloading the CPU cache.
I searched for "Linux kernel limited to 8 cores" and found this
https://news.ycombinator.com/item?id=38260935
> This article is clickbait and in no way has the kernel been hardcoded to a maximum of 8 cores.
That's the one. Funny thing, it's not actually clickbait.
The bug made it to the kernel mailing list where some Intel people looked into it and confirmed there is a bug. There is a problem where is the kernel allocation logic was capped to 8 cores, which leaves a few percent of performance off the table as the number of cores increase and the allocation is less and less optimal.
It's classic tragedy of the commons. CPU have got so complicated, there may only be a handful of people in the world who could work and comprehend a bug like this.