288 cores is an absurd number of cores.
Do these things have AVX512? It looks like some of the Sierra Forest chips do have AVX512 with 2xFMA…
That’s pretty wide. Wonder if they should put that thing on a card and sell it as a GPU (a totally original idea that has never been tried, sure…).
> 288 cores is an absurd number of cores.
Way back in the day, I built and ran the platform for a business on Pentium grade web & database servers which gave me 1 "core" in 2 rack units.
That's 24 cores per 48 unit rack, so 288 cores would be a dozen racks or pretty much an entire aisle of a typical data center.
I guess all of Palo Alto Internet eXchange (where two of my boxen lived) didn't have much more than a couple of thousand cores back in 98/99. I'm guessing there are homelabs with more cores than that entire PAIX data center had back then.
Oh yeah, it is not that many cores for the cluster-universe. Just neat to see the number of cores per socket increase.
A while ago I had access to an 8-socket shared memory machine… but this was the semi-olden days, so it was “only” 80 cores. It was a fun machine at the time! We’re so spoiled these days, haha.
Sierra Forest (the 288-core one) does not have AVX512.
Intel split their server product line in two:
* Processors that have only P-cores (currently, Granite Rapids), which do have AVX512.
* Processors that have only E-cores (currently, Sierra Forest), which do not have AVX512.
On the other hand, AMD's high-core, lower-area offerings, like Zen 4c (Bergamo) do support AVX512, which IMO makes things easier.
Largely true, but there is always a caveat.
On Zen 4 and Zen 4c the registers are 512 bits wide. However, internally, many of the datapaths (execution units, floating-point units, vector ALUs, etc.) feeding the AVX-512 functional units are only 256 bits wide…
Zen 5 is supposed to be different and, again, I wrote the kernels for Zen 5 last year, but I still have no hardware to profile the impact of this implementation difference on practical systems :(
This is an often repeated myth, which is only half true.
On Zen 4 and Zen 4c, for most vector instructions the vector datapaths have the same width as in Intel's best Xeons, i.e. they can do two 512-bit instructions per clock cycle.
The exceptions where AMD has half throughput are the vector load and store instructions between the first-level cache and the registers, and the FMUL and FMA instructions: the most expensive Intel Xeons can do two FMUL/FMA per clock cycle, while Zen 4/4c can do only one FMUL/FMA plus one FADD per clock cycle.
So only the link between the L1 cache and the vector registers and also the floating-point multiplier have half-width on Zen 4/4c, while the rest of the datapaths have the same width (2 x 512-bit) on both Zen 4/4c and Intel's Xeons.
The server and desktop variants of Zen 5/5c (and also the laptop Fire Range and Strix Halo CPUs) double the width of all vector datapaths, exceeding the throughput of all past or current Intel CPUs. Only the server CPUs expected to be launched in 2026 by Intel (Diamond Rapids) are likely to be faster than Zen 5, but by then AMD might also launch Zen 6, so it remains to be seen which will be better by the end of 2026.
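To make the comparison above concrete, here is the per-core, per-cycle FP32 arithmetic implied by those throughput figures (a back-of-the-envelope sketch only; it counts an FMA as 2 FLOPs per lane and ignores clocks, loads/stores, and everything else):

```python
# FP32 lanes in a 512-bit vector register
LANES_512 = 512 // 32  # 16

# Top Intel Xeon: two 512-bit FMAs per cycle (FMA = 2 FLOPs per lane)
xeon_flops_per_cycle = 2 * LANES_512 * 2

# Zen 4/4c: one 512-bit FMA plus one 512-bit FADD per cycle
zen4_flops_per_cycle = 1 * LANES_512 * 2 + 1 * LANES_512

print(xeon_flops_per_cycle, zen4_flops_per_cycle)  # 64 48
```

So even on the worst-case FMA-only workload, Zen 4 lands at 48/64 = 75% of the big Xeon's per-cycle FP32 rate, not half.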
512 bits is the least important part of AVX-512. You still get all the masks and the fancy functions.
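For anyone who hasn't used the masks: every AVX-512 operation can take a per-lane predicate, so lanes where the mask bit is clear keep a pass-through value instead of the result. A scalar Python sketch of the `_mm512_mask_add_ps`-style semantics (illustrative only, not real intrinsics):

```python
# Scalar model of an AVX-512 masked add: lanes with the mask bit set
# get a+b; lanes with the bit clear keep the pass-through src value.
def masked_add(src, mask, a, b):
    return [x + y if (mask >> i) & 1 else s
            for i, (s, x, y) in enumerate(zip(src, a, b))]

src = [0.0, 0.0, 0.0, 0.0]
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
print(masked_add(src, 0b0101, a, b))  # [11.0, 0.0, 33.0, 0.0]
```

This is what makes tail handling and branchy loops so much nicer than in plain AVX2.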
Sadly, no! On the bright side, they support the new AVX-VNNI extensions (VNNI on 256-bit AVX2 registers), which help with low-precision integer dot products for vector search!
SimSIMD (inside USearch (inside ClickHouse)) already has those SIMD kernels, but I don’t yet have the hardware to benchmark :(
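For context on why VNNI helps: its core instruction (`vpdpbusd`, in the non-saturating base form) fuses a 4-element unsigned-byte × signed-byte dot product with a 32-bit accumulate, per lane. A scalar Python sketch of one lane's semantics:

```python
# Scalar model of one 32-bit lane of vpdpbusd (AVX-VNNI):
# acc += dot(4 unsigned bytes, 4 signed bytes), accumulated in int32.
def vpdpbusd_lane(acc, u8x4, s8x4):
    assert all(0 <= u <= 255 for u in u8x4)      # unsigned 8-bit inputs
    assert all(-128 <= s <= 127 for s in s8x4)   # signed 8-bit inputs
    return acc + sum(u * s for u, s in zip(u8x4, s8x4))

print(vpdpbusd_lane(0, [1, 2, 3, 4], [10, -10, 10, -10]))  # -20
```

One instruction doing multiply, widen, and horizontal-add is why int8 dot products (the hot loop of vector search) get such a big speedup.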
Something that could help is to use llvm-mca or similar to get an idea of the potential speedup.
A basic block simulator like llvm-mca is unlikely to give useful information here, as memory access is going to play a significant part in the overall performance.
It is pretty wide, but 288 cores with 8x FP32 lanes each is still only about a tenth of the lanes on an RTX 5090. GPUs are really, really, really wide.
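The arithmetic behind "about a tenth" (the 21,760 figure is NVIDIA's published CUDA core count for the RTX 5090; lane-count only, ignoring clocks and per-lane throughput):

```python
# 288 E-cores x 8 FP32 lanes each (256-bit AVX2 vectors)
cpu_lanes = 288 * 8
# RTX 5090 CUDA core count per NVIDIA's published specs
gpu_lanes = 21760

print(cpu_lanes)                        # 2304
print(round(cpu_lanes / gpu_lanes, 2))  # 0.11
```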
AVX-512 is on the P-cores only (along with AMX now). The E-cores only support 256-bit vectors.
If you're doing a lot of loading and storing, these E-core chips are probably going to outperform the chips with huge cores, because those big cores would spend much of their time idling on memory stalls. For CPU-bound tasks, the P-cores will win hands down.
The 288 core SKU (I believe 6900E) isn't very widely available, I think only to big clouds?
I mean, yeah, it's "a lot" because we've been starved for so long, but having run analytics aggregation workloads I now sometimes wonder if 1k or 10k cores with a lot of memory bandwidth could be useful for some ad-hoc queries, or just being able to serve an absurd number of website requests...
CPU on PCIe card seems like it matches with the Intel Xeon Phi... I've wondered if that could boost something like an Erlang mesh cluster...
https://en.m.wikipedia.org/wiki/Xeon_Phi
how long until I have 288 cores under my desk I wonder?
Does 2x160 cores count?
https://www.titancomputers.com/Titan-A900-Octane-Dual-AMD-EP...
Damn, when I first landed on the page I saw $7,600 and thought "for 320 cores that's pretty amazing!" but that's the default configuration with 32 cores & 64GB of memory.
320 cores starts at $28,000.. $34k with 1TB of memory..
The CPU alone has a launch price of $13k, so $28k for the whole system is a good deal imo
It's not a good deal for me though. ;-)
640k of RAM is totally absurd.
So is 2 GB of storage.
And 2K of years.