FP64 performance is limited on consumer GPUs because the US government deems it important to nuclear weapons research.

Past a certain threshold of FP64 throughput, your chip goes into a separate export category and is subject to more regulation about who you can sell to, including know-your-customer requirements. FP32 throughput does not count toward this threshold.

https://en.wikipedia.org/wiki/Adjusted_Peak_Performance

It is not a market segmentation tactic and has been around since 2006. It's part of the mind-numbing annual export control training I get to take.

It's surprising that this restriction continues to linger at all. The newest nuclear warhead models in the US arsenal were developed in the 1970s, when supercomputer performance was well below 1 gigaflop. When the US stopped testing nuclear warheads in 1992, top-end supercomputers were under 10 gigaflops. The only thing the US arsenal needs faster computers for is simulating the behavior of its aging warhead stockpile without physical tests, which is not going to matter to a state building its first nuclear weapons.

This is so interesting, especially given that it is in theory possible to emulate FP64 using FP32 operations.

I do think, though, that Nvidia generally didn't see much need for more FP64 in consumer GPUs, since they wrote in the Ampere (RTX 3090) white paper: "The small number of FP64 hardware units are included to ensure any programs with FP64 code operate correctly, including FP64 Tensor Core code."

I'll try adding an additional graph plotting the APP values for all consumer GPUs up to 2023 (when the export control regime changed) to see whether the Adjusted Peak Performance argument for FP64 holds up.

Do you happen to know, though, whether GPUs count as vector processors under these regulations? The weighting factor changes depending on the classification.

https://www.federalregister.gov/documents/2018/10/24/2018-22... What I've found so far is that Note 7 says: "A ‘vector processor’ is defined as a processor with built-in instructions that perform multiple calculations on floating-point vectors (one-dimensional arrays of 64-bit or larger numbers) simultaneously, having at least 2 vector functional units and at least 8 vector registers of at least 64 elements each."

Nvidia GPUs have only 32 threads per warp, so I suppose they don't count as vector processors (which seems a bit weird, but who knows)?

Wikipedia links to this guide to the APP, published in December 2006 (much closer to when the rule itself came out): https://web.archive.org/web/20191007132037/https://www.bis.d.... At the end of the guide is a list of examples.

Only two of these examples meet the definition of a vector processor, and both are very clearly classical vector machines, the Cray X1E and the NEC SX-8 (as in, if you were preparing a guide to the historical development of vector processing, you would include these systems or their ancestors as canonical examples of what a vector supercomputer is!). And the definition is pretty clearly tailored to make sure that the SIMD units in existing CPUs wouldn't qualify as vector processors.

The interesting case is the last example, a "Hypothetical coprocessor-based Server," which describes something extremely similar to what GPGPU-based HPC systems actually turned out to be: "The host microprocessor is a quad-core (4 processors) chip, and the coprocessor is a specialized chip with 64 floating-point engines operating in parallel, attached to the host microprocessor through a specialized expansion bus (HyperTransport or CSI-like)." This hypothetical system, the guide goes on to explain, is not a "vector processor."

From what I can find, it seems that neither Nvidia nor the US government considers GPUs to be vector processors, so they get the 0.3 rather than the 0.9 weighting.
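
To make the weighting concrete: APP is just the 64-bit peak TFLOPS of each processor times its weighting factor, summed across the system, expressed in "Weighted TeraFLOPS" (WT). So a hypothetical consumer GPU with 1.0 TFLOPS of FP64 peak would score 1.0 × 0.3 = 0.3 WT under the non-vector weighting, versus 1.0 × 0.9 = 0.9 WT if it were classified as a vector processor, i.e. three times closer to whatever the control threshold happens to be at the time.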

> it is in theory possible to emulate FP64 using FP32 operations

I’d say it’s better than theory: you can definitely use float2 pairs of fp32 floats to emulate higher precision, and quad precision too, using float4. Here’s the paper with the code: https://andrewthall.com/papers/df64_qf128.pdf
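
To give a flavor of the trick, here’s a minimal sketch in C of the error-free transformations that df64-style formats are built on (Knuth’s two-sum plus an FMA-based two-product). This is not Thall’s actual code; it assumes strict IEEE fp32 semantics (compile without fast-math and with contraction off, e.g. gcc -ffp-contract=off df64.c -lm) and skips the careful renormalization and special-case handling a real library needs:

    #include <math.h>
    #include <stdio.h>

    /* A df64 value is the unevaluated sum hi + lo of two fp32 floats,
       giving roughly 48 bits of mantissa (but only fp32's exponent range). */
    typedef struct { float hi, lo; } df64;

    /* Knuth's two_sum: s + e == a + b exactly. */
    static void two_sum(float a, float b, float *s, float *e) {
        float t = a + b;
        float bb = t - a;
        *s = t;
        *e = (a - (t - bb)) + (b - bb);
    }

    /* FMA-based two_prod: p + e == a * b exactly. */
    static void two_prod(float a, float b, float *p, float *e) {
        *p = a * b;
        *e = fmaf(a, b, -*p);
    }

    /* Quick (sloppy) renormalization of a (hi, lo) pair. */
    static df64 renorm(float hi, float lo) {
        float s = hi + lo;
        return (df64){ s, lo - (s - hi) };
    }

    static df64 df64_add(df64 a, df64 b) {
        float s, e;
        two_sum(a.hi, b.hi, &s, &e);
        return renorm(s, e + a.lo + b.lo);
    }

    static df64 df64_mul(df64 a, df64 b) {
        float p, e;
        two_prod(a.hi, b.hi, &p, &e);
        /* cross terms; the lo*lo term is below the result's precision */
        return renorm(p, e + a.hi * b.lo + a.lo * b.hi);
    }

    int main(void) {
        /* 1 + 2^-30 doesn't fit in one fp32 (24-bit mantissa),
           but the pair keeps both parts. */
        df64 x = df64_add((df64){1.0f, 0.0f},
                          (df64){ldexpf(1.0f, -30), 0.0f});
        printf("hi = %.9g, lo = %.9g\n", x.hi, x.lo);
        return 0;
    }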

Also note it’s easy to emulate fp64 using entirely integer instructions. (As a fun exercise, I attempted both doubles and quads in GLSL: https://www.shadertoy.com/view/flKSzG)
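
And for a taste of the integer route (a stripped-down sketch, nothing like a compliant softfloat and not what my shadertoy does internally): unpack the IEEE-754 fields with integer ops, multiply the 53-bit mantissas, renormalize and repack. This leans on the GCC/Clang __int128 extension, truncates instead of rounding, and ignores zeros, subnormals, NaN/Inf and exponent overflow:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Minimal fp64 multiply done entirely on the integer unit.
       Assumes normal, nonzero inputs; truncates instead of rounding. */
    static double softmul(double a, double b) {
        uint64_t ia, ib;
        memcpy(&ia, &a, 8);
        memcpy(&ib, &b, 8);

        uint64_t sign = (ia ^ ib) & 0x8000000000000000ull;
        int64_t  ea = (int64_t)((ia >> 52) & 0x7ff) - 1023;
        int64_t  eb = (int64_t)((ib >> 52) & 0x7ff) - 1023;
        /* mantissas with the implicit leading 1 restored (bit 52) */
        uint64_t ma = (ia & 0xfffffffffffffull) | (1ull << 52);
        uint64_t mb = (ib & 0xfffffffffffffull) | (1ull << 52);

        /* 53x53 -> 106-bit product (GCC/Clang __int128 extension) */
        unsigned __int128 m = (unsigned __int128)ma * mb;
        int64_t e = ea + eb;
        /* product lies in [2^104, 2^106); renormalize to 53 bits */
        if (m >> 105) { m >>= 53; e += 1; } else { m >>= 52; }

        uint64_t out = sign
                     | ((uint64_t)(e + 1023) << 52)
                     | ((uint64_t)m & 0xfffffffffffffull);
        double r;
        memcpy(&r, &out, 8);
        return r;
    }

    int main(void) {
        printf("%.17g vs %.17g\n", softmul(1.5, 2.25), 1.5 * 2.25);
        return 0;
    }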

While it’s relatively easy to do, these approaches are a lot slower than fp64 hardware. My code is not optimized, not IEEE compliant, and not bug-free, but the emulated doubles are at least an order of magnitude slower than fp32, and the quads are two orders of magnitude slower. I don’t think Andrew Thall’s df64 can achieve a 1:4 float-to-double perf ratio either.

And I’m not sure, but I don’t think CUDA SMs are vector processors per se, not because of the fixed warp size but more broadly because of the design and instruction set. I could be completely wrong though, and Tensor Cores totally might count as vector processors.

What is easy to do is to emulate FP128 with FP64 (double-double) or even FP256 with FP64.
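
As a minimal illustration of the pair representation (the same two-sum trick used for float-float above, just over fp64):

    #include <stdio.h>

    int main(void) {
        /* 1 + 1e-30 doesn't fit in one double (53-bit mantissa),
           but it survives exactly as an unevaluated hi + lo pair. */
        double a = 1.0, b = 1e-30;
        double hi = a + b;                      /* rounds to exactly 1.0 */
        double t  = hi - a;
        double lo = (a - (hi - t)) + (b - t);   /* recovers the lost 1e-30 */
        printf("hi = %.17g, lo = %.17g\n", hi, lo);
        return 0;
    }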

The reason is that the exponent range of FP64 is typically sufficient to avoid overflows and underflows in most applications.

On the other hand, the exponent range of FP32 is insufficient for most scientific-technical computing.
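
A concrete illustration of the gap:

    #include <stdio.h>
    #include <float.h>

    int main(void) {
        /* fp32 tops out near 3.4e38; fp64 near 1.8e308 */
        printf("FLT_MAX = %g, DBL_MAX = %g\n", FLT_MAX, DBL_MAX);

        float  xf = 1e20f;
        double xd = 1e20;
        printf("fp32: (1e20)^2 = %g\n", xf * xf);   /* overflows to inf */
        printf("fp64: (1e20)^2 = %g\n", xd * xd);   /* 1e40, no problem */
        return 0;
    }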

For an adequate exponent range, you must use either three FP32 per FP64, or two FP32 and an integer. In this case the emulation becomes significantly slower than the simplistic double-single emulation.

With the simpler double-single emulation, you cannot expect to just plug it into most engineering applications, e.g. SPICE for electronic circuit simulation, and have the application work. Some applications could be painstakingly modified to work with such an implementation, but that is not normally an option.

So to be a drop-in replacement for standard FP64, you really must also emulate the exponent range, at the price of a much slower emulation.

I did this at some point in the past, but today it makes no sense in comparison with the available alternatives.

Today, the best FP64 performance per dollar by far is achieved with a Ryzen 9950X or Ryzen 9900X in combination with Intel Battlemage B580 GPUs.

When money does not matter, you can use AMD Epyc in combination with AMD "datacenter" GPUs, which would achieve much better performance per watt, but the performance per dollar would be abysmally low.

Oh yes, I forgot to mention it, but you’re absolutely right: Thall’s method for df64 and qf128 gives you a double/quad-precision mantissa with a single-precision exponent range, and the paper is clear about that.

FWIW, my own example (emulating doubles/quads with ints) gives the full exponent range with no wasted bits, since I’m just emulating the IEEE format directly.

Of course there are also bignum libraries that can do arbitrary precision. I guess one of the things I meant to convey but didn’t say directly is that using double precision isn’t export controlled, as one might read the top of this thread to imply, but a certain level of fp64 performance might be.

Can't wait until they update this to also include export controls around FP8 and FP4 etc. in order to combat deepfakes, and then all of a sudden we're unable to buy increasingly powerful consumer GPUs.
