> many GPUs
Citation please - every single GPU in the literal world supports integer arithmetic for operating on tid, gid, etc.
From page 175 of the AMD CDNA4 ISA:
https://www.amd.com/content/dam/amd/en/documents/instinct-te...
> V_MUL_U32_U24
> Multiply two unsigned 24-bit integer inputs and store the result as an unsigned 32-bit integer into a vector register. D0.u32 = 32'U(S0.u24) * 32'U(S1.u24)
> Notes
> This opcode is expected to be as efficient as basic single-precision opcodes since it utilizes the single-precision floating point multiplier. See also V_MUL_HI_U32_U24.
Nvidia GPUs used to do the same thing, and there's a umul24 intrinsic (__umul24 in CUDA) if you care to use it.
https://stackoverflow.com/questions/5544355/cuda-umul24-func...
This is super-super-niche since it basically only applies to 32-bit integer multiplication.
You likely won't run into it unless you're doing high-performance embedded systems or GPU programming on non-NVIDIA cards and, for some unknowable reason, your workload does a 32-bit integer multiplication in the hot path.
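To make the quoted pseudocode concrete, here is a minimal C sketch of what V_MUL_U32_U24 computes on one lane. The helper names (mul_u32_u24, mul_hi_u32_u24, self_check) are hypothetical, and the high-half function reflects my reading of V_MUL_HI_U32_U24 (upper bits of the 48-bit product), which is an assumption and not something the quote above confirms. For context, a single-precision float carries a 24-bit significand (23 stored bits plus the implicit leading one), which is presumably why 24-bit operands fit the FP multiplier.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper mirroring the quoted pseudocode:
 *   D0.u32 = 32'U(S0.u24) * 32'U(S1.u24)
 * Only the low 24 bits of each operand take part; the 48-bit product
 * is truncated to its low 32 bits. */
static uint32_t mul_u32_u24(uint32_t s0, uint32_t s1)
{
    uint64_t a = s0 & 0x00FFFFFFu;   /* drop the top 8 bits of S0 */
    uint64_t b = s1 & 0x00FFFFFFu;   /* drop the top 8 bits of S1 */
    return (uint32_t)(a * b);        /* keep the low 32 bits */
}

/* ASSUMPTION: the HI variant returns bits 47..32 of the same product. */
static uint32_t mul_hi_u32_u24(uint32_t s0, uint32_t s1)
{
    uint64_t a = s0 & 0x00FFFFFFu;
    uint64_t b = s1 & 0x00FFFFFFu;
    return (uint32_t)((a * b) >> 32);
}

/* Small usage check: bits above bit 23 in the inputs are ignored. */
static void self_check(void)
{
    assert(mul_u32_u24(3u, 5u) == 15u);
    assert(mul_u32_u24(0xFF000003u, 5u) == 15u);
    assert(mul_hi_u32_u24(3u, 5u) == 0u);
}
```

The point of the sketch is that the operation is a genuine integer multiply with a restricted input range, not a different kind of arithmetic; the full 32x32 case is what V_MUL_LO_U32 and friends cover.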
That's literally only for 32bx24b (I don't remember why we did that specifically for CDNA - I'll ask someone), but as you can see from V_MUL_HI_I32 and V_MUL_LO_U32, there is very much vector integer arithmetic hardware (never mind that we're not talking about the VALU but the conventional scalar ALU).
I think he has a point, but I am still not 100% convinced by the arguments relating to casting.
There is a difference between a u24 data type inside u32 and a u24 datatype inside u24 and that is what's so frustrating here. u24 is an alignment nightmare so it will basically never exist as "u24 in u24" and only ever as "u24 in u32".
For casting to make sense, the alignment must be compatible, and it's not clear how arbitrary-bit-width data types can be useful both for describing bit fields in packets, where padding is inherently undesirable, and for performing integer arithmetic with an FPU, where padding is an acceptable cost for alignment. These appear to be mutually exclusive use cases.
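A rough C sketch of the two representations, to show where the mismatch sits. The function names are hypothetical and little-endian byte order is an assumption made only for the example. "u24 in u24" is three unpadded bytes in a packet; "u24 in u32" is the same value widened into an aligned 32-bit container with 8 bits of padding. Moving between them is an explicit load or store, not a reinterpreting cast.

```c
#include <assert.h>
#include <stdint.h>

/* "u24 in u24": three bytes on the wire, no padding.
 * ASSUMPTION: little-endian byte order, chosen only for this sketch. */
static uint32_t load_u24_le(const uint8_t *p)
{
    /* Widen into a u32 container; the top 8 bits are zero padding. */
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16);
}

/* Narrow a "u24 in u32" value back to three packed bytes. */
static void store_u24_le(uint8_t *p, uint32_t v)
{
    assert(v <= 0x00FFFFFFu);            /* value must fit in 24 bits */
    p[0] = (uint8_t)(v & 0xFFu);
    p[1] = (uint8_t)((v >> 8) & 0xFFu);
    p[2] = (uint8_t)((v >> 16) & 0xFFu);
}
```

Byte-wise access sidesteps the alignment problem for the packed form, at the cost of extra instructions; the arithmetic itself then happens on the padded u32 form.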
While the GP might be technically wrong in a narrow sense, GPUs are built for FP, and that's what you want to be doing if you're using them as accelerators.
You don't know what you're talking about: an enormous amount of TOPS now runs through quantized (read: integer) kernels. Many GPUs don't even have FP64 support, and some lack FP32 as well.
EDIT: I was completely wrong, I have mostly worked with GGUF and related quantizations that are LUTs, thank you for correcting me.
> The quantized integer kernels aren't running true integer multiplication; the quantization is its own thing. They're basically enums, not integers
ELI-a-GPU-compiler-engineer-working-at-a-major-vendor (because I am). I.e., I can pull up the design docs for our ALUs and literally see that you're wrong.