This programming model seems like the wrong one, and I think it's based on some faulty assumptions

>Another advantage of this approach is that it prevents divergence by construction. Divergence occurs when lanes within a warp take different branches. Because thread::spawn() maps one closure to one warp, every lane in that warp runs the same code. There is no way to express divergent branching within a single std::thread, so divergence cannot occur

This is extremely problematic: being able to write divergent code between lanes is good. Virtually all high-performance GPGPU code I've ever written contains divergent code paths!
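To make this concrete, here's a sketch (illustrative names, my own CUDA, not from the article) of the kind of divergence that shows up in almost every real kernel:

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Hypothetical sketch: the per-lane branching that real GPGPU kernels use
// constantly. Each thread decides independently what to do with its element.
__global__ void process(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;            // bounds check: lanes past n already diverge

    float x = in[i];
    if (x > 0.0f) {                // data-dependent branch: lanes in the same
        out[i] = sqrtf(x);         // warp may take different paths here
    } else {
        out[i] = 0.0f;
    }
}
```

A one-closure-per-warp model with no divergence can't express even the bounds check, let alone the data-dependent branch, without padding or preprocessing the data.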

>The worst case is that a workload only uses one lane per warp and the remaining lanes sit idle. But idle lanes are strictly better than divergent lanes: idle lanes waste capacity while divergent lanes serialize execution

This is where I think it falls apart a bit, and we need to dig into GPU architecture to find out why. A lot of people think of GPUs as a bunch of executing threads grouped into warps that execute in lockstep. This is an overly restrictive model of how they work that misses a lot of the reality

GPUs are a collection of threads that are broken up into local work groups. Threads in the same work group share fast on-chip shared memory, which can be used for fast intra-work-group communication. Work groups are split up into subgroups (which map to warps) that can communicate extra fast

This is the first problem with this model: it neglects the local work group execution unit. To get adequate performance, you have to set the work group size much higher than the size of a warp: at least 64 for a 32-wide warp, though in general 128-256 is a better size. Different warps in a local work group make truly independent progress, so if a Rust model papers over this, you'll have a bad time and run into races. To get good performance and cache management, these warps need to be executing the same code. A task-per-warp design is a really bad move for performance
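As a sketch of why this matters (a textbook block-wide reduction in CUDA, illustrative and not from the article): the warps in a work group run independently, so correctness depends on explicit barriers between them:

```cuda
#include <cuda_runtime.h>

// Sketch of a block-wide sum with a work group of 256 threads = 8 warps.
// The 8 warps do NOT run in lockstep with each other, so every round of the
// tree reduction needs an explicit __syncthreads() barrier; without it, one
// warp can read partial sums that another warp hasn't written yet (a race).
__global__ void block_sum(const float* in, float* out) {
    __shared__ float buf[256];          // shared memory, visible to the block
    int t = threadIdx.x;
    buf[t] = in[blockIdx.x * 256 + t];
    __syncthreads();                    // all 8 warps must arrive here first

    for (int stride = 128; stride > 0; stride >>= 1) {
        if (t < stride)                 // only some warps do work: that's fine
            buf[t] += buf[t + stride];
        __syncthreads();                // resynchronize the independent warps
    }
    if (t == 0) out[blockIdx.x] = buf[0];
}
```

A warp-sized "thread" abstraction hides exactly the barriers that make this cooperation safe.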

>Each warp has its own program counter, its own register file, and can execute independently from other warps

The second problem: it used to be true that all threads in a warp executed in lockstep, with on/off masks for thread divergence, but this is strictly no longer true on modern GPUs; the quoted claim is just wrong. On a modern GPU, each *thread* has its own program counter and call stack, and can independently make forward progress. Divergent threads can have better throughput than you'd expect on a modern GPU, as the hardware gets more capable at handling this. Divergence isn't bad, it's just something you have to manage, and hardware architectures are rapidly improving here

Say we have two warps, both running the same code, where half of each warp splits at a divergence point. Modern GPUs will go: huh, it sure would be cool if we just shifted the threads about to produce two non divergent warps, and bam divergence solved at the hardware level. But notice that to get this hardware acceleration, we need to actually use the GPU programming model to its fullest

The key mistake is assuming that the warp model will always stick rigidly to being strictly wide SIMD units with a funny programming model, but GPUs already ditched that concept a while back, around the Volta era. As time goes on, this model will only diverge further from how GPUs actually work under the hood, which seems like an error. Right now, even with just the local work group problems, I'd guess you're leaving ~50% of your performance on the table, which seems like a bit of a problem when the entire reason to use a GPU is performance!

> Modern GPUs will go: huh, it sure would be cool if we just shifted the threads about to produce two non divergent warps, and bam divergence solved at the hardware level

Could you kindly share a source for this? Shader Execution Reordering (SER) is available for Ray tracing, but it is not a general-purpose feature that can be used in generic compute shaders.

> Divergent threads can have a better throughput than you'd expect on a modern GPU, as they get more capable at handling this. Divergence isn't bad, its just something you have to manage - and hardware architectures are rapidly improving here

I would strongly advise against this. GPUs are highly efficient when neighboring threads within a warp access neighboring data and follow largely the same code path. Even across warps, data locality is highly desirable.
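For illustration (a hypothetical CUDA sketch, not from the parent comment), the locality point is the difference between coalesced and scattered access:

```cuda
#include <cuda_runtime.h>

// Sketch: two access patterns for the same copy. Names are illustrative.

// Adjacent lanes read adjacent addresses, so the warp's 32 loads coalesce
// into a handful of wide memory transactions.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Adjacent lanes hit addresses far apart, so each lane's load can become its
// own memory transaction: same work, far lower effective bandwidth.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int j = (int)(((long long)i * stride) % n);  // scattered index
    out[j] = in[j];
}
```

None of this is in dispute; the question in the thread is whether divergence should be *inexpressible*, not whether locality is desirable.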

>I would strongly advise against this. GPUs are highly efficient when neighboring threads within a warp access neighboring data and follow largely the same code path. Even across warps, data locality is highly desirable.

It's a bit like saying that writing code at all is bad, though. Divergence isn't desirable, but neither is running any code at all; sometimes you need it to solve a problem

Not supporting divergence at all is a huge mistake IMO. It isn't good, but sometimes it's necessary

>Could you kindly share a source for this? Shader Execution Reordering (SER) is available for Ray tracing, but it is not a general-purpose feature that can be used in generic compute shaders.

https://docs.nvidia.com/cuda/cuda-programming-guide/03-advan...

My understanding is that this is fully transparent to the programmer; it's just more advanced scheduling for threads. SER is something different entirely

Nvidia are a bit vague here, so you have to go digging into patents if you want more information on how it works

>The second problem: it used to be true that all threads in a warp executed in lockstep, with on/off masks for thread divergence, but this is strictly no longer true on modern GPUs; the quoted claim is just wrong. On a modern GPU, each thread has its own program counter and call stack, and can independently make forward progress. Divergent threads can have better throughput than you'd expect on a modern GPU, as the hardware gets more capable at handling this. Divergence isn't bad, it's just something you have to manage, and hardware architectures are rapidly improving here

I haven't found any evidence of the individual program counter thing being true beyond one niche application: running mutexes for a single vector lane, which is not a performance optimization at all. In fact, you are serializing execution in the worst way possible.

From a hardware design perspective it is completely impractical to implement independent instruction pointers other than maybe as a performance counter. Each instruction pointer requires its own read port on the instruction memory, and adding 32, 64, or 128 read ports to SRAM is prohibitively expensive; even if you had those ports, divergence would still lead to some lanes finishing earlier than others.

What you're probably referring to is a scheduler trick that Nvidia has implemented where they split a streaming processor thread with divergence into two masked streaming processor threads without divergence. This doesn't fundamentally change anything about divergence being bad, you will still get worse performance than if you had figured out a way to avoid divergence. The read port limitations still apply.

Threads have individual program counters according to NVIDIA, and have had them for nearly 10 years

https://docs.nvidia.com/cuda/cuda-programming-guide/03-advan...

> the GPU maintains execution state per thread, including a program counter and call stack, and can yield execution at a per-thread granularity

Divergence isn't good, but sometimes it's necessary; not supporting it in a programming model is a mistake. There are some problems you simply can't solve without it, and in some cases you absolutely will get better performance by using divergence
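One concrete case (a hedged CUDA sketch with illustrative names): a per-lane spinlock is expressible at all only because threads can make independent forward progress. On pre-Volta lockstep hardware this pattern could livelock the warp:

```cuda
#include <cuda_runtime.h>

// Sketch: a per-lane spinlock around a critical section. On a pre-Volta GPU,
// the lane holding the lock and the lanes spinning on it are forced to step
// together, so the holder may never reach the release: livelock. With
// independent thread scheduling, each thread has its own program counter, so
// the holder can make progress and release the lock.
__device__ void critical_section(int* lock, int* counter) {
    while (atomicCAS(lock, 0, 1) != 0) { }  // spin until this lane acquires
    *counter += 1;                          // divergent and serialized, but correct
    __threadfence();                        // publish before releasing
    atomicExch(lock, 0);                    // release
}
```

To be clear, this still serializes the warp (the parent is right about that) and still needs care even on Volta+; the point is only that the per-thread execution state NVIDIA describes makes such algorithms possible at all.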

People often avoid divergence by writing an algorithm that does effectively what Pascal and earlier GPUs did: unconditionally doing all the work on every thread. That can give worse performance than just having a branch, because of the better hardware scheduling these days
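Roughly, the contrast is this (an illustrative CUDA sketch, not a benchmark):

```cuda
#include <cuda_runtime.h>
#include <math.h>

// "Pascal-style" workaround: every thread computes both paths, then selects.
// The expensive path is paid by ALL threads, whether they need it or not.
__global__ void no_branch(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float cheap     = in[i] * 2.0f;
    float expensive = expf(sinf(in[i]));   // always computed
    out[i] = (in[i] > 0.0f) ? expensive : cheap;
}

// Just branching: only lanes that need the expensive path pay for it. Warps
// (or better-scheduled groups of lanes) that take the cheap path skip it.
__global__ void with_branch(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] > 0.0f) {
        out[i] = expf(sinf(in[i]));
    } else {
        out[i] = in[i] * 2.0f;
    }
}
```

Which wins depends on the data distribution and the hardware generation, which is exactly why a model that forbids the branch outright takes the choice away from you.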