Really cool experiment (the whole company, really).
Training pipelines are full of data-preparation steps that are first written on the CPU and then moved to the GPU, with constant decisions about what to keep on the CPU and what to put on the GPU, when it's worth creating a tensor, or whether to tile instead. I guess your company is betting on solving problems like this (and async-await is needed for serving inference requests directly on the GPU, for example).
My question is a little different: how do you want to handle the SIMD question? Should a Rust function run on the warp as a machine with 32-long arrays as its data types, or should it always "hope" for autovectorization to kick in (especially with Rust's iterator helpers)?
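To make the contrast concrete, here's a rough host-side sketch in plain stable Rust (no GPU API involved): an iterator chain you hope the compiler autovectorizes versus a body written explicitly against an assumed 32-wide lane array. The function names and the hardcoded 32 are purely illustrative.

```rust
const LANES: usize = 32; // assumed warp width; AMD waves can be 64, see below

/// Style 1: idiomatic iterator chain, relying on autovectorization.
fn dot_auto(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Style 2: the "warp as a machine with 32-long arrays" view -- the data is
/// explicitly shaped into LANES-wide accumulators and reduced at the end.
fn dot_lanes(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = [0.0f32; LANES];
    for (ca, cb) in a.chunks_exact(LANES).zip(b.chunks_exact(LANES)) {
        for l in 0..LANES {
            acc[l] += ca[l] * cb[l]; // one "lane" per array slot
        }
    }
    // horizontal reduction of the per-lane accumulators
    acc.iter().sum()
}

fn main() {
    let a = vec![1.0f32; 128];
    let b = vec![2.0f32; 128];
    assert_eq!(dot_auto(&a, &b), dot_lanes(&a, &b));
}
```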
I'm not even sure a 32-wide array would be good either, since on AMD warps are 64 wide. I wouldn't go fully towards autovectorization though.
Warp SIMD width should be a build-time constant. You'd program against a length-agnostic, vector-like interface that gets compiled down to a specific width as part of building the code.
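Something like this minimal sketch, assuming the lane count is injected at build time (faked here with a cargo feature flag; a real GPU backend would set it from the target) and a hypothetical `Wave<T>` wrapper as the vector-like interface. Neither is an existing API.

```rust
#[cfg(feature = "wave64")]
pub const WAVE_LANES: usize = 64; // e.g. AMD GCN-style waves
#[cfg(not(feature = "wave64"))]
pub const WAVE_LANES: usize = 32; // e.g. NVIDIA warps, RDNA wave32

/// One value per lane; the width is baked in at build time, not chosen at runtime.
#[derive(Clone, Copy)]
pub struct Wave<T>(pub [T; WAVE_LANES]);

impl Wave<f32> {
    pub fn splat(x: f32) -> Self {
        Wave([x; WAVE_LANES])
    }

    /// Lane-wise multiply-add, written against WAVE_LANES rather than a literal 32.
    pub fn mul_add(self, a: Self, b: Self) -> Self {
        let mut out = [0.0f32; WAVE_LANES];
        for l in 0..WAVE_LANES {
            out[l] = self.0[l] * a.0[l] + b.0[l];
        }
        Wave(out)
    }
}

fn main() {
    let d = Wave::splat(2.0).mul_add(Wave::splat(3.0), Wave::splat(1.0)); // 2*3+1 per lane
    assert!(d.0.iter().all(|&v| v == 7.0));
}
```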
Now that I could agree with. The only place where hiccups have started to occur is with wave intrinsics, where you can share data between threads in a wave without halting execution. I'm not sure disallowing them would be the best idea, as that cuts out possible optimizations, but outright allowing them without the user knowing the number of lanes can cause its own problems. My job is the fun pastime of fixing issues in other people's code related to all of this. I have no stake in Rust though; I'd rather write a custom SPIR-V compiler.
A compile-time constant can still be surfaced to the user, though. The code would simply be written to take the actual value into account, and that would be reflected during the build.
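For example, a full-wave sum reduction can be written against the surfaced constant rather than a literal 16/32, so the same source builds correctly whether the target compiles with 32 or 64 lanes. This is a hedged host-side sketch: the wave is simulated as an array, and `shuffle_xor` is a hypothetical stand-in for a cross-lane intrinsic, not any real API.

```rust
const WAVE_LANES: usize = 64; // supplied by the build in the scheme described above

/// Stand-in for a lane shuffle: every lane reads the value held by lane (id ^ mask).
fn shuffle_xor(wave: &[f32; WAVE_LANES], mask: usize) -> [f32; WAVE_LANES] {
    core::array::from_fn(|lane| wave[lane ^ mask])
}

/// Full-wave sum: after log2(WAVE_LANES) butterfly steps every lane holds the total.
/// Starting the loop at a hardcoded 16 would silently reduce only half of a
/// 64-wide wave; driving it from WAVE_LANES makes the width assumption explicit.
fn wave_sum(mut wave: [f32; WAVE_LANES]) -> [f32; WAVE_LANES] {
    let mut mask = WAVE_LANES / 2;
    while mask > 0 {
        let shuffled = shuffle_xor(&wave, mask);
        for lane in 0..WAVE_LANES {
            wave[lane] += shuffled[lane];
        }
        mask /= 2;
    }
    wave
}

fn main() {
    let wave: [f32; WAVE_LANES] = core::array::from_fn(|lane| lane as f32);
    let expected: f32 = (0..WAVE_LANES).map(|l| l as f32).sum();
    assert!(wave_sum(wave).iter().all(|&v| v == expected));
}
```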
I don't have a lot of faith there, but that's mainly because my experience is largely correcting people's assumption that all GPU waves are 32 lanes wide. I might be biased there specifically, since it's my job to fix those issues, though.