Thank you for explaining. Do you think there are still opportunities for stack optimizations to meaningfully speed up inference on single consumer-grade GPUs?