Try using a coding agent to write an efficient GPU kernel. I guess they might get good at it soon, but they definitely aren't there yet.

I had a very complex CUDA kernel, and Codex CLI managed to improve the throughput 20x.