I had a very complex CUDA kernel, and Codex CLI managed to improve its throughput 20x.