Oh, this is really cool! I didn't know Go had added this!

I went on a similar adventure, but in Zig. Since I had to put together a benchmarking suite anyway, I published it in case anyone needs it. If you think it might be helpful, give it a go: https://github.com/peymanmortazavi/csv-race

In my findings, using 64 bytes (512 bits), even when possible, actually degraded performance. I also had to fine-tune the numbers for different CPUs. For instance, on Apple I could go much higher, but on my CPU going to 64 bytes (512 bits) made things slower.

Another thing I explored was iterating over fields rather than records. This lets you avoid copying and dynamic memory allocation entirely, which should give you a pretty decent boost. You can add utility wrappers to match Go's record-based iteration when necessary.
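
To make the field-iteration idea concrete, here is a rough sketch in Go. It is not the code from my Zig parser: `FieldIter` and `Records` are made-up names for illustration, and it assumes plain comma-separated input with `\n` line endings and no quoted fields. The point is only that each field comes back as a subslice of the input buffer, so the hot loop never copies or allocates.

```go
package main

import "fmt"

// FieldIter is a hypothetical field-level iterator over a CSV buffer.
// Assumptions: ',' separates fields, '\n' separates records, no quoting.
type FieldIter struct {
	buf        []byte
	pos        int
	afterComma bool // true if the last delimiter seen was a comma
}

// Next returns the next field as a subslice of the input (no copy),
// whether that field is the last one in its record, and ok=false
// once the input is exhausted.
func (it *FieldIter) Next() (field []byte, endOfRecord bool, ok bool) {
	if it.pos >= len(it.buf) {
		if it.afterComma {
			// Input ended right after a comma: emit the trailing empty field.
			it.afterComma = false
			return it.buf[len(it.buf):], true, true
		}
		return nil, false, false
	}
	start := it.pos
	for i := start; i < len(it.buf); i++ {
		switch it.buf[i] {
		case ',':
			it.pos = i + 1
			it.afterComma = true
			return it.buf[start:i], false, true
		case '\n':
			it.pos = i + 1
			it.afterComma = false
			return it.buf[start:i], true, true
		}
	}
	// Last field of an input that has no trailing newline.
	it.pos = len(it.buf)
	it.afterComma = false
	return it.buf[start:], true, true
}

// Records is a hypothetical convenience wrapper that rebuilds record-based
// iteration on top of FieldIter. This is the only place that allocates,
// and only callers who want whole records pay for it.
func Records(buf []byte) [][]string {
	it := &FieldIter{buf: buf}
	var out [][]string
	var rec []string
	for {
		f, end, ok := it.Next()
		if !ok {
			break
		}
		rec = append(rec, string(f)) // copy happens here, not in Next
		if end {
			out = append(out, rec)
			rec = nil
		}
	}
	return out
}

func main() {
	data := []byte("id,name\n1,alice\n2,bob\n")

	// Field-level loop: every field aliases data, nothing is copied.
	it := &FieldIter{buf: data}
	for {
		f, end, ok := it.Next()
		if !ok {
			break
		}
		fmt.Printf("%q end=%v\n", f, end)
	}

	// Record-level view for callers that prefer it.
	fmt.Println(Records(data))
}
```

A real parser would find the delimiters with SIMD instead of the byte-by-byte loop and would need to handle quotes and `\r\n`, but the shape of the API stays the same.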

Just some thoughts, but congrats on this!