This is outside of my expertise, but wouldn't multiple threads each submitting a single operation in parallel have the same effect?

That still counts as “async” in the sense gp meant.

They wrote “thread per task”, which I assume means something like “each OS thread handles the work submitted by one user”.

This is beside the point, but something like io_uring is still significantly better than doing NVMe I/O through a thread pool.
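
For concreteness, here is roughly what I mean by the thread-pool approach: a minimal Python sketch, not anyone's actual implementation. The function name `read_blocks_threadpool`, the 4096-byte `BLOCK_SIZE`, and the worker count are all made up for illustration. It assumes a Unix-like system, since it relies on `os.pread`.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for illustration


def read_blocks_threadpool(path, offsets, workers=8):
    """Issue one blocking pread per offset, spread across a pool of OS threads."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Each task parks its worker thread inside a single pread syscall
            # until that one read completes.
            return list(pool.map(lambda off: os.pread(fd, BLOCK_SIZE, off), offsets))
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Build a small 16 KiB scratch file so the sketch can run anywhere.
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(bytes(range(256)) * 64)
        path = f.name
    try:
        blocks = read_blocks_threadpool(path, [0, 4096, 8192, 12288])
        print([len(b) for b in blocks])
    finally:
        os.unlink(path)
```

The reads do overlap, so from the caller's side it behaves asynchronously, but every operation still costs its own syscall and ties up a thread while it waits. With io_uring, one thread can queue many submissions and hand them to the kernel in a single call, which is where I'd expect most of the difference to come from.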