I've seen modest but significant speedups from very basic "#pragma omp sections"-style parallelization of this sort of thing.
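For illustration, a sections-based split of a large copy could look roughly like the sketch below. It is not the code behind the observation above: it assumes the work is a plain memcpy of one contiguous buffer, the two-way split is arbitrary, and the helper name copy_in_two_sections is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy n bytes using two OpenMP sections.
 * Build with -fopenmp (GCC/Clang). Without that flag the pragmas are
 * ignored and the two memcpy calls simply run one after the other. */
static void copy_in_two_sections(void *dst, const void *src, size_t n)
{
    size_t half = n / 2;
    char *d = dst;
    const char *s = src;

    #pragma omp parallel sections
    {
        #pragma omp section
        memcpy(d, s, half);                    /* first half */

        #pragma omp section
        memcpy(d + half, s + half, n - half);  /* second half, including any odd byte */
    }
}
```

Whether this helps at all likely depends on the copy size and on whether a single core can already saturate memory bandwidth, so it is worth measuring on the target machine.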
Do you remember any specifics? For example, the size of the copy, whether it was a NUMA system, or the total bandwidth of your system RAM?