I'd expect memcpy calls to turn into __builtin_memcpy, and then into raw loads/stores for a known small N or a call into compiler-rt for an unknown or large N. If that doesn't happen, patches to do it for your architecture would likely be appreciated.
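To make the small-N versus unknown-N distinction concrete, here is a minimal sketch in C. The function names load_u32 and copy_n are made up for illustration, and what the compiler actually emits depends on the compiler, target, and optimization level, so treat the comments as typical behavior rather than a guarantee.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: the size is a compile-time constant (4 bytes).
 * Optimizing compilers typically lower this memcpy to a single 4-byte
 * load, with no function call at all. */
static uint32_t load_u32(const unsigned char *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical helper: the size is only known at run time.
 * This usually stays a call into a library memcpy routine. */
static void copy_n(void *dst, const void *src, size_t n) {
    memcpy(dst, src, n);
}
```

Comparing the generated assembly for the two functions (for example with -O2 -S) is the easiest way to check what your own toolchain does.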
Calling a function with 'builtin' in the name doesn't mean it's embedded in the CPU itself and runs concurrently, which I think is what they thought might exist.
From my college days, which were quite long ago, and from working with Win32 "BitBlt" requests to the OS, etc.
And also, it would just make sense. If copying entire blocks or memory pages, such as with "BitBlt", is one command, why would I need CPU cycles to actually do it? It would seem like the lowest-hanging fruit to automate in SDRAM.
It just seems like the easiest example of SIMD.
These are contradictory things. SIMD instructions are still regular instructions, not some concurrent system for copying. When you say command, maybe you meant a Windows OS function that was similar to memcpy. An OS function and individual CPU instructions are two different things. There is something called DMA, but I don't know how much that is used for memory-to-memory copies.
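As a rough illustration of the point that a wide copy is still ordinary instructions run by the core, here is a sketch of a chunked copy loop in C. The name copy_chunked is hypothetical, and the assumption that each fixed 16-byte memcpy becomes one vector load or store is typical of optimizing compilers on targets with 128-bit registers, not guaranteed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies n bytes, 16 at a time where possible.
 * Each fixed-size memcpy is a candidate for one vector load or store,
 * but the CPU still executes every iteration of this loop itself;
 * nothing here runs in the background. */
static void copy_chunked(unsigned char *dst, const unsigned char *src, size_t n) {
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        unsigned char tmp[16];
        memcpy(tmp, src + i, 16);   /* load 16 bytes */
        memcpy(dst + i, tmp, 16);   /* store 16 bytes */
    }
    for (; i < n; i++) {            /* leftover tail, one byte at a time */
        dst[i] = src[i];
    }
}
```

The wider registers cut the number of instructions per byte copied, but the work is still issued and retired by the same core that called the function.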
Well, CPUs already transparently handle memory paging, so why not copying?
https://en.wikipedia.org/wiki/Memory_paging
I'm not making a case for anything; I'm just explaining what exists. If copying were going to be done in bulk, it would have to be done asynchronously to some extent, though CPUs already work like that on a small scale due to instruction reordering.
Now it might be less necessary, because CPUs are so fast with contiguous memory that copying to other parts of memory is less of a bottleneck.