There's an error here: “NT instructions are used when there is an overlap between destination and source since destination may be in cache when source is loaded.”
Non-temporal instructions don't have anything to do with correctness. They are for cache management; a non-temporal write is a hint to the cache system that you don't expect to read this data (well, address) back soon, so it shouldn't push out other things in the cache. They may skip the cache entirely, or (more likely) go into just some special small subsection of it reserved for non-temporal writes only.
> Non-temporal instructions don't have anything to do with correctness. They are for cache management; a non-temporal write is a hint to the cache system that you don't expect to read this data (well, address) back soon
I disagree with this statement (taken at face value, I don't necessarily agree with the wording in the OP either). Non-temporal instructions are unordered with respect to normal memory operations, so without a _mm_sfence() after doing your non-temporal writes you're going to get nasty hardware UB.
I had interpreted GP to mean that you don’t slap on NTs for correctness reasons, rather you do it for performance reasons.
That is something I can agree with, but I can't in good faith just let "it's just a hint, they don't have anything to do with correctness" stand unchallenged.
You mean if you access it from a different core? I believe that within the same core, you still have the normal ordering, but indeed, non-temporal writes don't have an implicit write fence after them like x86 stores normally do.
In any case, if so they are potentially _less_ correct; they never help you.
There are no guarantees even if everything operates on the same core. Rust docs have some details: https://doc.rust-lang.org/stable/core/arch/x86_64/fn._mm_sfe...
Do you have any Intel references for it? I mean, Rust has its own memory model and it will not always give the same guarantees as when writing assembler.
https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
Intel's docs are unfortunately spartan, but the guarantees around program order are a hint that this is what it does.
That doc is about visibility _outside the core_ (“globally visible”), so it's not what I'm looking for.
Similarly, if I look up MOVNTDQ in the Intel manuals (https://www.intel.com/content/dam/www/public/us/en/documents...), they say:
“Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with VMOVNTDQ instructions if multiple processors might use different memory types to read/write the destination memory locations”
Note _if multiple processors_.
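For concreteness, here is a minimal C sketch of the pattern that manual passage describes: stream the data out with non-temporal stores, then issue a single SFENCE before anything else is told the data is ready. `stream_copy` is a made-up name for illustration. It assumes a 16-byte-aligned destination and a length that is a multiple of 16, and it falls back to plain `memcpy` when SSE2 isn't available. It doesn't settle the same-core question either way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical helper: copy n bytes using non-temporal stores if possible.
 * Assumption: dst is 16-byte aligned and n is a multiple of 16. */
void stream_copy(void *dst, const void *src, size_t n)
{
    assert(((uintptr_t)dst & 15) == 0 && n % 16 == 0);
#if defined(__SSE2__)
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; i++) {
        __m128i v = _mm_loadu_si128(s + i); /* ordinary (cached) load */
        _mm_stream_si128(d + i, v);         /* non-temporal store (MOVNTDQ) */
    }
    /* Make the preceding stores globally visible before any store that
     * follows, e.g. a "data ready" flag that another processor polls. */
    _mm_sfence();
#else
    memcpy(dst, src, n); /* no SSE2: plain copy, no cache hint */
#endif
}
```

The fence sits once after the loop rather than after every store, since only the stores that come after the whole copy need to be ordered behind it.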
I work on optimizations like this at work, and yes this is largely correct. But do you have a source on this?
> or (more likely) go into just some special small subsection of it reserved for non-temporal writes only.
I hadn’t heard of this before. It looks like older x86 CPUs may have had a dedicated cache.
IIRC they used the write-combining buffer, which was also a cache.
A common trick is to cache it but put it directly in the last or second-to-last bin in your pseudo-LRU order, so it's in cache like normal but gets evicted quickly when you need to cache a new line in the same set. Other solutions can lead to complicated situations when the user was wrong and the line gets immediately reused by normal instructions; this way it's just in cache like normal and gets promoted to most recently used if you do that.
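To make that insertion trick concrete, here is a toy C model of one cache set. It is a sketch under simplifying assumptions, not how any particular CPU does it: it uses exact LRU instead of pseudo-LRU, 4 ways, and non-negative integer tags. The names (`cache_set`, `set_access`) are invented for the example.

```c
#include <assert.h>

#define WAYS 4
#define EMPTY (-1L)

/* One cache set. order[0] is the most recently used line; the last valid
 * entry is the next eviction victim. Valid entries stay packed at the front. */
typedef struct { long order[WAYS]; } cache_set;

void set_init(cache_set *s)
{
    for (int i = 0; i < WAYS; i++) s->order[i] = EMPTY;
}

static int find(const cache_set *s, long tag)
{
    for (int i = 0; i < WAYS; i++)
        if (s->order[i] == tag) return i;
    return -1;
}

/* Move the entry at position pos to the front (most recently used). */
static void promote(cache_set *s, int pos)
{
    long tag = s->order[pos];
    for (int i = pos; i > 0; i--) s->order[i] = s->order[i - 1];
    s->order[0] = tag;
}

/* Access a line; returns the evicted tag, or EMPTY if nothing was evicted.
 * non_temporal only affects where a missing line is inserted. */
long set_access(cache_set *s, long tag, int non_temporal)
{
    assert(tag >= 0);
    int pos = find(s, tag);
    if (pos >= 0) {            /* hit: promoted like any other line */
        promote(s, pos);
        return EMPTY;
    }
    int slot = find(s, EMPTY); /* use a free way if there is one */
    if (slot < 0) slot = WAYS - 1;
    long victim = s->order[slot];
    s->order[slot] = tag;      /* new line starts in the LRU position */
    if (!non_temporal) promote(s, slot); /* normal fills jump to the front */
    return victim;
}
```

After normal accesses to tags 1, 2, 3 and a non-temporal access to 9, the next miss evicts 9 rather than 1. If 9 is touched again first, it moves to the front and 1 becomes the victim, which is the "user was wrong" case handled gracefully.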
A source on what? The Intel optimization manuals explain what MOVNTQ is for. I don't think they explain in detail how it is implemented behind-the-scenes.
See e.g. https://cdrdv2.intel.com/v1/dl/getContent/671200 chapter 13.5.5:
“The non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD) allow data to be moved from the processor’s registers directly into system memory without being also written into the L1, L2, and/or L3 caches. These instructions can be used to prevent cache pollution when operating on data that is going to be modified only once before being stored back into system memory. These instructions operate on data in the general-purpose, MMX, and XMM registers.”
I believe that non-temporal moves basically work similarly to memory marked as write-combining, which is explained in 13.1.1: “Writes to the WC memory type are not cached in the typical sense of the word cached. They are retained in an internal write combining buffer (WC buffer) that is separate from the internal L1, L2, and L3 caches and the store buffer. The WC buffer is not snooped and thus does not provide data coherency. Buffering of writes to WC memory is done to allow software a small window of time to supply more modified data to the WC buffer while remaining as non-intrusive to software as possible. The buffering of writes to WC memory also causes data to be collapsed; that is, multiple writes to the same memory location will leave the last data written in the location and the other writes will be lost.”
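As a rough illustration of the collapsing behaviour in that quote, here is a toy C model of a single write-combining buffer. Everything in it is an assumption made for the sketch: one 64-byte buffer, byte-granular writes, a flush whenever a write targets a different line, and invented names (`wc_buffer`, `wc_write`, `wc_flush`). Real hardware has several buffers and its own flush conditions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE 64

typedef struct {
    uint64_t base;       /* line-aligned address being buffered */
    uint8_t  data[LINE]; /* pending bytes for that line */
    uint64_t mask;       /* bit i set: data[i] holds a pending write */
    unsigned flushes;    /* how many times a line was written out to memory */
} wc_buffer;

void wc_init(wc_buffer *b) { memset(b, 0, sizeof *b); }

/* Write all pending bytes to memory in one go and empty the buffer. */
void wc_flush(wc_buffer *b, uint8_t *mem)
{
    if (b->mask == 0) return;
    for (int i = 0; i < LINE; i++)
        if ((b->mask >> i) & 1) mem[b->base + (uint64_t)i] = b->data[i];
    b->mask = 0;
    b->flushes++;
}

/* Buffer a one-byte write; memory is untouched until the buffer is flushed. */
void wc_write(wc_buffer *b, uint8_t *mem, uint64_t addr, uint8_t value)
{
    uint64_t base = addr & ~(uint64_t)(LINE - 1);
    if (b->mask != 0 && b->base != base)
        wc_flush(b, mem);            /* different line: write the old one out */
    assert(b->mask == 0 || b->base == base);
    b->base = base;
    b->data[addr - base] = value;    /* a later write to the same byte wins */
    b->mask |= (uint64_t)1 << (addr - base);
}
```

Writing 1 and then 2 to the same address, followed by a flush, leaves only the 2 in memory and counts as one write-out; until the flush, memory still holds the old value, which is the sense in which the buffered data is not yet visible to anyone reading memory.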
In the old days (Pentium Pro and the likes), I think there was basically a 4- or 8-way associative cache, and non-temporal loads/stores would go to only one of the ways, so you could only waste 1/4 (or 1/8) of your cache on it at worst.
I see, thanks. I had assumed incorrectly that NT writes operated the same as NT accesses, where there is no dedicated cache.