Hello, I'm trying to use non-temporal stores in my application, and I see a performance decrease when doing so. I'm wondering whether the data is pushed to L3 or to memory. The latter case could explain the result, as the data could have been reused had it stayed in L3 (i.e. smooth non-temporal ;-). Thanks for your feedback! Marc
A non-temporal store should evict the cache line from all levels of cache. While this should improve performance if the cache line is never used again, it can cause serious degradation in the case where the data must be restored to cache. The usual reason for using non-temporal stores is a long sequence of stores which would otherwise evict everything else from cache, leaving only the last sequence of stored cache lines.
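As a sketch of the long-store-sequence case, a copy loop with streaming stores might look like the following (SSE2 intrinsics; the alignment and size requirements in the comment are assumptions for this simplified version, not a statement about what any particular compiler generates):

```c
#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_load_si128, _mm_sfence */
#include <stddef.h>

/* Copy n bytes (n a multiple of 16, both pointers 16-byte aligned)
 * using non-temporal stores, so the destination lines bypass the
 * cache hierarchy instead of evicting other resident data. */
static void stream_copy(void *dst, const void *src, size_t n)
{
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; i++)
        _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));
    _mm_sfence();  /* order the weakly-ordered streaming stores */
}
```

The `_mm_sfence()` at the end matters: streaming stores are weakly ordered, so a fence is needed before another thread (or a subsequent dependent read path) can rely on the data being globally visible.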
Tim, can you point to an example where non-temporal stores actually improve performance? Each time I try to give them one more chance, I observe performance degradation. I tried to use them in exactly those scenarios: a lot of sequential stores of data which is not reused. Perhaps there are some details that I missed...
Non-temporal, in our experience, gives the largest gains for a loop which sets a very large array to a scalar value. Both the automatic fast_memset() and fast_memcpy() function substitutions include a run-time determination of whether non-temporal should be effective. It usually requires an array which spans multiple 4KB pages to see a benefit. Non-temporal is not yet implemented under AVX compilation for some important cases where a significant fraction of the stores in a loop could be non-temporal, giving perhaps a 15% speedup for a loop which completely replaces the contents of L3.
Assume you are on a system with an 8-way set associative L1 cache.
--------- Question 1:
Meaning L1 cache may contain between 0 and 8 cache lines with the same alias (32KB/64KB address modulus). Does the Non-Temporal Store evict:
a) only the potentially single cached cache line, or
b) all cache lines with the same alias (32KB/64KB address modulus)?
Of course, a) would be preferred.
-------------- Question 2:
Is there a way, other than Non-Temporal Store, to indicate that the specified cache line is not likely to be re-referenced shortly, but that you would rather not evict it unless it becomes necessary?
IOW - when the cache system needs to evict a cache line, the line previously specified becomes the most likely candidate for eviction.
Example hypothetical implementation:
0F 18 /01  PREFETCHT0 m8   Move data from m8 closer to processor using T0 hint
0F 18 /81  RETIRET0   m8   Mark data from m8 as evictable from T0 to T1,T2,RAM
0F 18 /02  PREFETCHT1 m8   Move data from m8 closer to processor using T1 hint
0F 18 /82  RETIRET1   m8   Mark data from m8 as evictable from T0,T1 to T2,RAM
0F 18 /03  PREFETCHT2 m8   Move data from m8 closer to processor using T2 hint
0F 18 /83  RETIRET2   m8   Mark data from m8 as evictable from T0,T1,T2 to RAM
...
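No such demotion hint exists today, as far as I know. The closest existing primitive is CLFLUSH, which is stronger than the proposed RETIRE: it evicts the line from every level and writes it back to RAM, rather than merely marking it a preferred eviction candidate. A sketch (the 64-byte line size is an assumption; real code should query it):

```c
#include <emmintrin.h>  /* _mm_clflush, _mm_mfence */
#include <stddef.h>

/* After finishing with buf, forcibly evict its lines from all cache
 * levels. Stronger than the proposed RETIRE hint: the lines go all
 * the way to RAM, not just to an outer cache level. */
static void done_with(const void *buf, size_t n)
{
    const char *p = (const char *)buf;
    for (size_t i = 0; i < n; i += 64)   /* assume 64-byte lines */
        _mm_clflush(p + i);
    _mm_mfence();                        /* order the flushes */
}
```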
--------------- Question 3:
Will PREFETCHT0 be "fixed" to prefetch into L1, as it was on the Pentium III?
Currently I see marginal improvement in migrating data from L2 to L1 by inserting a dummy "mov r15,[rcx+readAhead]", but this eats a register. "PREFETCHT0 [rcx+readAhead]" wouldn't eat a register (but it doesn't currently preload L1). I have not experimented with "test [rcx+readAhead],0" (without a jcc nearby), so I don't know whether this would stall the pipeline.
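For reference, the PREFETCHT0 variant maps to the `_mm_prefetch` intrinsic. A minimal sketch of a read-ahead loop, where the look-ahead distance `AHEAD` is an illustrative tuning parameter and not a recommended value:

```c
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */
#include <stddef.h>

/* Sum an array while software-prefetching ahead. _MM_HINT_T0 requests
 * the closest cache level the hardware honors (which, per the thread,
 * may not be L1 on current parts). */
static long sum_with_prefetch(const long *a, size_t n)
{
    enum { AHEAD = 8 };  /* elements of look-ahead, illustrative only */
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + AHEAD < n)
            _mm_prefetch((const char *)&a[i + AHEAD], _MM_HINT_T0);
        s += a[i];
    }
    return s;
}
```

Unlike the dummy `mov`, the prefetch consumes no architectural register and cannot fault, but it is only a hint and may be dropped.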
I don't recall any availability of 8-way associativity, but I'll leave that to the experts. I agree it's reasonable to expect that a non-temporal store saves cache associativity ways for other data and so avoids potential unexpected evictions. If you're suggesting a scheme to hint that data be evicted rapidly from L1 but reside in L3, for example, it looks interesting, but maybe not worth the trouble to improve on behavior which occurs semi-automatically. As far as I know, current CPUs generally don't support prefetch direct to L1, presumably on account of simulations showing a lack of net benefit some time in the past; I wouldn't expect the situation Jim describes to change.
So are you saying that zeroing 1GB will be faster with non-temporal stores? I just need to be sure this time (I do not want to do one more benchmark that shows plain stores are faster again).
Yes, zero-setting a large region should be faster with non-temporal stores. However, Intel compilers will recognize for() or DO loops which explicitly zero an array and automatically substitute intel_fast_memset(), so it will make no difference whether you apply VECTOR NONTEMPORAL. Likewise, in my experience, VECTOR NONTEMPORAL is ignored under the AVX options (presumably a bug to be fixed some day), so you would require intrinsics for those cases where the compiler doesn't make the fast_memset or fast_memcpy substitutions.
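For the intrinsics route, a streaming zero-fill could be sketched as follows. This is not the compiler's actual intel_fast_memset() substitution, just a hand-rolled equivalent of its non-temporal path, assuming a 16-byte-aligned destination whose size is a multiple of 16:

```c
#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_setzero_si128 */
#include <stddef.h>

/* Zero n bytes (dst 16-byte aligned, n a multiple of 16) with
 * streaming stores, so the zeroed lines do not displace other
 * data from the cache hierarchy. */
static void stream_zero(void *dst, size_t n)
{
    __m128i z = _mm_setzero_si128();
    __m128i *d = (__m128i *)dst;
    for (size_t i = 0; i < n / 16; i++)
        _mm_stream_si128(&d[i], z);
    _mm_sfence();  /* make the weakly-ordered stores visible */
}
```

Per Tim's caveat above, the non-temporal path only pays off for regions large enough to exceed the caches; for small buffers, plain stores win.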
>>If you're suggesting a scheme to suggest that data be evicted rapidly from L1 but reside in L3, for example, it looks interesting, but maybe not worth the trouble to improve on behavior which occurs semi-automatically
Yes, although for "evicted rapidly from L1", I would say "evicted preferably from L1".
The prefetch says "I will be using this shortly, try to bring the data closer". The post-use hint is the inverse of that. It should not cause eviction; rather, it should mark the cache line as the preferred eviction candidate.
Currently, the eviction logic is a combination of frequency of access and MRU. The post-use instruction would mark the cache line as LRU (Least Recently Used, and consequently the most likely eviction candidate).
As for prefetch to L1, I can see three ways of implementing this:
a) Use (fix) prefetcht0 to fetch into L1.
b) After a read to a register, insert a sufficient number of instructions to cover L2 latency before use of the register.
c) Read the cache line into a register but do not create a dependency on the register (do not use the contents of the register).
Route b) can be used in limited circumstances (where you have sufficient registers available to cover the latency).
I can see measurable improvements by using technique c), where the normal processing fetch is performed with SSE (movaps xmm,[m128]) and where early-on L2 to L1 prefetching is performed with an integer instruction (mov reg,[reg+x]). I use this technique, but currently I am targeting different integer registers.
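Technique c) can be sketched in C: a volatile read forces the compiler to emit a plain integer load whose result is discarded, so no later instruction depends on the loaded register; it exists only to pull the line from L2 toward L1 ahead of the real SSE loads. The look-ahead distance `AHEAD` is a hypothetical tuning value, not a recommendation:

```c
#include <emmintrin.h>  /* SSE: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#include <stddef.h>

/* Sum floats, touching a cache line ahead of the SSE loads with a
 * dependency-free integer load (the volatile read is the C analogue
 * of "mov r15,[rcx+readAhead]" whose result is never used). */
static float sum_with_touch(const float *a, size_t n)  /* n multiple of 4 */
{
    enum { AHEAD = 64 };                 /* bytes of look-ahead, illustrative */
    const char *base = (const char *)a;
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4) {
        size_t byte = i * sizeof(float);
        if (byte + AHEAD < n * sizeof(float))
            (void)*(volatile const char *)(base + byte + AHEAD); /* touch only */
        acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));
    }
    float out[4];
    _mm_storeu_ps(out, acc);
    return out[0] + out[1] + out[2] + out[3];
}
```

The assembly version in the post uses a 64-bit `mov`; a one-byte touch is used here only to keep the sketch within array bounds.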
What might aid technique c), and which may actually work now (untested by me), is when, say, you issue
movaps xmm1,[rax+offsetPrefetch]
movaps xmm1,[rax] ; ** same destination register
The preferred action would be that the second instruction vacates the pending fetch dependency on xmm1 produced by the read in the first instruction, but without terminating the fetch from [rax+offsetPrefetch]. IOW the first instruction exhibits a 1-clock throughput overhead (as it does now) but there is no dependency latency (as there is no longer a dependency on the destination register load). I do not know the internal workings of the cache system; this may require retargeting the destination register to a "NULL" register.
Note: different target registers on the L2-to-L1 prefetch.
As to when the data reaches L1, this would depend on where it resided (L2, L3, RAM). Note that, unlike prefetcht0, where the prefetch is canceled if the translation is not in the TLB cache, with this technique the TLB cache gets loaded when necessary, followed by the fetch from memory. It is up to the programmer to ensure the data (and TLBs) are properly positioned so as to avoid unnecessary stalls in execution.
Other areas I have experimented with (with no complete test data to make a formal recommendation):
mov reg64,[m64]
... (other instructions here to cover load latency)
movq xmm,reg64
... (other instructions here to cover latency)
movddup xmm,xmm
... (other instructions here to cover latency)
mulpd xmmOther,xmm
with the above interleaved with other independent work.
While the above technique increases the working set of available xmm registers (as none are used for fetching from L1 to registers), it suffers the overhead of taking 3 clocks of throughput to accomplish what could be done in 1 clock of throughput. When the data is in L1, the single "movddup xmm,[m64]" is as efficient. Additionally, it suffers an additional L1 latency when fetching the adjacent memory location (double).
This works as well, _provided_ you have sufficient xmm registers to buffer the load. When the data is located in L1, you need to insert 4 to 6 non-dependent instructions between the load and the use. This is difficult to do. When the data is in L2, you would need to insert an additional 5 instructions to cover the latency.
Loading into reg64, followed by instruction insertion, then movq and movddup, is advantageous when the preponderance of data is in L2. The reason is that you have an additional 12 or so general-purpose registers to use for in-flight loads, over and above the 16 available xmm registers.