Hi all,
Turning off instruction cache snooping brings a performance increase of roughly 10% for my application. Mike Wade described in his blog
https://software.intel.com/en-us/blogs/2013/05/15/disabling-instruction-cache-snooping-on-xeon-phi
that it is not officially supported. So my question is: which types of applications can't use icache_snoop_off? In my case, the result of my program is still correct with snooping disabled. Does that imply I'm safe using it?
Thanks,
Patrick
As it says in the blog entry, instruction cache snooping serves primarily to support the needs of self-modifying code and cross-modifying code. These are not likely to be common in Xeon Phi applications, but they could easily exist in support routines. One of the most common examples of self-modifying code comes from some "just-in-time" compilers that dynamically adjust the code they produce based on factors that may change during the run (e.g., "live" run-time feedback-directed optimization). There should not be a problem if code is written *once* and then executed (the instruction fetch will miss in the instruction cache and then get the correct data from the L2 cache or L1 data cache), but if *different* instructions are written to the same address, the instruction cache may continue to use the earlier ("stale") version.
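To make the failure mode concrete, here is a minimal user-space sketch (mine, not from the blog) of the pattern that depends on instruction cache snooping: two different instruction sequences are written to the same executable address and each is executed.

```c
/* Sketch: self-modifying code that relies on instruction cache snooping.
 * Two different routines are written to the SAME executable address;
 * with snooping disabled, the second call may still execute the stale
 * first version out of the L1 instruction cache. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    /* x86-64 machine code: mov eax, imm32 ; ret */
    unsigned char returns_1[] = { 0xb8, 1, 0, 0, 0, 0xc3 };
    unsigned char returns_2[] = { 0xb8, 2, 0, 0, 0, 0xc3 };

    unsigned char *page = mmap(NULL, 4096,
                               PROT_READ | PROT_WRITE | PROT_EXEC,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return 1;
    int (*fn)(void) = (int (*)(void))page;

    memcpy(page, returns_1, sizeof returns_1);
    printf("first  version: %d\n", fn());      /* expect 1 */

    memcpy(page, returns_2, sizeof returns_2); /* overwrite in place */
    printf("second version: %d\n", fn());      /* expect 2; with snooping
                                                  off, could still be 1 */
    munmap(page, 4096);
    return 0;
}
```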
Workarounds may not be easy. The Xeon Phi does not support the CLFLUSH instruction (which may or may not work in this case anyway, since the implementation could easily depend on the instruction cache receiving an invalidation message from the L2 cache). It is not clear whether the CLEVICT0 instruction is intended to apply to the instruction cache, but the documentation says that this instruction is treated as a performance hint and could be dropped. It should be possible to force the instruction cache to be flushed with the WBINVD instruction (which flushes all caches), but that instruction can only run in the kernel. It would not be difficult to write a loadable kernel module to provide this function, which already exists in the Linux kernel in several places (including kernel/source/arch/x86/mm/pageattr.c and /kernel/source/arch/x86/lib/cache-smp.c, as well as in a number of the virtualization support functions).
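As a rough illustration (my sketch, untested on a Xeon Phi), a module along these lines would flush all caches once at load time, using the wbinvd_on_all_cpus() helper exported from arch/x86/lib/cache-smp.c:

```c
/* Sketch of a loadable kernel module that executes WBINVD on every CPU.
 * Loading the module ("insmod wbinvd_flush.ko") performs one full cache
 * flush; wbinvd_on_all_cpus() is the exported helper from
 * arch/x86/lib/cache-smp.c mentioned above. */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/smp.h>

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Flush all caches via WBINVD on all CPUs");

static int __init wbinvd_flush_init(void)
{
    wbinvd_on_all_cpus();   /* IPIs each CPU; each executes WBINVD */
    pr_info("wbinvd_flush: all caches flushed\n");
    return 0;
}

static void __exit wbinvd_flush_exit(void)
{
}

module_init(wbinvd_flush_init);
module_exit(wbinvd_flush_exit);
```

One insmod per flush is crude; a more usable module would presumably expose a sysfs or ioctl hook so that user code could trigger a flush after each code rewrite.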
> but they could easily exist in support routines
Do you know of any types of applications or support routines that might be relevant in scientific computing? My program is compiled once and uses some C++ libraries as well as OpenMP, so I guess there should be no problem? Actually, I feel a little bit uncomfortable using icache_snoop_off. Is there a way to check whether I'm using any self-modifying or cross-modifying code (SMC/CMC)?
I don't know any way of checking for self-modifying code. The performance counters don't appear to have a directly applicable event. (This is not surprising -- instruction cache issues are seldom very important in application areas targeted by Xeon Phi.) If you are confident that you can exercise all of the code paths with a simple set of tests, then just checking to see if the code gives the expected answers is certainly the easiest approach. That may not be good enough -- which is likely one of the reasons that this remains an officially unsupported feature.
Shifting the topic slightly:
There are counters that can help indicate whether disabling instruction cache snooping is likely to pay off. One can estimate the total number of cycles spent servicing instruction cache misses as something like:
(CODE_CACHE_MISS - L2_CODE_READ_MISS_CACHE_FILL - L2_CODE_READ_MISS_MEM_FILL) * 24 cycles
+ L2_CODE_READ_MISS_CACHE_FILL * 275 cycles
+ L2_CODE_READ_MISS_MEM_FILL * 300 cycles
The first line should be a reasonable estimate of the stalls incurred when missing the L1 instruction cache and hitting in the L2 cache. The second line is fuzzier -- the latency for cache-to-cache interventions ranges from ~120 cycles to ~380 cycles, depending on the relative core numbers and the physical address (which determines the Distributed Tag Directory to use). If the code is available in multiple alternate L2 caches, I don't know how the Distributed Tag Directory chooses the one to provide the data. "Smart" choices could make the average value significantly lower than 275 cycles if the cache line containing the code is available in many L2 caches around the ring. The third line is a reasonable estimate of the memory latency averaged across all eight memory controllers, though this is also variable, ranging from ~140 cycles to a bit more than 400 cycles. The exact value depends on the relative positions of the core, the Distributed Tag Directory, and the memory controller containing the target cache line.
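For concreteness, the estimate can be coded directly as arithmetic on the three raw event counts; this is just the formula above in C form (the event names and the 24/275/300-cycle averages are the ones quoted above; the parameter names are mine):

```c
/* Sketch: estimate cycles spent servicing L1 instruction cache misses
 * from the three Xeon Phi event counts named above. The 24-, 275-, and
 * 300-cycle values are the rough average latencies quoted in the text. */
static double icache_miss_cycles(unsigned long long code_cache_miss,
                                 unsigned long long l2_fill_from_cache,
                                 unsigned long long l2_fill_from_mem)
{
    /* L1 instruction misses that hit in the local L2 */
    unsigned long long l1_miss_l2_hit =
        code_cache_miss - l2_fill_from_cache - l2_fill_from_mem;

    return  24.0 * (double)l1_miss_l2_hit      /* L1 miss, L2 hit     */
         + 275.0 * (double)l2_fill_from_cache  /* cache-to-cache fill */
         + 300.0 * (double)l2_fill_from_mem;   /* fill from memory    */
}
```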
If you are running one thread per core then most of these instruction cache miss servicing cycles will be stall cycles. With multiple threads, the other threads can keep going, but if there is a synchronization point then all the threads will eventually have to wait for the slowest thread, so these will often show up as stall cycles in that case too.
For OpenMP programs that are doing mostly the same operations on all cores, one would expect the same code to be in all the L2 caches, so L1 instruction cache misses caused by L2 victim invalidations should be satisfied by other L2 caches rather than by memory.
If the L2 victim invalidations are happening on all cores, then you might be able to simply change the location of the text segment.
It seems more likely that the L2 victim invalidations of instruction cache lines will be dominated by conflicts on a few cores. If each core is flushing the entire L2, then the instruction cache invalidations will be infrequent enough that they should not cause trouble. At 165 GB/s across 60 cores, each core is moving 2.75 GB/s, or 2.5 Bytes/cycle. At this rate it will take almost 210,000 cycles to flush the 512 KiB L2 cache, which corresponds to ~750 instruction cache refills at 275 cycles each. If the loop has (for example) 8 cache lines of instructions, then you would expect to see no more than about 1% stall cycles to reload those instructions after each L2 flush. This analysis does suggest that codes with both very large loops (>100 cache lines) and very high L2 flushing rates might see more than 10% performance degradation due to this mechanism.

But I expect the more common case to be one in which each core is focused on a relatively small fraction of the L2 cache, with more than 8 arrays accessed in the "hot" loops. Each core would then flush its fraction of the L2 much more rapidly, and if any of those cores had an alignment between the physical addresses of the "hot" loop code and the "hot" section of the L2 cache, one would expect it to repeatedly evict its instructions and suffer a significant slowdown.

If the problem is limited to a small number of loops, one could work around it by creating multiple copies of the text (code) that are separated by at least 4KiB (so that they map to different physical pages), and then choosing which one to run based on run-time performance measurements, as sketched below. Ugly, but probably supportable even with the limited functionality that Linux provides for controlling virtual-to-physical mappings.
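A sketch of that workaround, with all function names illustrative: several identical copies of the hot kernel, each forced onto its own 4 KiB page via a GCC alignment attribute, with a simple timing-based selection at startup.

```c
/* Sketch of the "multiple text copies" workaround: identical copies of
 * a hot kernel, each aligned to its own 4 KiB page so they occupy
 * different physical pages, selected at run time by timing a
 * calibration call. All names are illustrative. */
#include <time.h>

#define DEFINE_KERNEL(name)                                   \
    __attribute__((aligned(4096), noinline))                  \
    static void name(float *a, const float *b, int n)         \
    {                                                         \
        for (int i = 0; i < n; i++)  /* the "hot" loop */     \
            a[i] += b[i];                                     \
    }

DEFINE_KERNEL(kernel_copy0)
DEFINE_KERNEL(kernel_copy1)
DEFINE_KERNEL(kernel_copy2)

typedef void (*kernel_fn)(float *, const float *, int);

/* Time one calibration call and return elapsed seconds. */
static double time_kernel(kernel_fn fn, float *a, const float *b, int n)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn(a, b, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
}

/* Pick the copy whose code pages do not conflict with the hot data. */
static kernel_fn pick_kernel(float *a, const float *b, int n)
{
    kernel_fn copies[] = { kernel_copy0, kernel_copy1, kernel_copy2 };
    kernel_fn best = copies[0];
    double best_t = time_kernel(copies[0], a, b, n);
    for (int i = 1; i < 3; i++) {
        double t = time_kernel(copies[i], a, b, n);
        if (t < best_t) { best_t = t; best = copies[i]; }
    }
    return best;
}
```

The calibration would need to run on the same cores (and against the same data layout) as the production loop for the timing comparison to be meaningful.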
>If you are confident that you can exercise all of the code paths with a simple set of tests, then just checking to see if the code gives the expected answers is certainly the easiest approach.
I will try that first!
>This analysis does suggest that codes with both very large loops (>100 cache lines) and very high L2 flushing rates might see more than 10% performance degradation due to this mechanism.
I really think this is the case for my code! A few months ago I had access to a Xeon Phi with icache_snoop_off and saw about a 10% increase in performance. However, the code has changed since then and most likely fits these criteria even better. I will get access to such a Xeon Phi again in a few days/weeks and will report my results.
On a 5110P, turning off instruction cache snooping gives a 6% performance increase in my code. The kernel is just one big OpenMP for loop with:
- 192 cache line loads
- 13440 single-precision FLOPs
- no branches
I used 4 threads per core.
