Hi, Intel support staff,
I am using Intel VTune Amplifier 2018 to diagnose nested do-loops in our Intel Fortran application code. These nested do-loops take too much time, and Amplifier shows that the LLC miss count is very large. All tests with Amplifier were run on a 2-socket Skylake Dell computer with L1 cache: 2.5MB, L2 cache: 40MB, L3 cache: 55MB, and DRAM: 191GB.
I found the following information on the Intel Website:
The LLC (last-level cache) is the last, and longest-latency, level in the memory hierarchy before main memory (DRAM). Any memory requests missing here must be serviced by local or remote DRAM, with significant latency. The LLC Miss metric shows a ratio of cycles with outstanding LLC misses to all cycles.
A high number of CPU cycles is being spent waiting for LLC load misses to be serviced. Possible optimizations are to reduce data working set size, improve data access locality, blocking and consuming data in chunks that fit in the LLC, or better exploit hardware prefetchers. Consider using software prefetchers but they can increase latency by interfering with normal loads, and can increase pressure on the memory system.
To reduce the LLC miss count, or eliminate it completely, could you suggest how to revise our nested do-loops, which are written in Intel Fortran? I look forward to your help. Thanks in advance.
If you need our test code, I can upload it on request.
There is a VTune Amplifier Cookbook article that describes a related use case, in which a cache-unfriendly memory access pattern leads to excessive DRAM traffic: https://software.intel.com/en-us/vtune-amplifier-cookbook-frequent-dram-accesses.
You can start with this while Dr. Bandwidth (John McCalpin from TACC) is writing his answer:-)
Thanks & Regards, Dmitry
Thanks for your reply. I have read the article about frequent DRAM accesses.
Please do me a favor and run the attached Fortran code under Amplifier. You will see the huge LLC miss count. What can I do to reduce or eliminate it?
I look forward to hearing from you again.
P.S. If the size 4101001 is replaced with 41001, Amplifier reports an LLC miss count of 0.
This is a big topic, with lots of specific cases and very little in the way of general theory.
The usual approach is to start by looking for "mistakes" (such as the strided access in the inner loop of the VTune Amplifier Cookbook article referenced above) and to fix them first. One common error that is still frequently seen arises from codes that have been ported from C to Fortran (or vice-versa). These should be inspected to make sure the loop order has been reversed so that the inner loop operates over contiguous addresses for as many arrays as possible.
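To make the loop-order point concrete, here is a minimal sketch in C (not taken from the attached code; the function names are made up for illustration). C stores arrays row-major, so the last index is the contiguous one. Fortran is column-major, so for a Fortran array a(i,j) the roles are reversed and the first index i belongs in the inner loop.

```c
#include <assert.h>
#include <stddef.h>

/* Sum a rows x cols matrix stored row-major (C layout) in a flat array.
   The inner loop runs over j, so consecutive iterations touch
   consecutive addresses. */
double sum_contiguous(const double *a, size_t rows, size_t cols) {
    assert(a != NULL);
    double s = 0.0;
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            s += a[i * cols + j];
    return s;
}

/* Same sum with the loops swapped. The inner loop now jumps by
   cols elements per iteration, which is the "ported without reversing
   the loop order" mistake. */
double sum_strided(const double *a, size_t rows, size_t cols) {
    assert(a != NULL);
    double s = 0.0;
    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            s += a[i * cols + j];
    return s;
}
```

Both functions visit the same elements; only the order of the memory accesses differs, and for large arrays that order is what determines the cache behavior.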
If there are no obvious "mistakes", some level of analysis is required. Two values that you need to compute are:
Based on your performance results, I will assume that the memory footprint of the loop nest is larger than the available L2+L3 cache. Your processor looks like a Xeon Gold 6138 or 6148, each of which has 20 MiB of (private) L2 cache and 27.5 MiB of (shared) L3 per socket. A single thread can use up to 28.5 MiB of cache -- its private 1 MiB L2 cache plus all 27.5 MiB of shared L3 cache. Using all 20 cores, a total of 47.5 MiB of L2+L3 cache is available on one chip, and using all 40 cores on both sockets allows access to 95 MiB of L2+L3 cache. So in this case, I am assuming that the total size of the arrays accessed in the loop nests is significantly larger than 95 MiB.
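The cache arithmetic above can be written down as a small helper. This is only a sketch: the per-core L2 and per-socket L3 sizes are the figures quoted in this post for a Xeon Gold 6138/6148, and the function name is made up.

```c
#include <assert.h>

/* Figures quoted in the post (assumed processor: Xeon Gold 6138/6148). */
#define L2_PER_CORE_MIB   1.0   /* private L2 per core */
#define L3_PER_SOCKET_MIB 27.5  /* shared L3 per socket */

/* L2+L3 capacity reachable when `cores` cores are used on each of
   `sockets` sockets. */
double usable_cache_mib(int cores, int sockets) {
    assert(cores >= 1 && sockets >= 1);
    return sockets * (cores * L2_PER_CORE_MIB + L3_PER_SOCKET_MIB);
}
```

With these inputs, usable_cache_mib(1, 1) gives 28.5, usable_cache_mib(20, 1) gives 47.5, and usable_cache_mib(20, 2) gives 95.0, matching the numbers in the text.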
I find it very helpful to analyze the loop nests using a simplified (mental) model of the cache hierarchy to come up with an estimate of how much data I think is supposed to be read from DRAM and written back to DRAM. This value may be the same as the "memory footprint", or it may be quite a bit larger (if the memory access pattern allows for re-use of data from some level of the cache hierarchy). The ratio of the expected memory traffic to the number of memory locations accessed is the average data re-use ratio. If this is near one, there is typically little that can be done, while a large value almost always suggests opportunities for optimization.
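One very crude version of such a mental model, as a sketch only: assume an LRU-like cache and a loop nest that sweeps the whole footprint a fixed number of times. Real caches are not strictly LRU and real loop nests are rarely this regular, so treat the function names and the model itself as assumptions for illustration.

```c
#include <assert.h>

/* Expected DRAM read traffic (MiB) for `sweeps` full passes over a
   footprint, under a simplified LRU-like cache model. */
double expected_dram_read_mib(double footprint_mib, double cache_mib, int sweeps) {
    assert(sweeps >= 1);
    if (footprint_mib <= cache_mib)
        return footprint_mib;          /* loaded once, then re-used from cache */
    return footprint_mib * sweeps;     /* evicted before the next pass needs it */
}

/* Ratio of expected traffic to memory footprint. */
double traffic_per_location(double traffic_mib, double footprint_mib) {
    assert(footprint_mib > 0.0);
    return traffic_mib / footprint_mib;
}
```

For example, 10 sweeps over a hypothetical 64 MiB footprint with 28.5 MiB of cache gives 640 MiB of expected traffic and a ratio of 10, while the same sweeps over 16 MiB give a ratio of 1.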
Some algorithms naturally produce good data re-use in their naive form, but most require some code changes to work well. The most common optimization to improve data re-use is loop blocking (also called loop tiling). An example is provided at https://en.wikipedia.org/wiki/Loop_nest_optimization. For arrays with 3 or more dimensions and repeated accesses to 1-D or 2-D sub-arrays, explicit copying of these sub-arrays to/from contiguous temporary arrays can often provide performance improvements.
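As a small illustration of loop blocking, here is a sketch of a tiled out-of-place matrix transpose in C. It is not derived from the code in this thread; the function name and the tile size of 32 are assumptions, and a good tile size has to be found by measurement.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 32  /* hypothetical tile edge, to be tuned */

/* Out-of-place transpose of an n x n row-major matrix, processed in
   TILE x TILE blocks so that the lines touched in `in` and `out`
   stay in cache while a block is being worked on. */
void transpose_blocked(const double *in, double *out, size_t n) {
    assert(in != out);  /* this sketch does not support in-place use */
    for (size_t ii = 0; ii < n; ii += TILE) {
        for (size_t jj = 0; jj < n; jj += TILE) {
            /* Clamp the block to the matrix edge. */
            size_t i_end = (ii + TILE < n) ? ii + TILE : n;
            size_t j_end = (jj + TILE < n) ? jj + TILE : n;
            for (size_t i = ii; i < i_end; i++)
                for (size_t j = jj; j < j_end; j++)
                    out[j * n + i] = in[i * n + j];
        }
    }
}
```

The two outer loops walk over blocks and the two inner loops walk inside one block. The same pattern carries over to Fortran do-loops, with the index order swapped to match column-major storage.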
It is important to compare expected DRAM traffic to actual measured DRAM traffic. If the actual DRAM traffic is much higher than the expected value, then either my (mental) model is wrong, or some cache conflict is displacing the data in cases where I expect it to remain in the cache for re-use. (In the VTune example referenced above, the DRAM traffic would be elevated because each iteration of the inner loop in the naive code will load a full cache line of the "b" array, but will only use one element from that cache line. By the time the code wants to use the next (contiguous) element of the "b" array, it will have been evicted from the caches and will need to be re-loaded.)
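The cache-line effect in the parenthetical can be estimated with a short helper. This is a sketch under stated assumptions: elements are aligned, no line is re-used before it is evicted, and the function name is made up. The 64-byte line size is that of current Intel processors; the 8-byte element size in the example is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Approximate fraction of each loaded cache line that is actually used
   when stepping through an array with a fixed stride (in elements). */
double line_utilization(size_t elem_bytes, size_t stride_elems, size_t line_bytes) {
    assert(elem_bytes > 0 && elem_bytes <= line_bytes && stride_elems >= 1);
    size_t step_bytes = elem_bytes * stride_elems;
    /* Once the step is at least one line, every access loads a new line. */
    size_t span = (step_bytes < line_bytes) ? step_bytes : line_bytes;
    return (double)elem_bytes / (double)span;
}
```

With 8-byte elements and 64-byte lines, a stride of 1 uses the whole line, while any stride of 8 or more elements uses only one eighth of each line, so the DRAM traffic is roughly eight times what the useful data would need.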
There are only a handful of algorithms for which good bounds on the minimum required memory traffic through a cache hierarchy are known. Most codes are analyzed informally, and tweaked until the performance is "good enough". For codes that don't have "performance mistakes" in implementation, significant reductions in memory traffic often require restructuring at a larger scale. Optimizations such as "time skewing" (e.g., http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.9428) can provide large reductions in memory traffic, but are usually considered too unwieldy and/or burdensome to implement in non-trivial codes.
Thanks. I believe the inner loop in our code fragment already operates over contiguous addresses. How can we revise the nested loops so that the data fits in the cache? Once the data fits in the cache, the LLC miss count becomes ZERO: we have confirmed that with the small size 41001 the LLC miss count is ZERO. For the big data size 4101001, what can I do to make the LLC miss count ZERO as well?
Repeatedly accessing a large amount of data is going to cause cache misses....
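A rough footprint check shows why the two sizes behave so differently. This is a sketch that assumes 8-byte elements and counts a single array; the element type and the number of arrays in the attached code are not stated in this thread, and the function names are made up. The 28.5 MiB limit is the single-thread L2+L3 figure given earlier.

```c
#include <assert.h>
#include <stddef.h>

/* Size of one array in MiB. */
double footprint_mib(size_t n_elems, size_t elem_bytes) {
    assert(elem_bytes > 0);
    return (double)(n_elems * elem_bytes) / (1024.0 * 1024.0);
}

/* 1 if a single array of this size fits in the given cache capacity. */
int fits_in_cache(size_t n_elems, size_t elem_bytes, double cache_mib) {
    return footprint_mib(n_elems, elem_bytes) <= cache_mib;
}
```

Under these assumptions, 41001 elements take about 0.31 MiB and fit easily, while 4101001 elements take about 31.3 MiB, which already exceeds 28.5 MiB before any second array is counted.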
Your "subroutine t1" computes a bunch of independent sums in "tempc3a" and then sums them later. If they are going to be used independently, then you may be able to split up the code so that each value of "tempc3a(itt, ica)" is computed separately immediately before its use.
This code fragment has the same structure as our commercial product code. If we can reduce the LLC miss count in this fragment, the run time of our commercial code can be improved in the same way.
Intel's loop-blocking directive (for example, !DIR$ BLOCK_LOOP FACTOR(1024)), used explicitly in our Fortran code, appears to be ineffective in our tests. We would like to know whether Intel technical support staff can offer a solution that reduces the LLC miss count (ideally to ZERO) by revising the nested do-loops in "subroutine t1". The huge LLC miss count is what makes the nested do-loops in "subroutine t1" take so long.
We look forward to your solution. Your help will be highly appreciated.
Good morning Intel Support Staff,
Could you tell me how to revise the nested loops ('subroutine t1' in our code fragment) to reduce the LLC miss count?
We look forward to your solution to this problem. Thanks in advance.