Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Run to Run variability on an Intel(R) Xeon(R) CPU E3-1240 v3

Animesh_J_
Beginner

Hi,

I have been struggling to get reproducible results from a very simple matrix multiplication code. I see run-to-run variability of more than 10%. I ran the perf stat command to monitor the runs.

 

1024x1024

 Performance counter stats for 'taskset 0x1 MM_binaries/MM_tiled_1024':

   198,505,412,302 cycles                    #    0.000 GHz                     [57.14%]
   283,932,630,578 instructions              #    1.43  insns per cycle         [71.42%]
   111,811,914,939 L1-dcache-loads                                              [71.43%]
     1,091,387,355 L1-dcache-load-misses     #    0.98% of all L1-dcache hits   [71.43%]
     1,088,112,128 r504f2e                                                      [71.44%]
       543,287,591 r50412e                                                      [71.43%]
                 0 LLC-prefetches                                               [57.14%]

      71.139608212 seconds time elapsed


 Performance counter stats for 'taskset 0x1 MM_binaries/MM_tiled_1024':

   236,109,612,641 cycles                    #    0.000 GHz                     [57.14%]
   283,941,246,019 instructions              #    1.20  insns per cycle         [71.43%]
   111,814,428,762 L1-dcache-loads                                              [71.43%]
     1,091,047,791 L1-dcache-load-misses     #    0.98% of all L1-dcache hits   [71.43%]
     1,087,967,282 r504f2e                                                      [71.43%]
       775,796,012 r50412e                                                      [71.43%]
                 0 LLC-prefetches                                               [57.14%]

      84.624840119 seconds time elapsed


 Performance counter stats for 'taskset 0x1 MM_binaries/MM_tiled_1024':

   234,859,607,740 cycles                    #    0.000 GHz                     [57.14%]
   283,930,872,557 instructions              #    1.21  insns per cycle         [71.43%]
   111,817,249,748 L1-dcache-loads                                              [71.43%]
     1,091,064,977 L1-dcache-load-misses     #    0.98% of all L1-dcache hits   [71.43%]
     1,088,135,424 r504f2e                                                      [71.43%]
       769,755,541 r50412e                                                      [71.43%]
                 0 LLC-prefetches                                               [57.14%]

      84.179363339 seconds time elapsed


 Performance counter stats for 'taskset 0x1 MM_binaries/MM_tiled_1024':

   236,559,411,636 cycles                    #    0.000 GHz                     [57.14%]
   283,950,536,574 instructions              #    1.20  insns per cycle         [71.43%]
   111,814,194,724 L1-dcache-loads                                              [71.43%]
     1,090,901,732 L1-dcache-load-misses     #    0.98% of all L1-dcache hits   [71.43%]
     1,087,930,888 r504f2e                                                      [71.43%]
       777,946,110 r50412e                                                      [71.43%]
                 0 LLC-prefetches                                               [57.14%]

      84.782145759 seconds time elapsed


 Performance counter stats for 'taskset 0x1 MM_binaries/MM_tiled_1024':

   198,639,656,963 cycles                    #    0.000 GHz                     [57.14%]
   283,910,621,526 instructions              #    1.43  insns per cycle         [71.43%]
   111,812,205,161 L1-dcache-loads                                              [71.43%]
     1,091,489,249 L1-dcache-load-misses     #    0.98% of all L1-dcache hits   [71.43%]
     1,087,716,621 r504f2e                                                      [71.43%]
       543,215,397 r50412e                                                      [71.43%]
                 0 LLC-prefetches                                               [57.13%]

      71.188993135 seconds time elapsed

As you can see, two runs have a runtime of about 71 seconds whereas the other three runs take about 84 seconds. Apart from that, two lines from each run need explanation:

     1,087,716,621 r504f2e                                                      [71.43%]
       543,215,397 r50412e                                                      [71.43%]

The first one is the number of LLC reads and the second one is the number of LLC misses.

Since LLC misses are proportional to runtime here, I checked whether aligning the data to a page boundary helps. It did not; I was still seeing variability. I have used cpufreq-utils to set the core frequency to a constant, and I am using taskset to pin the application to a core, with no other application running, to avoid interference.
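
For reference, the page-boundary alignment I tried looks roughly like this (just a sketch; the double element type is a placeholder for whatever the real code uses):

#include <stdlib.h>

#define N 1024

int main(void)
{
    double *A, *B, *C;

    /* Align each matrix to a 4 KiB page boundary; posix_memalign
       returns 0 on success. */
    if (posix_memalign((void **)&A, 4096, N * N * sizeof(double)) ||
        posix_memalign((void **)&B, 4096, N * N * sizeof(double)) ||
        posix_memalign((void **)&C, 4096, N * N * sizeof(double)))
        return 1;

    /* ... tiled matrix multiply on A, B and C ... */

    free(A); free(B); free(C);
    return 0;
}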

Can anyone suggest why there is run-to-run variability when the data is exactly the same? Why is there so much variance?

Patrick_F_Intel1
Employee

Hello Animesh,

Have you figured this out yet? I assume you've ruled out something else running during the slow cases. My guess is that the slowness is caused by the increased LLC misses. I've seen cases like this where the relative alignment of some of the arrays leads to cache evictions.

In particular, your L1 has 8 ways, each 4 KB in size, so it can hold at most 8 lines whose addresses are the same modulo 4 KB. The replacement policy is pseudo-LRU, so sometimes not even 8 such lines can be held in the L1 without evicting a cache line.

I would check the relative alignment (modulo 4KB) of whatever arrays you are using. Stack locations also come into play here: if something is being pushed to or popped from the stack, the stack reads/writes may be causing evictions. As a starting point, I would use something like VTune to tell me where in your code the evictions are coming from. You might have to malloc one large array and then subdivide it into smaller arrays with the proper alignment, as in the sketch below.
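
Something along these lines is what I have in mind (only a sketch; I'm assuming 1024x1024 matrices of doubles, and the 128-byte stagger is just an arbitrary starting point):

#include <stdlib.h>

#define N        1024
#define MAT      (N * N * sizeof(double))
#define PAGE     4096
#define STAGGER  128   /* shift each matrix by 2 cache lines modulo 4 KiB */

int main(void)
{
    /* One large allocation, carved into three matrices whose starting
       addresses differ modulo 4 KiB.  Matrices that start at the same
       offset modulo 4 KiB map their corresponding elements to the same
       L1 sets and compete for the 8 ways. */
    size_t slot = ((MAT + PAGE - 1) / PAGE) * PAGE;  /* round up to a whole page */
    char *buf;

    if (posix_memalign((void **)&buf, PAGE, 3 * slot + 3 * STAGGER))
        return 1;

    double *A = (double *)(buf + 0 * slot + 0 * STAGGER);
    double *B = (double *)(buf + 1 * slot + 1 * STAGGER);
    double *C = (double *)(buf + 2 * slot + 2 * STAGGER);

    /* ... run the tiled multiply on A, B and C ... */
    (void)A; (void)B; (void)C;

    free(buf);
    return 0;
}

Varying STAGGER (0, 64, 128, ...) and comparing the counters should show whether the relative alignment is what is moving the LLC miss count around.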

Hope this helps,

Pat

McCalpinJohn
Honored Contributor III

A common source of run-to-run variability is random page coloring when the data being used requires most of the cache.  The easiest way to identify this as the source of the variability is to run with large pages.  (I usually use mmap() with the MAP_HUGETLB option to get large pages, but you typically need root access to enable large pages.)
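
A minimal version of the large-page allocation looks something like this (a sketch only; it assumes 2 MiB huge pages and that the huge page pool has already been reserved, e.g. via /proc/sys/vm/nr_hugepages):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

/* Allocate an anonymous buffer backed by huge pages.  The length must be
   a multiple of the huge page size, and mmap() fails if no huge pages
   are available. */
static void *alloc_huge(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

int main(void)
{
    size_t bytes = 8UL << 20;            /* e.g. one 1024x1024 matrix of doubles */
    double *A = alloc_huge(bytes);
    if (A == NULL) { perror("mmap"); return 1; }
    /* ... use A ..., then release it: */
    munmap(A, bytes);
    return 0;
}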

What processor model is this running on?

I did some experiments looking at L3 miss rates on a Xeon E5-2680 processor as a function of array size when using the default 4KiB page size.  Miss rates started exceeding zero when I tried to use more than 1/2 of the 20 MiB L3 cache, and increased to ~20% when attempting to use 80% of the LLC.  If your code is a simple matrix multiplication that is not blocked, then this could easily account for a 10% performance difference.

 
