Software Tuning, Performance Optimization & Platform Monitoring

How does the hardware prefetcher change load and store buffer behavior in the processor pipeline?

Zhu_G_
Beginner

Hi, Community! I am experimenting with a dual-socket Xeon E5620 server. I am measuring with perf using the events RESOURCE_STALLS.LOAD and RESOURCE_STALLS.STORE, described in the SDM (Chapter 19.7, page 2699).

I first turned off the hardware prefetchers following the instructions at: https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors

The command I used is: wrmsr -a 0x1a4 0xf. Then I ran perf as: perf stat -e ra202,ra208 ./fft-m26

The result is:

               201 ra202
        10,215,615 ra208

      29.485852761 seconds time elapsed

Then I enabled the hardware prefetchers with: wrmsr -a 0x1a4 0x0, and ran perf again with: perf stat -e ra202,ra208 ./fft-m26
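
(The -a option applies the write to every core. Below is a minimal Python sketch of the same operation through the msr driver's /dev/cpu/*/msr interface, assuming the msr kernel module is loaded and root permissions; it is only an illustration of what wrmsr -a does, not the tool I actually used.)

# Sketch only: what "wrmsr -a 0x1a4 <value>" does, via /dev/cpu/*/msr
# (requires the msr kernel module and root; error handling omitted).
import glob, os, struct

PREFETCH_CONTROL_MSR = 0x1A4   # the prefetcher-control MSR toggled above

def write_msr_all_cores(msr, value):
    # Write the 8-byte MSR value on every core.
    for path in sorted(glob.glob("/dev/cpu/[0-9]*/msr")):
        fd = os.open(path, os.O_WRONLY)
        try:
            os.pwrite(fd, struct.pack("<Q", value), msr)
        finally:
            os.close(fd)

def read_msr_all_cores(msr):
    # Read the MSR back on every core, e.g. to verify the setting.
    values = {}
    for path in sorted(glob.glob("/dev/cpu/[0-9]*/msr")):
        fd = os.open(path, os.O_RDONLY)
        try:
            values[path] = struct.unpack("<Q", os.pread(fd, 8, msr))[0]
        finally:
            os.close(fd)
    return values

# write_msr_all_cores(PREFETCH_CONTROL_MSR, 0xF)  # disable the prefetchers
# write_msr_all_cores(PREFETCH_CONTROL_MSR, 0x0)  # re-enable the prefetchers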

As expected, I got better performance.

The result is:

             2,206 ra202
        18,970,999 ra208

      24.963877684 seconds time elapsed

But I observed that the pipeline seems to stall more on the load buffer and the store buffer. Why is this?

McCalpinJohn
Honored Contributor III

1. You might have omitted this to keep the posting short, but when disabling the hardware prefetchers, be sure that you run the wrmsr command on each core, or make sure that the job execution is bound to the core(s) where you ran the wrmsr command.

2. The first step in any analysis is to review the relative sizes of the various values to see if they are of the correct order of magnitude to be relevant.

  • Your execution time difference is about 4.5 seconds, which is almost 11 Billion cycles at 2.4 GHz.
  • The difference in Event/Umask 0xA2/0x02 is only 2005, about 1 part in 5 million.
  • The difference in Event/Umask 0xA2/0x08 is just under 9 million, which is less than 0.1% of the change in cycle count.

Simply comparing these sizes should make it clear that these counters are not counting something that could be relevant to the change in performance.
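
A quick back-of-the-envelope version of that comparison, as a Python sketch using only the numbers quoted in the original post and the nominal 2.4 GHz frequency:

elapsed_off = 29.485852761          # prefetchers disabled
elapsed_on  = 24.963877684          # prefetchers enabled
cycle_diff  = (elapsed_off - elapsed_on) * 2.4e9   # ~10.9e9 cycles at 2.4 GHz

ra202_diff = 2_206 - 201                    # 2,005 extra stalls
ra208_diff = 18_970_999 - 10_215_615        # ~8.8 million extra stalls

print(cycle_diff)                    # ~1.09e10 cycles
print(cycle_diff / ra202_diff)       # ~5.4e6  -> about 1 part in 5 million
print(ra208_diff / cycle_diff)       # ~0.0008 -> less than 0.1% of the cycle difference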

Zhu_G_
Beginner

Thank you, Dr. Bandwidth! I manage the hardware prefetchers using the -a option of msr-tools, so they are disabled on every core. Before disabling the hardware prefetchers, I experimented with affinity as follows:

 Performance counter stats for 'numactl -i 0 -C 8 ./fft -m26':

             4,139 ra202                                                       
        11,814,108 ra208                                                       
       350,359,455 LLC-load-misses                                             
       252,274,319 LLC-store-misses                                            

      25.042315575 seconds time elapsed

Then I disabled the hardware prefetchers, and here are the results:

 Performance counter stats for 'numactl -i 0 -C 8 ./fft -m26':

             1,285 ra202                                                       
         9,949,817 ra208                                                       
       592,568,257 LLC-load-misses                                             
       252,466,386 LLC-store-misses                                            

      29.526919502 seconds time elapsed

I got fewer load-buffer stalls in the pipeline but more LLC load misses. It seems that the LLC load misses dominate the performance.

Now I wonder how the events ra202 and ra208 reflect memory access behavior. Do they reflect the delay of all memory accesses (including cache and memory) to the processor pipeline?

I also wonder whether I should calculate cycle time using 2.4 GHz or 1.6 GHz. /proc/cpuinfo reports:

model name    : Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz
stepping    : 2
cpu MHz        : 1600.000

Definitions from the SDM:

RESOURCE_STALLS.ANY : ra202

RESOURCE_STALLS.STORE : ra208


McCalpinJohn
Honored Contributor III

The "cpu MHz" field from /proc/cpuinfo often gives the wrong value if automatic p-state control is enabled.  The 1600 MHz reported is probably the frequency that the processor runs at when the OS has no work for it, while 2400 MHz is the nominal frequency.   If "Turbo Mode" is enabled, then the actual frequency when running the job may be higher.    You can compute the actual average frequency using the events CPU_CLK_UNHALTED.THREAD_P and CPU_CLK_UNHALTED.REF_P.

The increase in LLC store misses here is small, so it is not a major contributor.   This may be because of the specific memory reference patterns of the code or it may be because hardware prefetchers are typically less aggressive at prefetching for stores.

The increase in LLC load misses is big enough to be important.  The difference in execution time is 4.48 seconds, or 10.8 billion cycles.  The difference in LLC load miss counts is 242 million.   The ratio is about 44 additional cpu cycles per LLC miss, which is a completely reasonable number.   L3 latency on the E5620 should be in the range of 35-40 cycles, or about 16 ns.  Memory latency on the E5620 should be in the range of 65-70 ns, giving a difference of 50 ns, or 120 cycles.   This is (in rough terms) the *largest* amount that an extra LLC miss should cost (excluding special cases that I am not going to talk about here).    It is easy to get *smaller* effective costs for these LLC misses if there are multiple LLC misses happening at the same time.   In this case an observed increase of 44 cycles per LLC miss is consistent with having about 3 concurrent LLC misses outstanding at all times.   It is easy for a hardware prefetcher to generate 3 more concurrent fetches than the core creates on its own, so the observed increase in execution time when the hardware prefetchers are disabled is well within the expected range of values.
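
The same arithmetic as a short sketch, using the counts from the two runs above and the rough 120-cycle memory-vs-L3 latency gap estimated in the paragraph:

extra_seconds = 29.526919502 - 25.042315575          # ~4.48 s slower without prefetch
extra_cycles  = extra_seconds * 2.4e9                 # ~10.8e9 cycles at 2.4 GHz
extra_misses  = 592_568_257 - 350_359_455             # ~242 million extra LLC load misses

cycles_per_miss = extra_cycles / extra_misses         # ~44 cycles per extra miss

latency_gap = 120                                     # memory vs. L3 latency gap, in cycles
concurrency = latency_gap / cycles_per_miss           # ~2.7 misses outstanding on average

print(round(cycles_per_miss, 1), round(concurrency, 1))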
