topic Re: The maximum value of the MEM_UOPS_RETIRED: ALL_LOADS/STORE event per second. Haswell Xeon E5 269 in Software Tuning, Performance Optimization & Platform Monitoring

The maximum value of the MEM_UOPS_RETIRED: ALL_LOADS/STORE event per second. Haswell Xeon E5 2697 v3

gadel_zakirov — Thu, 08 Apr 2021 19:56:39 GMT

In an earlier topic (https://community.intel.com/t5/Software-Tuning-Performance/The-maximum-value-of-the-MEM-UOPS-RETIRED-ALL-LOADS-event-per/mp/1236536#M7751) I asked about the theoretically achievable rates of the MEM_UOPS_RETIRED:ALL_LOADS and MEM_UOPS_RETIRED:ALL_STORES events per second.

After running synthetic tests with sequential and random memory access, we got the best results on 12 threads (cores) with sequential access: 42.95 billion loads per second and 26.2 billion stores per second. These values are almost 50% of the theoretical maximum. For a single core, the values are likewise about half of the theoretical ones. No vector instructions are used, the processor frequency is 3.1 GHz on all cores, each core has its own data vector, and all data fits in the L1 cache. PAPI is used to read the counter values.

Why do we get almost exactly half of the expected values? The assembly code produced by the compiler uses movq and movl instructions for the different data types. Are there any other restrictions that lower the theoretical limit?
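
For reference, the ceiling can be worked out from the figures in this post. The sketch below assumes a Haswell core can retire at most two loads and one store per cycle (these per-cycle limits are an assumption of the sketch, not something stated in the post), and the helper names are hypothetical. With 12 cores at 3.1 GHz, that assumption gives a peak of 74.4 billion loads and 37.2 billion stores per second.

```c
/* Peak events per second = cores * core clock (Hz) * events per core per cycle.
 * The per-cycle limits passed in are assumptions (2 loads, 1 store on Haswell). */
static double peak_per_second(int cores, double freq_hz, double per_cycle)
{
    return cores * freq_hz * per_cycle;
}

/* Convert a measured aggregate rate back to events per core per cycle,
 * which is easier to compare against the per-cycle limits. */
static double per_core_per_cycle(double measured_per_s, int cores, double freq_hz)
{
    return measured_per_s / (cores * freq_hz);
}
```

Calling `per_core_per_cycle(42.95e9, 12, 3.1e9)` and `per_core_per_cycle(26.2e9, 12, 3.1e9)` expresses the measured load and store rates in the same units as the assumed limits.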

Re: The maximum value of the MEM_UOPS_RETIRED: ALL_LOADS/STORE event per second. Haswell Xeon E5 269

McCalpinJohn — Wed, 14 Apr 2021 22:00:49 GMT

Plenty of people have been able to reproduce the two loads per cycle on Haswell, so the hardware is certainly capable if the user sets everything up correctly.

Problems can arise in many areas -- some of which are reasonably well-known, but others can be more obscure.

At a minimum, I would recommend reviewing:

- Process pinning
- Measured time vs. timer overhead
  - PAPI's overhead may be much higher than you expect -- I see an average of almost 2000 cycles on an Intel Cascade Lake system.
- Load alignment
  - My experiments on Xeon E5 v3 show that the core can perform two loads per cycle of any size or alignment as long as neither crosses a cache line boundary. If any load crosses a cache line boundary, throughput drops to 1 load per cycle (from the L1).
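
The cache-line condition described above can be checked for any address and access size. This is a minimal sketch, assuming 64-byte cache lines; `crosses_cache_line` is a hypothetical helper, not part of PAPI or any other library.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u /* assumed L1 line size */

/* Returns true if an access of `size` bytes starting at `addr`
 * touches two cache lines, i.e. it straddles a 64-byte boundary. */
static bool crosses_cache_line(const void *addr, size_t size)
{
    uintptr_t offset = (uintptr_t)addr % CACHE_LINE_BYTES;
    return size > 0 && offset + size > CACHE_LINE_BYTES;
}
```

For example, an 8-byte load (such as movq) starting at offset 60 within a line crosses the boundary, while a 4-byte load (such as movl) at the same offset does not.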

Some notes at https://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timing-short-code-sections-on-intel-processors/ may be helpful.