Software Tuning, Performance Optimization & Platform Monitoring

The maximum value of the MEM_UOPS_RETIRED:ALL_LOADS / ALL_STORES events per second on Haswell Xeon E5-2697 v3

gadel_zakirov
Beginner

In an earlier topic (https://community.intel.com/t5/Software-Tuning-Performance/The-maximum-value-of-the-MEM-UOPS-RETIRED-ALL-LOADS-event-per/mp/1236536#M7751) I asked about the theoretically achievable values of the MEM_UOPS_RETIRED:ALL_LOADS and MEM_UOPS_RETIRED:ALL_STORES events per second.

After running synthetic tests with sequential and random memory access, we got the best results on 12 threads (one per core) with sequential access: 42.95 billion loads per second and 26.2 billion stores per second. These values are almost 50% of the theoretical maximum. On a single core the values are likewise about half of the theoretical ones. No vector instructions are used, the processor frequency is 3.1 GHz on all cores, each core has its own data vector, and all of the data fits in the L1 cache. PAPI is used to read the counter values.
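For reference, the comparison against a theoretical bound can be sketched as below. The frequency, core count, and measured rates are the ones quoted above. The per-core limits of 2 loads and 1 store per cycle are assumptions about the Haswell core, and the function and variable names are purely illustrative. The resulting fractions depend entirely on those assumed limits and on the core count used.

```python
# Rough upper bound on retired load/store uops per second.
# ASSUMPTION: each core can retire at most 2 loads and 1 store per cycle.

def peak_ops_per_second(ops_per_cycle, freq_hz, cores):
    """Upper bound on memory uops retired per second across all cores."""
    return ops_per_cycle * freq_hz * cores

FREQ_HZ = 3.1e9        # core frequency quoted in the post
CORES = 12             # number of threads/cores used in the test
LOADS_PER_CYCLE = 2    # assumed per-core load limit
STORES_PER_CYCLE = 1   # assumed per-core store limit

MEASURED_LOADS = 42.95e9   # measured loads per second (from the post)
MEASURED_STORES = 26.2e9   # measured stores per second (from the post)

peak_loads = peak_ops_per_second(LOADS_PER_CYCLE, FREQ_HZ, CORES)
peak_stores = peak_ops_per_second(STORES_PER_CYCLE, FREQ_HZ, CORES)

load_fraction = MEASURED_LOADS / peak_loads
store_fraction = MEASURED_STORES / peak_stores

if __name__ == "__main__":
    print(f"loads:  bound {peak_loads / 1e9:.1f} G/s, measured fraction {load_fraction:.2f}")
    print(f"stores: bound {peak_stores / 1e9:.1f} G/s, measured fraction {store_fraction:.2f}")
```

Changing `CORES` or the per-cycle limits shows how sensitive the "fraction of peak" figure is to the assumed bound.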

Why are the measured values exactly half of the theoretical ones? The assembly code produced by the compiler uses movq and movl, depending on the data type. Are there any other restrictions that affect the theoretical limit?

1 Reply
McCalpinJohn
Honored Contributor III

Plenty of people have been able to reproduce the two loads per cycle on Haswell, so the hardware is certainly capable if the user sets everything up correctly.

Problems can arise in many areas -- some of which are reasonably well-known, but others can be more obscure.

At a minimum, I would recommend reviewing:

  • Process pinning
  • Measured time vs. timer overhead
    • PAPI's overhead may be much higher than you expect -- I see an average of almost 2000 cycles on an Intel Cascade Lake system.
  • Load alignment
    • My experiments on Xeon E5 v3 show that the core can perform two loads per cycle of any size or alignment as long as neither crosses a cache line boundary.  If any load crosses a cache line boundary, throughput drops to 1 load per cycle (from the L1).
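The alignment condition in the last item can be checked with simple address arithmetic. The sketch below assumes a 64-byte cache line. The helper names are hypothetical, and the code only counts which accesses would straddle a line boundary; it does not measure throughput.

```python
# Count loads that cross a cache line boundary for a strided access pattern.
# ASSUMPTION: 64-byte cache lines.

CACHE_LINE_BYTES = 64

def crosses_cache_line(address, size, line_bytes=CACHE_LINE_BYTES):
    """Return True if a load of `size` bytes at `address` spans two cache lines."""
    first_line = address // line_bytes
    last_line = (address + size - 1) // line_bytes
    return first_line != last_line

def count_line_crossing_loads(base, stride, size, count):
    """Number of loads, out of `count` strided loads, that cross a line boundary."""
    return sum(
        crosses_cache_line(base + i * stride, size)
        for i in range(count)
    )

if __name__ == "__main__":
    # 8-byte loads from an 8-byte-aligned base never cross a line.
    print(count_line_crossing_loads(base=0, stride=8, size=8, count=16))
    # The same loads from a base offset by 4 bytes cross a line periodically.
    print(count_line_crossing_loads(base=4, stride=8, size=8, count=16))
```

In a real test the base address would come from the actual buffer, for example by printing the pointer in the benchmark, so that the check reflects where the data really lives.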

Some notes at https://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timing-short-code-sections-on-intel-processors/ may be helpful.
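The first two checklist items (pinning and timer overhead) can be illustrated with a minimal sketch. It is not taken from the notes linked above. It assumes a Linux system for the pinning call, uses `time.perf_counter_ns` as a stand-in for whatever counter interface (such as PAPI) is actually being read, and all function names are hypothetical. An actual measurement of L1 load throughput would normally be written in C or assembly; the point here is only the structure: pin first, measure the cost of the timer itself, then compare it with the measured interval.

```python
import os
import time

def pin_to_cpu(cpu=None):
    """Pin the calling process to a single CPU.

    Returns the resulting CPU set, or None where the API is unavailable
    (os.sched_setaffinity exists only on some platforms, e.g. Linux).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    if cpu is None:
        cpu = min(os.sched_getaffinity(0))  # pick one CPU we may run on
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)

def timer_overhead_ns(samples=10000):
    """Average cost in ns of one timer read, from back-to-back reads."""
    total = 0
    for _ in range(samples):
        t0 = time.perf_counter_ns()
        t1 = time.perf_counter_ns()
        total += t1 - t0
    return total / samples

def timed_ns(func, overhead_ns):
    """Run func once and return the elapsed ns with the timer overhead removed."""
    t0 = time.perf_counter_ns()
    func()
    t1 = time.perf_counter_ns()
    return max(0.0, (t1 - t0) - overhead_ns)

if __name__ == "__main__":
    print("pinned to:", pin_to_cpu())
    overhead = timer_overhead_ns()
    elapsed = timed_ns(lambda: sum(range(100000)), overhead)
    print(f"timer overhead ~{overhead:.0f} ns, measured region ~{elapsed:.0f} ns")
```

If the measured region is not much longer than the timer overhead, the counter deltas mostly describe the measurement code rather than the loop under test, so the timed section should be made longer.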
