In an earlier topic (https://community.intel.com/t5/Software-Tuning-Performance/The-maximum-value-of-the-MEM-UOPS-RETIRED-ALL-LOADS-event-per/mp/1236536#M7751) I asked about the theoretically achievable rates, per second, of the MEM_UOPS_RETIRED:ALL_LOADS and MEM_UOPS_RETIRED:ALL_STORES events.
After running synthetic tests with sequential and random memory access, we obtained the highest rates on 12 threads (cores) with sequential access: 42.95 billion loads per second and 26.2 billion stores per second. These values are almost 50% of the theoretical maximum. On a single core, the values are also half the theoretical ones. No vector instructions are used, the processor runs at 3.1 GHz on all cores, each core works on its own data array, and all data resides in the L1 cache. The counter values are measured with PAPI.
Why are the measured values exactly half the theoretical ones? The assembly code produced by the compiler uses movq and movl instructions for the different data types. Are there any other restrictions that lower the theoretical limit?
Plenty of people have been able to reproduce the two loads per cycle on Haswell, so the hardware is certainly capable if the user sets everything up correctly.
Problems can arise in many areas -- some of which are reasonably well-known, but others can be more obscure.
At a minimum, I would recommend reviewing:
- process pinning
- measured time vs timer overhead
  - PAPI's overhead may be much higher than you expect -- I see an average of almost 2000 cycles on an Intel Cascade Lake system.
- load alignment
  - My experiments on Xeon E5 v3 show that the core can perform two loads per cycle of any size or alignment as long as neither crosses a cache line boundary. If any load crosses a cache line boundary, throughput drops to 1 load per cycle (from the L1 cache).
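
To make the load alignment point concrete, here is a minimal sketch in C of how one might check whether the loads in a test kernel cross cache line boundaries. It assumes 64-byte cache lines and loads of at least 1 byte; the function names `crosses_cache_line` and `count_crossing_loads` are made up for illustration and do not come from any library.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. */
#define CACHE_LINE_BYTES 64u

/* Returns 1 if a load of `size` bytes (size >= 1) starting at `addr`
 * touches two different cache lines, 0 otherwise. */
static int crosses_cache_line(uintptr_t addr, size_t size)
{
    uintptr_t first_line = addr / CACHE_LINE_BYTES;
    uintptr_t last_line = (addr + size - 1) / CACHE_LINE_BYTES;
    return first_line != last_line;
}

/* Counts how many of `n_loads` loads of `size` bytes, issued at
 * base, base + stride, base + 2*stride, ..., cross a line boundary. */
static size_t count_crossing_loads(uintptr_t base, size_t n_loads,
                                   size_t stride, size_t size)
{
    size_t crossings = 0;
    for (size_t i = 0; i < n_loads; i++)
        crossings += (size_t)crosses_cache_line(base + i * stride, size);
    return crossings;
}
```

Passing the real buffer address, cast as `(uintptr_t)buf`, together with the element size and stride of the test loop shows whether any of the loads straddle a line. For naturally aligned 4-byte and 8-byte elements the count should be zero.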
Some notes at https://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timing-short-code-sections-on-intel-processors/ may be helpful.