In an earlier topic (https://community.intel.com/t5/Software-Tuning-Performance/The-maximum-value-of-the-MEM-UOPS-RETIRED...) I asked about the theoretically achievable rates, in events per second, of the MEM_UOPS_RETIRED:ALL_LOADS and MEM_UOPS_RETIRED:ALL_STORES events.
After running synthetic tests with sequential and random memory access, we got the highest results on 12 threads (cores) with sequential access: 42.95 billion loads per second and 26.2 billion stores per second. These values are almost 50% of the theoretical maximum. On a single core, the values are likewise half the theoretical ones. No vector instructions are used, the processor frequency is 3.1 GHz on all cores, each core has its own data vector, and all data resides in the L1 cache. PAPI is used to read the counter values.
Why are the measured values exactly half the theoretical ones? The assembly code that the compiler produces uses movq and movl instructions, depending on the data type. Are there any other restrictions that affect the theoretical limit?
Plenty of people have been able to reproduce the two loads per cycle on Haswell, so the hardware is certainly capable if the user sets everything up correctly.
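For reference, the peak implied by a per-cycle limit can be worked out with simple arithmetic. The sketch below is only a back-of-the-envelope calculation using the figures from the question (12 cores, 3.1 GHz) and the two loads per cycle mentioned above. The function names are hypothetical, and the limit of one store per cycle used in the note afterwards is an assumption that does not come from this thread.

```c
#include <stdio.h>

/* Peak event rate in billions per second: cores * GHz * events per cycle.
 * The per-cycle limit is an input; 2 loads/cycle is the Haswell figure
 * quoted above, 1 store/cycle is an assumption. */
double peak_gevents_per_sec(int cores, double ghz, double per_cycle)
{
    return cores * ghz * per_cycle;
}

/* Fraction of the peak reached by a measured rate (same units for both). */
double fraction_of_peak(double measured, double peak)
{
    return measured / peak;
}

/* Hypothetical helper: print measured rate, peak, and percentage of peak. */
void report(const char *name, double measured, double peak)
{
    printf("%-6s measured %.2f of peak %.2f G/s (%.1f%%)\n",
           name, measured, peak, 100.0 * fraction_of_peak(measured, peak));
}
```

Calling `report("LOAD", 42.95, peak_gevents_per_sec(12, 3.1, 2.0))` compares the measured load rate against a peak of 74.4 billion loads per second, which puts it at roughly 58% of peak. Under the assumed limit of one store per cycle, the store peak would be 37.2 billion per second.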
Problems can arise in many areas -- some of which are reasonably well-known, but others can be more obscure.
At a minimum, I would recommend reviewing:
Some notes at https://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timing-short-code-sections-on-intel-processo... may be helpful.