I am trying to estimate the number of execution cycles of a simple function on a Skylake i7.
I measured the number of memory accesses during the function execution using the MEM_INST_RETIRED.ALL_LOADS and MEM_INST_RETIRED.ALL_STORES events, which came to 61571150 and 18415653, respectively.
At the same time, I measured the cycles of the program, which came to 65388197.
The measurements confuse me: they imply that the average memory access latency during the execution is at most 65388197/(61571150+18415653) ~ 0.817 cycles/access, which is much smaller than the documented L1 hit latency of 4 cycles.
Any insights on why the average memory access latency could be smaller than L1 hit latency?
Today's Intel Core architectures have out-of-order engines, i.e. they can process multiple instructions in parallel and not necessarily in the order of the assembly instructions. (The hardware obviously needs to ensure that the effects observable outside the core are "as if" the instructions were executed in order.) In particular, the core can do multiple loads in parallel. In fact, the latest cores can execute up to two load instructions per cycle! This is the case because, if possible, the core will request data from the memory sub-system beforehand so that, by the time the data is needed, it is already in the core. So the number you computed is not a latency but an average throughput: each L1 hit still takes about 4 cycles, but the latencies of many loads overlap.
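To put the numbers from the question in perspective, here is a back-of-the-envelope check in Python. It is only a sketch: the variable names are mine, and the peak of two loads plus one store per cycle is an assumption about what Skylake's load/store ports can sustain, not something taken from your measurements.

```python
# Counter values quoted in the question.
loads = 61_571_150    # MEM_INST_RETIRED.ALL_LOADS
stores = 18_415_653   # MEM_INST_RETIRED.ALL_STORES
cycles = 65_388_197

# Assumed peak sustained rates per cycle (an assumption, see the text above).
PEAK_LOADS_PER_CYCLE = 2
PEAK_STORES_PER_CYCLE = 1

# Throughput view: how many accesses retire per cycle on average.
loads_per_cycle = loads / cycles
stores_per_cycle = stores / cycles

# The ratio from the question: a reciprocal throughput, not a latency.
cycles_per_access = cycles / (loads + stores)

print(f"loads/cycle:       {loads_per_cycle:.3f} (assumed peak {PEAK_LOADS_PER_CYCLE})")
print(f"stores/cycle:      {stores_per_cycle:.3f} (assumed peak {PEAK_STORES_PER_CYCLE})")
print(f"cycles per access: {cycles_per_access:.3f}")
```

Roughly 0.94 loads and 0.28 stores retire per cycle, which is below the assumed peak, so the measurement is plausible even though every individual L1 hit takes several cycles.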
There is obviously a limit to that. For example, with "dependent loads" the address of a load instruction depends on the data of a previous load. In this case, the core cannot request the data before it has received the data from the previous load. There are also limits on how many load instructions can be "in flight" at the same time.
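The difference between the two cases can be illustrated with a toy model in Python. This is a deliberately simplified sketch, not a simulation of the real pipeline: the function names are hypothetical, every load is assumed to hit L1 with the 4-cycle latency from the question, independent loads are assumed to issue at two per cycle, and the "in flight" limits and all other instructions are ignored.

```python
import math

L1_HIT_LATENCY = 4    # cycles, the documented figure quoted in the question
LOADS_PER_CYCLE = 2   # assumed peak issue rate for independent loads


def dependent_chain_cycles(n_loads, latency=L1_HIT_LATENCY):
    """Each load needs the previous result as its address, so latencies add up."""
    return n_loads * latency


def independent_cycles(n_loads, latency=L1_HIT_LATENCY, per_cycle=LOADS_PER_CYCLE):
    """Loads issue back to back; only the latency of the last one is exposed."""
    issue_cycles = math.ceil(n_loads / per_cycle)
    return issue_cycles - 1 + latency


n = 1_000_000
dep = dependent_chain_cycles(n)
ind = independent_cycles(n)
print(f"dependent loads:   {dep} cycles, {dep / n:.3f} cycles/load")
print(f"independent loads: {ind} cycles, {ind / n:.3f} cycles/load")
```

In this model a chain of dependent loads (e.g. walking a linked list) costs the full latency per load, while independent loads (e.g. summing an array) approach half a cycle per load. Real code sits somewhere in between, which fits an average below one cycle per access.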