I was trying to calculate the time my application spends in the cache by using the published latencies of 4 cycles for L1, 12 for L2, etc., along with the number of L1 and L2 loads my application makes. I realized that since Sandy Bridge can issue 2 loads per cycle, my results were wonky. My question is: how do I go about calculating the time spent in the cache alone, given that parallel loads to L1 are possible?
Intel processors use what is known as an out-of-order architecture. This means that many instructions are in flight at any given time. When an instruction misses in a cache level - or is delayed for any other reason - and has to wait, other instructions may be able to make progress, so you cannot assume that nothing else is happening while an instruction waits for a cache response. Multiplying load counts by published latencies therefore overestimates the real cost: it gives you an upper bound on memory time, not the actual stall time.