Performance impact of DTLB misses

Roman_A_ · ‎01-16-2015

I apologise in advance if I'm posting in the wrong forum.

I'm having an issue (or simply a misunderstanding) with TLB miss measurement.

To test our measurement setup, I'm measuring a poorly optimized multiplication of two 1000x1000 matrices, using the code below:

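/**
 * Performs a parallelized matrix multiplication in I, J, K order.
 * Computes c = a * b for n x n matrices stored as arrays of row pointers.
 */
void Parallel_IJK(int n, int** a, int** b, int** c){
   // Each outer iteration computes one row i of c.
   parallel_for (0, n, [&](int i)
   {
      for (int j=0; j<n; j++) {
         int sum = 0;
         for (int k=0; k<n; k++)
            sum += a[i][k] * b[k][j];   // b is read down a column: a different row each step
         c[i][j] = sum;
      }
   });
}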

I'm measuring it on the following Intel microprocessor: http://ark.intel.com/products/37150/Intel-Core-i7-950-Processor-8M-Cache-3_06-GHz-4_80-GTs-Intel-QPI

To read the counters I'm using the Intel Performance Counter Monitor utility, modified to measure TLB counters. The performance impact itself is computed with the formula on page 24 of this Intel guide ( https://software.intel.com/sites/default/files/88/fe/core-i7-processor-family-vtune-sw-opt-guide-1.1.pdf ): TLB MISSES = ((DTLB_LOAD_MISSES.WALK_COMPLETED * 30) / CPU_CLK_UNHALTED.THREAD).
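For reference, the formula above can be written as a small helper that is applied to the counter deltas of each sampling interval. This is only a sketch: the function name and the default 30-cycle walk cost parameter are illustrative, not part of the Intel PCM API. It also shows why the result is not bounded by 1: the 30 cycles per walk is a fixed estimate, so whenever the number of completed walks in an interval exceeds 1/30 of the unhalted cycles, the ratio goes above 1.

```cpp
#include <cstdint>

// Estimated fraction of unhalted cycles attributed to completed DTLB load
// page walks, using the fixed per-walk cost from the formula above.
// Both counter arguments are deltas over one sampling interval.
// NOTE: the name and the default parameter are hypothetical, for illustration.
double dtlb_miss_impact(std::uint64_t walks_completed,
                        std::uint64_t unhalted_cycles,
                        double cycles_per_walk = 30.0) {
    if (unhalted_cycles == 0) {
        return 0.0;  // no cycles counted in this interval; avoid dividing by zero
    }
    return (static_cast<double>(walks_completed) * cycles_per_walk) /
           static_cast<double>(unhalted_cycles);
}
```

With 1000 completed walks in 100000 unhalted cycles this gives about 0.3; with 10000 walks in the same number of cycles it gives about 3, i.e. a "300%" impact.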

Here's the issue: the TLB impact on a single core seems to be very closely correlated with L2 misses. My questions are:

Is it normal that TLB misses are correlated with L2 misses (there seem to be roughly as many L2 misses as TLB misses) while there are very few L3 misses? I'm not an expert in this domain, but my understanding was that TLB misses should correlate with L3 misses rather than L2 misses; the data tells me otherwise.
When I sample at a high rate (every millisecond), the TLB impact formula often gives me a ratio higher than 1, sometimes even 5 (= 500%). The same goes for the L2 performance impact. The documentation says the "ratio that is usually between 0 and 1 ; in some cases could be >1.0 due to a lower memory latency estimation". Could you elaborate?

Below are some graphs to illustrate my issue, measuring and sampling my matrix multiplication over a 100 ms interval. They show just a single core of the system.

TimP · ‎01-16-2015

As the TLB is not large enough to cover L3, it may not be surprising that a large fraction of L2 misses also miss the TLB. I've had cases myself where L3 became ineffective as the number of threads increased, even though there did not appear to be significant L3 misses.
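The "TLB does not cover L3" point can be checked with simple arithmetic: TLB reach is the number of entries times the page size. The sketch below assumes a 512-entry second-level TLB for 4 KiB pages, which is my understanding of this processor generation but should be verified against Intel's optimization manual; the 8 MiB L3 size comes from the "8M Cache" in the processor name linked above. The function and constant names are illustrative.

```cpp
#include <cstdint>

// Bytes of address space a TLB can map at once: entries * page size.
constexpr std::uint64_t tlb_reach_bytes(std::uint64_t entries,
                                        std::uint64_t page_bytes) {
    return entries * page_bytes;
}

// True if the TLB can map a whole cache of the given size at once.
constexpr bool tlb_covers_cache(std::uint64_t entries,
                                std::uint64_t page_bytes,
                                std::uint64_t cache_bytes) {
    return tlb_reach_bytes(entries, page_bytes) >= cache_bytes;
}

// ASSUMPTION: 512 second-level TLB entries for 4 KiB pages (verify for your CPU).
constexpr std::uint64_t kAssumedStlbEntries = 512;
constexpr std::uint64_t kPageBytes = 4096;               // 4 KiB pages
constexpr std::uint64_t kL3Bytes = 8ull * 1024 * 1024;   // 8 MiB L3

// 512 * 4 KiB = 2 MiB of reach, a quarter of the 8 MiB L3.
static_assert(tlb_reach_bytes(kAssumedStlbEntries, kPageBytes) == 2ull * 1024 * 1024,
              "reach under the assumed entry count");
static_assert(!tlb_covers_cache(kAssumedStlbEntries, kPageBytes, kL3Bytes),
              "the assumed TLB cannot map the whole L3");
```

Under these assumptions, data can sit in L3 while its translation is no longer in the TLB, so an access can miss L2, hit L3, and still need a page walk.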

I would be concerned about the impact of not setting thread affinity, since pinning threads might improve L2 locality.
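As a rough sketch of what setting affinity could look like, here is a helper that pins the calling thread to one logical CPU. It assumes Linux with glibc (the thread does not say which OS is used), and the helper name is hypothetical. With a task library's parallel_for the worker threads belong to the runtime, so a call like this would have to be made from inside each worker, or the runtime's own affinity controls used instead.

```cpp
#include <sched.h>

// Pin the calling thread to a single logical CPU (Linux/glibc only).
// Returns true on success, false if the CPU index is out of range or
// the kernel rejects the request (e.g. the CPU is not available to us).
// NOTE: hypothetical helper name, for illustration only.
bool pin_current_thread_to_cpu(int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE) {
        return false;  // index cannot be represented in a cpu_set_t
    }
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // pid 0 means "the calling thread" for sched_setaffinity.
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}
```

A pinned thread keeps refilling the same core's private L2 instead of migrating and starting cold on another core, which is the locality effect mentioned above.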

I was assuming you have 4K pages, and transparent huge pages may not save the day.
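To get a feel for how page size interacts with the inner loop of the posted code, the sketch below counts how many distinct pages a column walk touches. It assumes 4-byte ints, so each 1000-element row is 4000 bytes, and it assumes the rows are laid out back to back starting at a page-aligned address. The original int** allocation may not look like this, so treat the numbers as an illustration only; the function name is hypothetical.

```cpp
#include <cstdint>

// Count the distinct pages touched by `count` accesses that are
// `stride_bytes` apart, starting at a page-aligned address.
// Addresses only increase, so each change of page index is a new page.
// NOTE: hypothetical helper; assumes rows are contiguous in memory.
std::uint64_t distinct_pages_touched(std::uint64_t count,
                                     std::uint64_t stride_bytes,
                                     std::uint64_t page_bytes) {
    if (count == 0 || page_bytes == 0) {
        return 0;
    }
    std::uint64_t pages = 1;      // the first access always touches one page
    std::uint64_t prev_page = 0;  // page index of the first access
    for (std::uint64_t i = 1; i < count; ++i) {
        const std::uint64_t page = (i * stride_bytes) / page_bytes;
        if (page != prev_page) {
            ++pages;
            prev_page = page;
        }
    }
    return pages;
}
```

For one pass down a column of b (1000 accesses, 4000 bytes apart), this gives 976 pages with 4 KiB pages and 2 pages with 2 MiB pages. Whether transparent huge pages actually back such an allocation depends on the system and the allocator, which fits the caution above.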