Sandy Bridge i5-2500k uncore ARB events

Ying_Y_ · 03-08-2016

Hi All,

I read the Intel SDM and found two uncore events: UNC_ARB_TRK_REQUESTS.ALL and UNC_ARB_TRK_OCCUPANCY.ALL. As I understand it, ARB occupancy counts memory requests waiting to be serviced by the iMC, so the average memory request latency can be calculated as UNC_ARB_TRK_OCCUPANCY.ALL / UNC_ARB_TRK_REQUESTS.ALL. Is that correct?

I tried to monitor the memory request latency by setting up MSRs in the following way:

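/* enable uncore monitoring (global enable bit 29) */
wrmsr(MSR_UNC_PERF_GLOBAL_CTRL, 0x20000000);

int i;

/* need to lock bus to initialize occupancy */
asm volatile ("xchgl %0, %%eax"::"m" (i):"eax");

/* counter 0: event 0x80, umask 0x01 (occupancy), enable bit 22 */
wrmsr(MSR_UNC_ARB_PERFEVTSEL0, 0x80 | (0x01 << 8) | (0x01 << 22));
/* counter 1: event 0x81, umask 0x01 (requests), enable bit 22 */
wrmsr(MSR_UNC_ARB_PERFEVTSEL1, 0x81 | (0x01 << 8) | (0x01 << 22));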
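The event-select values written above follow one pattern: event code in the low byte, unit mask in bits 8-15, enable in bit 22. A minimal sketch of a helper that builds them is below; the name arb_evtsel is hypothetical and not from the post, and the field layout is taken only from the constants shown there.

```c
#include <stdint.h>

/* Field positions as used in the post: event code in bits 0-7,
 * unit mask in bits 8-15, counter enable in bit 22. */
#define ARB_EVTSEL_ENABLE (UINT64_C(1) << 22)

/* Hypothetical helper (not from the post): build a PERFEVTSEL value. */
static uint64_t arb_evtsel(uint8_t event, uint8_t umask)
{
    return (uint64_t)event | ((uint64_t)umask << 8) | ARB_EVTSEL_ENABLE;
}
```

With this, arb_evtsel(0x80, 0x01) and arb_evtsel(0x81, 0x01) give the same two values as the hand-written expressions.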

Periodically I checked the latency value by doing:

arb_occupancy = rdmsr(MSR_UNC_ARB_PER_CTR0);
arb_request = rdmsr(MSR_UNC_ARB_PER_CTR1);
latency = arb_occupancy / arb_request;
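For periodic sampling, dividing the raw counter totals gives an average since the counters were started, not for the latest interval. A sketch of a per-interval version follows. The struct and function names are hypothetical, and the 44-bit counter width is an assumption that should be checked for the part in use.

```c
#include <stdint.h>

/* One snapshot of the two ARB counters (hypothetical struct, not from the post). */
struct arb_sample {
    uint64_t occupancy; /* raw UNC_ARB_TRK_OCCUPANCY.ALL count */
    uint64_t requests;  /* raw UNC_ARB_TRK_REQUESTS.ALL count */
};

/* Assumption: the counters are 44 bits wide; adjust the mask if yours differ. */
#define ARB_CTR_MASK ((UINT64_C(1) << 44) - 1)

/* Average occupancy per request between two snapshots.
 * Masking the difference tolerates a single counter wrap.
 * Returns 0.0 when no requests were counted in the interval. */
static double arb_avg_latency(struct arb_sample prev, struct arb_sample cur)
{
    uint64_t d_occ = (cur.occupancy - prev.occupancy) & ARB_CTR_MASK;
    uint64_t d_req = (cur.requests - prev.requests) & ARB_CTR_MASK;

    if (d_req == 0)
        return 0.0;
    return (double)d_occ / (double)d_req;
}
```

Using floating point here also avoids the truncation that integer division would introduce when the two benchmarks differ by less than one clock.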

I ran two benchmarks A and B. A is sequential memory access (int array[...], size much larger than L3 cache, 32 bit program):

while (1) {
  for (i = 0; i < SIZE; i ++) {
    tmp += i;
    array[i] += i;
    tmp = (tmp + 1) % 1030;
  }
}

B is pseudo random memory access:

while (1) {
  for (k = 0; k < 1030; k++) {
    j = 0;
    for (i = k; i < SIZE; i += 1030) {
      index = i + j;
      array[index] += index;
      j = (j + 1) % 1030;
    }
  }
}

I ran A and B separately on an otherwise idle system and monitored their latencies. I expected the sequential benchmark to show lower latency than the random one, but the two were the same.

So did I misunderstand the ARB events, set the MSRs incorrectly, or is there a problem with my benchmarks? Any thoughts would be much appreciated. Thanks!

A_T_Intel · 03-15-2016

Hi Ying Y.

The formula you have for the average memory request latency looks correct to me (UNC_ARB_TRK_OCCUPANCY.ALL / UNC_ARB_TRK_REQUESTS.ALL). This will give you the latency in terms of core clocks.

Are you getting reasonable values for UNC_ARB_TRK_REQUESTS.ALL for both cases?

If so, my first thought is that the hardware prefetchers are adequately prefetching both cases. I would first try to disable the hardware prefetchers.

If not, then I'd look at the assembly and make sure the compiler is doing what you expect with your code.
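For the prefetcher suggestion, a rough sketch of the bit manipulation is below. It assumes the prefetcher control lives in MSR 0x1A4 with one disable bit per prefetcher in bits 0-3, which is how Intel has described it for this processor generation; please verify against Intel's documentation before writing anything. The macro and function names are illustrative only.

```c
#include <stdint.h>

/* Assumption: MSR 0x1A4 holds the prefetcher controls on this part,
 * and a set bit means the corresponding prefetcher is DISABLED. */
#define MSR_PREFETCH_CONTROL 0x1A4u               /* illustrative name */
#define PF_L2_STREAMER  (UINT64_C(1) << 0)        /* L2 hardware prefetcher */
#define PF_L2_ADJACENT  (UINT64_C(1) << 1)        /* L2 adjacent cache line prefetcher */
#define PF_DCU_STREAMER (UINT64_C(1) << 2)        /* L1 data cache (DCU) prefetcher */
#define PF_DCU_IP       (UINT64_C(1) << 3)        /* L1 data cache IP prefetcher */
#define PF_ALL (PF_L2_STREAMER | PF_L2_ADJACENT | PF_DCU_STREAMER | PF_DCU_IP)

/* Given the current MSR value, return the value that disables all four
 * prefetchers while leaving every other bit unchanged. */
static uint64_t prefetch_disable_all(uint64_t old)
{
    return old | PF_ALL;
}

/* Inverse: re-enable all four prefetchers, preserving other bits. */
static uint64_t prefetch_enable_all(uint64_t old)
{
    return old & ~PF_ALL;
}
```

The intended use is read-modify-write on each core: read the MSR, pass the value through prefetch_disable_all, write it back, and restore it with prefetch_enable_all after the measurement.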

Ying_Y_ · 03-17-2016

Thanks Perry! I think I figured it out. The problem is that in the pseudo-random benchmark, adjacent accesses go to different memory banks, which gives better bank-level parallelism.
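To make the access pattern of benchmark B concrete, here is a small sketch that reproduces the indices its inner loop touches for a given k. The helper name is hypothetical. How these addresses map onto DRAM banks depends on the memory controller's address mapping, which is not given in this thread, so the sketch only shows the stride, not the bank assignment.

```c
#include <stddef.h>

#define STEP 1030 /* loop constant from benchmark B */

/* Hypothetical helper: write the first n indices that benchmark B's inner
 * loop touches for a given k into out[], and return how many were written. */
static size_t bench_b_indices(size_t k, size_t size, size_t *out, size_t n)
{
    size_t count = 0;
    size_t j = 0;

    for (size_t i = k; i < size && count < n; i += STEP) {
        out[count++] = i + j;   /* same index computation as benchmark B */
        j = (j + 1) % STEP;
    }
    return count;
}
```

For k = 0 the first indices are 0, 1031, 2062, 3093. With 4-byte ints, consecutive accesses are 4124 bytes apart, so each one lands in a different cache line and slightly more than 4 KiB past the previous one.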

A_T_Intel · 03-21-2016

Glad to hear it!