Software Tuning, Performance Optimization & Platform Monitoring

Sandy Bridge i5-2500k uncore ARB events

Ying_Y_ — Tue, 08 Mar 2016 16:58:46 GMT

Hi All,

I read the Intel SDM and found two uncore events: UNC_ARB_TRK_REQUESTS.ALL and UNC_ARB_TRK_OCCUPANCY.ALL. As I understand it, ARB occupancy counts memory requests waiting to be serviced by the iMC, so the average memory request latency can be calculated as UNC_ARB_TRK_OCCUPANCY.ALL / UNC_ARB_TRK_REQUESTS.ALL. Is that correct?

I tried to monitor the memory request latency by setting up MSRs in the following way:

/* enable uncore monitor */
wrmsr(MSR_UNC_PERF_GLOBAL_CTRL, 0x20000000);

int i;

/* need to lock bus to initialize occupancy */

asm volatile ("xchgl %0, %%eax"::"m" (i):"eax");

wrmsr(MSR_UNC_ARB_PERFEVTSEL0, 0x80 | (0x01 << 8) | (0x01 << 22));

wrmsr(MSR_UNC_ARB_PERFEVTSEL1, 0x81 | (0x01 << 8) | (0x01 << 22));

Periodically I checked the latency value by doing:

arb_occupancy = rdmsr(MSR_UNC_ARB_PER_CTR0);

arb_request = rdmsr(MSR_UNC_ARB_PER_CTR1);

latency = arb_occupancy / arb_request;
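
A note on periodic sampling: if the two counters are never reset, dividing their raw values gives the average since counting was enabled rather than the average for the latest interval. The sketch below computes the ratio from per-interval deltas instead. The helper names (arb_delta, arb_avg_latency) are hypothetical, and the 44-bit counter width is an assumption that should be checked against the SDM for the part in question.

```c
#include <stdint.h>

/* Assumption: the ARB counters are 44 bits wide; verify for your CPU. */
#define ARB_CTR_BITS 44
#define ARB_CTR_MASK ((UINT64_C(1) << ARB_CTR_BITS) - 1)

/* Difference between two raw counter reads, tolerating a single wraparound. */
static uint64_t arb_delta(uint64_t prev, uint64_t now)
{
    return (now - prev) & ARB_CTR_MASK;
}

/*
 * Average request latency (in counter clock cycles) over one sampling
 * interval: delta(occupancy) / delta(requests).
 * Returns 0.0 when no requests were counted in the interval.
 */
static double arb_avg_latency(uint64_t occ_prev, uint64_t occ_now,
                              uint64_t req_prev, uint64_t req_now)
{
    uint64_t d_occ = arb_delta(occ_prev, occ_now);
    uint64_t d_req = arb_delta(req_prev, req_now);

    return d_req ? (double)d_occ / (double)d_req : 0.0;
}
```

In the monitoring loop, the previous rdmsr values would be kept and passed in as occ_prev and req_prev on the next sample.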

I ran two benchmarks, A and B. A performs sequential memory accesses (int array[...], much larger than the L3 cache, built as a 32-bit program):

while (1) {
  for (i = 0; i < SIZE; i ++) {
    tmp += i;
    array[i] += i;
    tmp = (tmp + 1) % 1030;
  }
}

B performs pseudo-random memory accesses:

while (1) {
  for (k = 0; k < 1030; k++) {
    j = 0;
    for (i = k; i < SIZE; i += 1030) {
      index = i + j;
      array[index] += index;
      j = (j + 1) % 1030;
    }
  }
}

I ran A and B separately on an otherwise idle system and monitored their latencies. I expected the sequential benchmark to show lower latency than the random one, but the two were the same.

So did I misunderstand those ARB events? Did I set up the MSRs incorrectly? Or is there an issue with my benchmarks? Any thoughts would be much appreciated. Thanks!

A_T_Intel — Wed, 16 Mar 2016 00:27:26 GMT

Hi Ying Y.

The formula you have for the average memory request latency looks correct to me (UNC_ARB_TRK_OCCUPANCY.ALL / UNC_ARB_TRK_REQUESTS.ALL). This will give you the latency in terms of core clocks.

Are you getting reasonable values for UNC_ARB_TRK_REQUESTS.ALL for both cases?

If so, my first thought is that the hardware prefetchers are adequately prefetching both cases. I would first try to disable the hardware prefetchers.
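
As a rough sketch of what disabling the hardware prefetchers can look like from user space on Linux: Intel has disclosed a prefetcher-control MSR at address 0x1A4 for a range of client and server cores, where setting each of bits 0 to 3 disables one prefetcher. Whether it applies to a given part should be verified first. The helper names below are hypothetical, and the code goes through the Linux msr driver (/dev/cpu/N/msr, which needs root and the msr module loaded) rather than the wrmsr() helper used earlier in the thread.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* for pread/pwrite under strict -std modes */
#endif
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Assumption: prefetcher control lives in MSR 0x1A4 on this CPU. */
#define MSR_PREFETCH_CTRL 0x1A4
#define PF_L2_STREAMER  (1u << 0)  /* L2 hardware prefetcher */
#define PF_L2_ADJACENT  (1u << 1)  /* L2 adjacent cache line prefetcher */
#define PF_DCU_STREAMER (1u << 2)  /* L1D (DCU) streamer prefetcher */
#define PF_DCU_IP       (1u << 3)  /* L1D (DCU) IP-based prefetcher */
#define PF_ALL (PF_L2_STREAMER | PF_L2_ADJACENT | PF_DCU_STREAMER | PF_DCU_IP)

/* New MSR value with the selected prefetchers disabled (bit set = disabled). */
static uint64_t prefetch_disable_value(uint64_t old, unsigned mask)
{
    return old | (mask & PF_ALL);
}

/*
 * Read-modify-write the MSR on one logical CPU via the Linux msr driver.
 * Returns 0 on success, -1 on any failure (no device, no permission, ...).
 */
static int disable_prefetchers(int cpu, unsigned mask)
{
    char path[64];
    uint64_t val;
    int rc = -1;

    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    if (pread(fd, &val, sizeof val, MSR_PREFETCH_CTRL) == (ssize_t)sizeof val) {
        val = prefetch_disable_value(val, mask);
        if (pwrite(fd, &val, sizeof val, MSR_PREFETCH_CTRL) == (ssize_t)sizeof val)
            rc = 0;
    }
    close(fd);
    return rc;
}
```

The MSR is per core, so a call such as disable_prefetchers(cpu, PF_ALL) would need to be made for the core running the benchmark, and the original value restored afterwards.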

If not, then I'd look at the assembly and make sure the compiler is doing what you expect with your code.

Ying_Y_ — Fri, 18 Mar 2016 04:05:07 GMT

Thanks Perry! I think I figured it out! The problem is that, in the pseudo-random access pattern, adjacent accesses go to different memory banks, which gives better bank-level parallelism, so its latency ends up no higher than in the sequential case.

A_T_Intel — Mon, 21 Mar 2016 16:27:30 GMT

Glad to hear it!