Sandy Bridge i5-2500k uncore ARB events

Ying_Y_
Beginner
593 Views

Hi All,

I read the Intel SDM and found two uncore events: UNC_ARB_TRK_REQUESTS.ALL and UNC_ARB_TRK_OCCUPANCY.ALL. As I understand it, ARB occupancy counts, cycle by cycle, the memory requests waiting to be serviced by the iMC, so the average memory request latency can be calculated as UNC_ARB_TRK_OCCUPANCY.ALL / UNC_ARB_TRK_REQUESTS.ALL. Is that correct?

I tried to monitor the memory request latency by setting up MSRs in the following way:

/* enable uncore monitor */
wrmsr(MSR_UNC_PERF_GLOBAL_CTRL, 0x20000000);

int i = 0;

/* need to lock bus to initialize occupancy */
/* xchg writes its memory operand, so i is declared read-write ("+m") */
asm volatile ("xchgl %0, %%eax" : "+m" (i) : : "eax");

wrmsr(MSR_UNC_ARB_PERFEVTSEL0, 0x80 | (0x01 << 8) | (0x01 << 22));

wrmsr(MSR_UNC_ARB_PERFEVTSEL1, 0x81 | (0x01 << 8) | (0x01 << 22));
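The two event-select values above pack an event code, a unit mask, and an enable bit into one word. A minimal sketch of that encoding follows, assuming only the field positions visible in the writes above (event code in the low byte, unit mask from bit 8, enable at bit 22). The macro and function names are illustrative and do not come from the SDM or any header.

```c
#include <stdint.h>

/* Field positions as used in the MSR writes above (names are illustrative). */
#define ARB_EVTSEL_UMASK_SHIFT  8                   /* unit mask starts at bit 8 */
#define ARB_EVTSEL_ENABLE       (UINT64_C(1) << 22) /* per-counter enable bit */

#define ARB_EVENT_TRK_OCCUPANCY 0x80u
#define ARB_EVENT_TRK_REQUESTS  0x81u
#define ARB_UMASK_ALL           0x01u

/* Build an event-select value from an event code and a unit mask. */
static inline uint64_t arb_evtsel(uint8_t event, uint8_t umask)
{
    return (uint64_t)event
         | ((uint64_t)umask << ARB_EVTSEL_UMASK_SHIFT)
         | ARB_EVTSEL_ENABLE;
}
```

With this helper, `arb_evtsel(ARB_EVENT_TRK_OCCUPANCY, ARB_UMASK_ALL)` produces the same value as the hand-written expression for counter 0.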

Periodically I checked the latency value by doing:

arb_occupancy = rdmsr(MSR_UNC_ARB_PER_CTR0);

arb_request = rdmsr(MSR_UNC_ARB_PER_CTR1);

latency = arb_occupancy / arb_request;
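Because both counters run freely, the ratio of the raw readings is an average over everything since the counters were started. For a per-interval figure, the ratio can be taken over the differences between two consecutive samples. The sketch below shows one way to do that. The helper names are hypothetical, and the counter width is left as a parameter because it is an assumption to check against the SDM for the part in question.

```c
#include <assert.h>
#include <stdint.h>

/* Difference between two reads of a free-running counter that is
 * width_bits wide, allowing for at most one wrap between the reads. */
static uint64_t counter_delta(uint64_t prev, uint64_t cur, unsigned width_bits)
{
    assert(width_bits >= 1 && width_bits <= 63);
    uint64_t mask = (UINT64_C(1) << width_bits) - 1;
    return (cur - prev) & mask;
}

/* Average latency over one sampling interval, in the clock domain the
 * counters tick in. Returns 0.0 if no requests were counted. */
static double interval_latency(uint64_t occ_prev, uint64_t occ_cur,
                               uint64_t req_prev, uint64_t req_cur,
                               unsigned width_bits)
{
    uint64_t occ = counter_delta(occ_prev, occ_cur, width_bits);
    uint64_t req = counter_delta(req_prev, req_cur, width_bits);
    return req ? (double)occ / (double)req : 0.0;
}
```

Using a floating-point result also avoids the truncation of the integer division above, which can hide small differences between two workloads.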

I ran two benchmarks, A and B. A is a sequential memory access (int array[...], size much larger than the L3 cache, 32-bit program):

while (1) {
  for (i = 0; i < SIZE; i ++) {
    tmp += i;
    array[i] += i;
    tmp = (tmp + 1) % 1030;
  }
}

B is a pseudo-random memory access:

while (1) {
  for (k = 0; k < 1030; k++) {
    j = 0;
    for (i = k; i < SIZE; i += 1030) {
      index = i + j;
      array[index] += index;
      j = (j + 1) % 1030;
    }
  }
}
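To see what pattern benchmark B actually generates, the sketch below replays its inner loop and records the first few indices for a given k. The function name and the `STEP` macro are hypothetical, and `size` stands in for SIZE. Until j wraps, consecutive indices differ by a constant 1031 elements (1030 from i plus 1 from j), which is 4124 bytes assuming a 4-byte int, so each access lands in a different cache line and at least one 4 KiB page further on.

```c
#include <stddef.h>

#define STEP 1030  /* same constant as in benchmark B */

/* Write the first n indices touched by the inner loop of benchmark B for a
 * given k into out[]. Returns the number of indices written, which is less
 * than n if the loop would end first. size plays the role of SIZE. */
static size_t bench_b_indices(size_t k, size_t size, size_t *out, size_t n)
{
    size_t count = 0;
    size_t j = 0;
    for (size_t i = k; i < size && count < n; i += STEP) {
        out[count++] = i + j;   /* index = i + j, as in the benchmark */
        j = (j + 1) % STEP;
    }
    return count;
}
```

A fixed stride like this is far more regular than the name "pseudo random" suggests, which is worth keeping in mind when comparing its latency against the sequential case.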

I ran A and B separately on an otherwise idle system and monitored their latencies. I expected the sequential benchmark to show a lower latency than the random one, but the two were the same.

So did I misunderstand those ARB events? Or set up the MSRs the wrong way? Or is there an issue with my benchmarks? Any thoughts would be much appreciated. Thanks!

A_T_Intel
Employee

Hi Ying Y.

The formula you have for the average memory request latency looks correct to me (UNC_ARB_TRK_OCCUPANCY.ALL / UNC_ARB_TRK_REQUESTS.ALL).  This will give you the latency in terms of core clocks. 

Are you getting reasonable values for UNC_ARB_TRK_REQUESTS.ALL for both cases? 

If so, my first thought is that the hardware prefetchers are adequately prefetching both cases. I would first try to disable the hardware prefetchers. 

If not, then I'd look at the assembly and make sure the compiler is doing what you expect with your code.
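One way to try the prefetcher suggestion on Linux is through the `msr` driver, which exposes each logical CPU's MSRs as `/dev/cpu/N/msr`. The sketch below rests on assumptions to verify for your part: that the prefetcher control lives at MSR 0x1A4 with bits 0-3 each disabling one prefetcher when set, as Intel has published for Sandy Bridge-era cores, that the `msr` module is loaded, and that the program runs as root. The function and macro names are hypothetical.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define MSR_PREFETCH_CONTROL  0x1A4          /* assumed MSR address */
#define PREFETCH_DISABLE_BITS UINT64_C(0xF)  /* assumed: bits 0-3, 1 = disabled */

/* Pure helpers: compute the new register value from the old one. */
static uint64_t prefetch_disable_value(uint64_t old) { return old | PREFETCH_DISABLE_BITS; }
static uint64_t prefetch_enable_value(uint64_t old)  { return old & ~PREFETCH_DISABLE_BITS; }

/* Read-modify-write the control MSR on one logical CPU.
 * Returns 0 on success, -1 on any failure. */
static int set_prefetchers(int cpu, int disable)
{
    char path[32];
    uint64_t val;
    int rc = -1;

    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    /* The file offset selects the MSR address. */
    if (lseek(fd, MSR_PREFETCH_CONTROL, SEEK_SET) != (off_t)-1 &&
        read(fd, &val, sizeof val) == (ssize_t)sizeof val) {
        val = disable ? prefetch_disable_value(val) : prefetch_enable_value(val);
        if (lseek(fd, MSR_PREFETCH_CONTROL, SEEK_SET) != (off_t)-1 &&
            write(fd, &val, sizeof val) == (ssize_t)sizeof val)
            rc = 0;
    }
    close(fd);
    return rc;
}
```

The setting applies to the CPU it is written on, so `set_prefetchers(cpu, 1)` would need to be called for every CPU the benchmark can run on, and `set_prefetchers(cpu, 0)` afterwards to restore the default. Some BIOS setups offer equivalent switches.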

Ying_Y_
Beginner

Thanks Perry! I think I figured it out! In the pseudo-random access pattern, adjacent accesses go to different memory banks, which gives better bank-level parallelism.

A_T_Intel
Employee

Glad to hear it!
