Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

PCM - Memory traffic and Flops

Friedrich_G_

Hi All,

I am using the roofline model to assess the performance of my code, so I am measuring flops, cycles and, of course, memory traffic.

To test my measurement techniques I have been testing the BLAS routines daxpy, dgemv, and dgemm.

Initially I used PAPI, finding that it gives a good measurement of both flops and total cycles. However, for measuring memory traffic by means of LLC cache misses, it proved less useful.

Having read a few things, I realise this is due in part to the lack of uncore events etc. OK, so I have now used PCM to measure DRAM traffic with the following code:

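#include <cstdlib>        // srand, exit
#include <ctime>          // time
#include <fstream>
#include <iomanip>        // setw, left
#include <iostream>
#include <cblas.h>        // cblas_daxpy (assumed header; with MKL this would be mkl.h)
#include "cpucounters.h"  // Intel PCM

using namespace std;

// fRand(), init_V() and RUNS are defined elsewhere in my code (not shown here).

int main()
{ double *X, *Y;
  int incx=1, incy=1, n;
  double alpha;
  cout << "Choose Array Size:" << endl;
  cin >> n;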

// open a file in write mode.
  ofstream datFile;
  datFile.open("output.txt", ofstream::out | ofstream::app);

// Allocate n elements per vector ("new double" alone allocates a single double)
  X = new double[n];
  Y = new double[n];


// Initialise prng seed
  srand(time(NULL));


// Define alpha
  alpha = fRand(0, 5);

// Define vector contents (daxpy also reads Y, so it must be initialised too)
  init_V(X, n);
  init_V(Y, n);

//  for (int j = 0; j<REPEATS; ++j)

  PCM * m = PCM::getInstance();
  PCM::ErrorCode returnResult = m->program();
  if (returnResult != PCM::Success)
  { std::cerr << "Intel's PCM couldn't start" << std::endl;
    std::cerr << "Error code: " << returnResult << std::endl;
    exit(1);
  }


  SystemCounterState before_sstate = getSystemCounterState();
  for(int i = 0; i<RUNS; ++i)
  { cblas_daxpy(n, alpha, X, incx, Y, incy);
  }
  SystemCounterState after_sstate = getSystemCounterState();

// Bytes read from the memory controller over all RUNS calls
  datFile << left << setw(15) << getBytesReadFromMC(before_sstate,after_sstate) << endl;

  m->cleanup();   // release the counters programmed by PCM
  return 0;
}

This seems to give a good measurement as my operational intensity is now as expected for daxpy, dgemv and dgemm...so why the post?

Looking at the implementation of getBytesReadFromMC, it seems a little "too easy", as I cannot see any mention of the events I think I am measuring. For my Sandy Bridge architecture I believe these are the following:

UNC_CBO_CACHE_LOOKUP.I
UNC_CBO_CACHE_LOOKUP.ANY_REQUEST_FILTER
UNC_ARB_TRK_REQUEST.EVICTIONS

Beyond this, I would also like to include the flops measurement in the PCM code, and cannot find a way of doing this either. What I am trying to measure is FP_COMP_OPS_EXE.SSE_FP_SCALAR_DOUBLE.

I don't see how to link the events I wish to measure to the PCM counters.

I hope this post was clear in its message and I look forward to your response,

Friedrich

McCalpinJohn

I don't know how to configure PCM to count alternate events, but for the Sandy Bridge platform you need to be careful with the FLOP-counting events. These events (both FP_COMP_OPS_EXE.* and SIMD_FP_256.*) will overcount significantly if there are memory stalls in loading the data used as inputs for the arithmetic instructions. I discuss this issue as well as a workaround at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/564455

 
