Software Tuning, Performance Optimization & Platform Monitoring
Discussion around monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform monitoring

PCM - Memory traffic and Flops

Friedrich_G_
Beginner

Hi All,

I am using the roofline model to characterise the performance of my code, so I am measuring flops, cycles and, of course, memory traffic.

To test my measurement techniques I have been testing the BLAS routines daxpy, dgemv, and dgemm.

Initially I used PAPI, which gave good measurements for both flops and total cycles. However, for measuring memory traffic by way of LLC cache misses, it proved less useful.

Having read around, I realise this is due in part to the lack of uncore events. So I have now used PCM to measure DRAM traffic with the code below:

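#include <cstdlib>       // srand, exit
#include <ctime>         // time
#include <fstream>
#include <iomanip>       // setw, left
#include <iostream>
#include <cblas.h>       // cblas_daxpy (header name depends on the BLAS library in use)
#include "cpucounters.h" // Intel PCM

using namespace std;

// fRand, init_V and RUNS are not shown here.

int main()
{ double *X, *Y;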
  int incx=1, incy=1, n;
  double alpha;
  cout << "Choose Array Size:" << endl;
  cin >> n;

// open a file in write mode.
  ofstream datFile;
  datFile.open("output.txt", ofstream::out | ofstream::app);

  X = new double[n];  // n elements each, not a single double
  Y = new double[n];


// Initialise prng seed
  srand(time(NULL));


// Define alpha
  alpha = fRand(0, 5);

// Define Vector Content
  init_V(X, n);
  init_V(Y, n);  // daxpy reads Y as well as X, so Y must be initialised too

//  for (int j = 0; j<REPEATS; ++j)

  PCM * m = PCM::getInstance();
  PCM::ErrorCode returnResult = m->program();
  if (returnResult != PCM::Success)
  { std::cerr << "Intel's PCM couldn't start" << std::endl;
    std::cerr << "Error code: " << returnResult << std::endl;
    exit(1);
  }


  SystemCounterState before_sstate = getSystemCounterState();
    for(int i = 0; i<RUNS; ++i)
    { cblas_daxpy(n, alpha, X, incx, Y, incy);
    }

  SystemCounterState after_sstate = getSystemCounterState();

  datFile << left << setw(15) << getBytesReadFromMC(before_sstate,after_sstate) << endl;
  return 0;
}

This seems to give a good measurement, as my operational intensity is now as expected for daxpy, dgemv and dgemm. So why the post?
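As a check on what "as expected" means for daxpy, the operational intensity can be worked out by hand. The sketch below is illustrative only: the function names are invented for this example, and it assumes 8-byte doubles and vectors too large to stay in cache between calls, so every call streams X and Y from DRAM.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Operational intensity in flops per byte of DRAM traffic.
double operational_intensity(double flops, double bytes) {
    assert(bytes > 0.0);  // undefined without any memory traffic
    return flops / bytes;
}

// Expected counts for ONE daxpy call (y = alpha*x + y) on vectors of length n.
// Assumption: no cache reuse, so each element moves to or from DRAM exactly once.
double daxpy_flops(std::uint64_t n) {
    return 2.0 * static_cast<double>(n);   // one multiply and one add per element
}
double daxpy_bytes_read(std::uint64_t n) {
    return 16.0 * static_cast<double>(n);  // X and Y are each read once
}
double daxpy_bytes_written(std::uint64_t n) {
    return 8.0 * static_cast<double>(n);   // Y is written back once
}
```

Counting reads only, as the code above does with getBytesReadFromMC, this gives 2n / 16n = 0.125 flops/byte. Including the write-back of Y it drops to 2n / 24n, roughly 0.083 flops/byte. PCM also provides getBytesWrittenToMC, so the write traffic can be recorded from the same pair of counter states.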

Looking at the implementation of getBytesReadFromMC, it seems a little "too easy", as I cannot see any mention of the events I think I am measuring. For my Sandy Bridge architecture, I think these are the following:

UNC_CBO_CACHE_LOOKUP.I
UNC_CBO_CACHE_LOOKUP.ANY_REQUEST_FILTER
UNC_ARB_TRK_REQUEST.EVICTIONS

In addition, I would like to include the flops measurement in the PCM code, but cannot find a way of doing this either. The event I am trying to measure is FP_COMP_OPS_EXE.SSE_FP_SCALAR_DOUBLE.

I don't see how to link the events I wish to measure to the PCM counters.
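For context on what "linking" an event involves: tools that accept raw core events generally want an event-select number and a unit mask rather than the symbolic name. The sketch below shows how such a pair is packed into an IA32_PERFEVTSELx value. The struct and function names are invented for this example, and the encoding 0x10 / 0x80 for FP_COMP_OPS_EXE.SSE_FP_SCALAR_DOUBLE on Sandy Bridge is quoted from memory of the Intel SDM event tables, so it should be checked there before use. How PCM itself expects custom events to be passed is not covered here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: one core PMU event as (event select, unit mask).
struct CoreEvent {
    std::uint8_t event;
    std::uint8_t umask;
};

// Assumed encoding for FP_COMP_OPS_EXE.SSE_FP_SCALAR_DOUBLE on Sandy Bridge;
// verify against the SDM event tables.
constexpr CoreEvent FP_SCALAR_DOUBLE{0x10, 0x80};

// Raw code in the umask:event form, e.g. 0x8010.
std::uint32_t raw_code(CoreEvent e) {
    return (static_cast<std::uint32_t>(e.umask) << 8) | e.event;
}

// Pack an event into the IA32_PERFEVTSELx layout:
// bits 0-7 event select, bits 8-15 unit mask, bit 16 USR, bit 17 OS, bit 22 EN.
std::uint64_t perfevtsel(CoreEvent e, bool user = true, bool kernel = true) {
    assert(user || kernel);  // counting in neither ring would count nothing
    std::uint64_t v = raw_code(e);
    if (user)   v |= std::uint64_t{1} << 16;
    if (kernel) v |= std::uint64_t{1} << 17;
    v |= std::uint64_t{1} << 22;  // enable the counter
    return v;
}
```

With the assumed encoding, raw_code(FP_SCALAR_DOUBLE) is 0x8010 and perfevtsel(FP_SCALAR_DOUBLE) is 0x438010 when counting in both user and kernel mode.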

I hope this post was clear in its message and I look forward to your response,

Friedrich

1 Reply
McCalpinJohn
Black Belt

I don't know how to configure PCM to count alternate events, but for the Sandy Bridge platform you need to be careful with the FLOP-counting events. These events (both FP_COMP_OPS_EXE.* and SIMD_FP_256.*) will overcount significantly if there are memory stalls in loading the data used as inputs for the arithmetic instructions. I discuss this issue, as well as a workaround, at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring...
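One simple way to see how large the overcount is on a kernel such as daxpy is to compare the counter-derived FLOP total with the analytic count of 2n flops per call. The sketch below is not the workaround from the linked thread; it is only a consistency check, with an invented function name, and it assumes the measured value has already been converted from event counts to flops (for example by weighting packed events by their vector width).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Ratio of measured to expected flops for `calls` daxpy calls on length-n vectors.
// A ratio near 1.0 suggests the counters can be trusted for this run;
// a ratio well above 1.0 indicates overcounting.
double daxpy_flop_overcount_ratio(double measured_flops,
                                  std::uint64_t n,
                                  std::uint64_t calls) {
    // daxpy performs one multiply and one add per element.
    const double expected =
        2.0 * static_cast<double>(n) * static_cast<double>(calls);
    assert(expected > 0.0);  // n and calls must both be non-zero
    return measured_flops / expected;
}
```

Running this for a range of n, from cache-resident to DRAM-resident sizes, should show whether the ratio grows as memory stalls become more frequent.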

 
