Hi All,
I am using the roofline model to assess the performance of my code, so I am measuring flops, cycles and, of course, memory traffic.
To test my measurement techniques I have been testing the BLAS routines daxpy, dgemv, and dgemm.
Initially I used PAPI, finding that it gives a good measurement for both flops and total cycles. However, for measuring memory traffic by means of LLC cache misses, it proved less useful.
Having read a few things, I realise this is due in part to the lack of uncore events. So I have now used PCM to measure DRAM traffic with the code below:
    // Excerpt: fRand, init_V and RUNS are defined elsewhere in the program.
    int main()
    {
        double *X, *Y;
        int incx = 1, incy = 1, n;
        double alpha;

        cout << "Choose Array Size:" << endl;
        cin >> n;

        // Open the output file in append mode.
        ofstream datFile;
        datFile.open("output.txt", ofstream::out | ofstream::app);

        // Allocate n elements per vector (a bare "new double" allocates only one).
        X = new double[n];
        Y = new double[n];

        // Initialise PRNG seed
        srand(time(NULL));

        // Define alpha
        alpha = fRand(0, 5);

        // Define vector content; Y is also read by daxpy, so initialise it too.
        init_V(X, n);
        init_V(Y, n);

        // for (int j = 0; j<REPEATS; ++j)
        PCM * m = PCM::getInstance();
        PCM::ErrorCode returnResult = m->program();
        if (returnResult != PCM::Success) {
            std::cerr << "Intel's PCM couldn't start" << std::endl;
            std::cerr << "Error code: " << returnResult << std::endl;
            exit(1);
        }

        // Snapshot the system-wide counters around the measured region.
        SystemCounterState before_sstate = getSystemCounterState();
        for (int i = 0; i < RUNS; ++i) {
            cblas_daxpy(n, alpha, X, incx, Y, incy);
        }
        SystemCounterState after_sstate = getSystemCounterState();

        datFile << left << setw(15)
                << getBytesReadFromMC(before_sstate, after_sstate) << endl;

        delete[] X;
        delete[] Y;
        return 0;
    }
This seems to give a good measurement as my operational intensity is now as expected for daxpy, dgemv and dgemm...so why the post?
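As a sanity check on what "as expected" means here, the following is a minimal sketch of the idealised operational intensity for the three kernels. It assumes square operands of dimension n, leading-order flop counts, and that every operand crosses the memory bus exactly once (no cache reuse effects beyond that, no prefetch overshoot). The names KernelModel, daxpyModel, dgemvModel, dgemmModel, operationalIntensity and readOnlyIntensity are illustrative and not part of PCM or BLAS.

```cpp
#include <cassert>
#include <cstddef>

// Idealised flop and DRAM-traffic counts for a double-precision kernel,
// assuming every operand crosses the memory bus exactly once.
struct KernelModel {
    double flops;
    double bytesRead;
    double bytesWritten;
};

// y = alpha*x + y on vectors of length n:
// one multiply and one add per element; x and y are read, y is written.
KernelModel daxpyModel(std::size_t n) {
    const double w = sizeof(double);
    const double len = static_cast<double>(n);
    return {2.0 * len, 2.0 * len * w, len * w};
}

// y = alpha*A*x + beta*y with a square n-by-n matrix A (leading-order flops):
// A, x and y are read, y is written.
KernelModel dgemvModel(std::size_t n) {
    const double w = sizeof(double);
    const double len = static_cast<double>(n);
    return {2.0 * len * len, (len * len + 2.0 * len) * w, len * w};
}

// C = alpha*A*B + beta*C with square n-by-n matrices (leading-order flops):
// A, B and C are read, C is written.
KernelModel dgemmModel(std::size_t n) {
    const double w = sizeof(double);
    const double len = static_cast<double>(n);
    return {2.0 * len * len * len, 3.0 * len * len * w, len * len * w};
}

// Flops per byte of total (read + write) DRAM traffic.
double operationalIntensity(const KernelModel& k) {
    const double bytes = k.bytesRead + k.bytesWritten;
    assert(bytes > 0.0);
    return k.flops / bytes;
}

// Flops per byte when only the read side of the traffic is counted.
double readOnlyIntensity(const KernelModel& k) {
    assert(k.bytesRead > 0.0);
    return k.flops / k.bytesRead;
}
```

Under these assumptions daxpy comes out at 1/12 flop per byte of total traffic, but 1/8 flop per byte if only reads are counted, so a measurement built on a read-only counter such as getBytesReadFromMC should be compared against the read-only figure.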
Looking at the implementation of getBytesReadFromMC, it seems a little "too easy", as I cannot see any mention of the events I think I am measuring. For my Sandy Bridge architecture, I think these are the following:
UNC_CBO_CACHE_LOOKUP.I
UNC_CBO_CACHE_LOOKUP.ANY_REQUEST_FILTER
UNC_ARB_TRK_REQUEST.EVICTIONS
In addition, I would like to include the flops measurement in the PCM code, but cannot find a way of doing this either. The event I am trying to measure is FP_COMP_OPS_EXE.SSE_FP_SCALAR_DOUBLE.
I don't see how to link the events I wish to measure to the PCM counters...
I hope this post was clear in its message and I look forward to your response,
Friedrich
I don't know how to configure PCM to count alternate events, but for the Sandy Bridge platform you need to be careful with the FLOP-counting events. These events (both FP_COMP_OPS_EXE.* and SIMD_FP_256.*) will overcount significantly if there are memory stalls in loading the data used as inputs for the arithmetic instructions. I discuss this issue as well as a workaround at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/564455
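To make the overcounting concern concrete, here is a minimal sketch of how raw event counts would be weighted into a double-precision flop total and then compared against the analytically expected count. It assumes three counts have already been collected by some tool: scalar double operations, 128-bit packed double operations (2 elements each) and 256-bit packed double operations (4 elements each). The struct and function names are hypothetical, and the sketch does not show how to program the counters.

```cpp
#include <cassert>
#include <cstdint>

// Raw counts of double-precision FP events, as read from the counters.
// Field names are illustrative, not the hardware event names.
struct FpEventCounts {
    std::uint64_t scalarDouble;     // 1 operation per event
    std::uint64_t packedDouble128;  // 2 doubles per 128-bit SSE operation
    std::uint64_t packedDouble256;  // 4 doubles per 256-bit AVX operation
};

// Weighted flop total. Because the events can overcount when inputs are
// delayed by memory stalls, treat this as an upper bound, not an exact value.
std::uint64_t flopsUpperBound(const FpEventCounts& c) {
    return c.scalarDouble + 2 * c.packedDouble128 + 4 * c.packedDouble256;
}

// Ratio of measured to analytically expected flops (e.g. 2*n*RUNS for daxpy).
// Values well above 1 suggest the counters are overcounting.
double overcountRatio(std::uint64_t measuredFlops, double expectedFlops) {
    assert(expectedFlops > 0.0);
    return static_cast<double>(measuredFlops) / expectedFlops;
}
```

For a kernel like daxpy, where the expected flop count is known exactly, checking this ratio on both cache-resident and memory-bound problem sizes is one way to see how large the overcount is before relying on the counters for other code.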