<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic PCM - Memory traffic and Flops in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/PCM-Memory-traffic-and-Flops/m-p/1085380#M5560</link>
    <description>PCM - Memory traffic and Flops</description>
    <pubDate>Wed, 24 Feb 2016 08:24:42 GMT</pubDate>
    <dc:creator>Friedrich_G_</dc:creator>
    <dc:date>2016-02-24T08:24:42Z</dc:date>
    <item>
      <title>PCM - Memory traffic and Flops</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/PCM-Memory-traffic-and-Flops/m-p/1085380#M5560</link>
      <description>&lt;P&gt;Hi All,&lt;/P&gt;

&lt;P&gt;I am using the roofline model to assess the performance of my code, so I am measuring flops, cycles and, of course, memory traffic.&lt;/P&gt;

&lt;P&gt;To test my measurement techniques I have been testing the BLAS routines daxpy, dgemv, and dgemm.&lt;/P&gt;

&lt;P&gt;Initially I used PAPI, finding that it gives a good measurement for both FLOPS and total cycles. However, when measuring memory traffic by means of LLC cache misses, it proved less useful.&lt;/P&gt;

&lt;P&gt;Having read a few things, I realise this is due in part to the lack of uncore events. So I have now used PCM to measure DRAM traffic with the code attached:&lt;/P&gt;

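&lt;PRE class="brush:bash;"&gt;#include &amp;lt;cstdlib&amp;gt;
#include &amp;lt;ctime&amp;gt;
#include &amp;lt;fstream&amp;gt;
#include &amp;lt;iomanip&amp;gt;
#include &amp;lt;iostream&amp;gt;
#include "cpucounters.h"  // Intel PCM
// Also needs the CBLAS header of the BLAS library in use, plus the
// definitions of RUNS, fRand() and init_V(), which are not shown here.

using namespace std;

int main()
{ double *X, *Y;
  int incx=1, incy=1, n;
  double alpha;
  cout &amp;lt;&amp;lt; "Choose Array Size:" &amp;lt;&amp;lt; endl;
  cin &amp;gt;&amp;gt; n;

// Open the output file in append mode.
  ofstream datFile;
  datFile.open("output.txt", ofstream::out | ofstream::app);

  X = new double[n];
  Y = new double[n];

// Initialise prng seed
  srand(time(NULL));

// Define alpha
  alpha = fRand(0, 5);

// Define vector content (Y is initialised too, since daxpy reads it)
  init_V(X, n);
  init_V(Y, n);

//  for (int j = 0; j&amp;lt;REPEATS; ++j)

// Program the PCM counters
  PCM * m = PCM::getInstance();
  PCM::ErrorCode returnResult = m-&amp;gt;program();
  if (returnResult != PCM::Success)
  { std::cerr &amp;lt;&amp;lt; "Intel's PCM couldn't start" &amp;lt;&amp;lt; std::endl;
    std::cerr &amp;lt;&amp;lt; "Error code: " &amp;lt;&amp;lt; returnResult &amp;lt;&amp;lt; std::endl;
    exit(1);
  }

// Snapshot the system-wide counters before and after the measured loop
  SystemCounterState before_sstate = getSystemCounterState();
    for(int i = 0; i&amp;lt;RUNS; ++i)
    { cblas_daxpy(n, alpha, X, incx, Y, incy);
    }

  SystemCounterState after_sstate = getSystemCounterState();

// Bytes read from the memory controllers over all RUNS calls
  datFile &amp;lt;&amp;lt; left &amp;lt;&amp;lt; setw(15) &amp;lt;&amp;lt; getBytesReadFromMC(before_sstate,after_sstate) &amp;lt;&amp;lt; endl;

// Release the counters and the arrays
  m-&amp;gt;cleanup();
  delete[] X;
  delete[] Y;
  return 0;
}
&lt;/PRE&gt;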

&lt;P&gt;This seems to give a good measurement, as my operational intensity is now as expected for daxpy, dgemv and dgemm. So why the post?&lt;/P&gt;
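
&lt;P&gt;As a rough cross-check of what "as expected" means for daxpy, here is a back-of-the-envelope sketch. It assumes 8-byte doubles, unit strides, and vectors too large to stay in cache between runs, so that every call streams X and Y from DRAM. The symbols W (flops), Q (bytes) and I (operational intensity) are introduced here for illustration only. Each element costs one multiply and one add, reads X[i] and Y[i], and writes Y[i] back:&lt;/P&gt;

&lt;PRE&gt;W = 2n \ \text{flops}, \qquad
Q_{\text{read}} = 2 \cdot 8n = 16n \ \text{bytes}, \qquad
Q_{\text{write}} = 8n \ \text{bytes}

I_{\text{read only}} = \frac{W}{Q_{\text{read}}} = \frac{2n}{16n} = \frac{1}{8} \ \text{flop/byte}

I_{\text{read + write}} = \frac{W}{Q_{\text{read}} + Q_{\text{write}}} = \frac{2n}{24n} = \frac{1}{12} \ \text{flop/byte}
&lt;/PRE&gt;

&lt;P&gt;Repeating the call RUNS times scales W and Q by the same factor, so the ratio should not change. Since getBytesReadFromMC reports only the bytes read from the memory controllers, a measurement based on it alone would be compared against the first figure. The second figure also needs the written bytes, which PCM reports through getBytesWrittenToMC. The counters are system-wide, so traffic from other processes is included as well.&lt;/P&gt;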

&lt;P&gt;Looking at the implementation of getBytesReadFromMC, it seems a little "too easy", as I cannot see any mention of the events I think I am measuring. For my Sandy Bridge architecture I believe these are the following:&lt;/P&gt;

&lt;P&gt;UNC_CBO_CACHE_LOOKUP.I, UNC_CBO_CACHE_LOOKUP.ANY_REQUEST_FILTER, UNC_ARB_TRK_REQUEST.EVICTIONS&lt;/P&gt;

&lt;P&gt;I would also like to include the flops measurement in the PCM code, but cannot find a way of doing this either. The event I am trying to measure is FP_COMP_OPS_EXE.SSE_FP_SCALAR_DOUBLE.&lt;/P&gt;

&lt;P&gt;I don't see how to link the events I wish to measure to the PCM counters.&lt;/P&gt;

&lt;P&gt;I hope this post was clear in its message and I look forward to your response,&lt;/P&gt;

&lt;P&gt;Friedrich&lt;/P&gt;</description>
      <pubDate>Wed, 24 Feb 2016 08:24:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/PCM-Memory-traffic-and-Flops/m-p/1085380#M5560</guid>
      <dc:creator>Friedrich_G_</dc:creator>
      <dc:date>2016-02-24T08:24:42Z</dc:date>
    </item>
    <item>
      <title>I don't know how to configure</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/PCM-Memory-traffic-and-Flops/m-p/1085381#M5561</link>
      <description>&lt;P&gt;I don't know how to configure PCM to count alternate events, but for the Sandy Bridge platform you need to be careful with the FLOP-counting events. These events (both FP_COMP_OPS_EXE.* and SIMD_FP_256.*) will overcount significantly if there are memory stalls in loading the data used as inputs for the arithmetic instructions. I discuss this issue as well as a workaround at &lt;A href="https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/564455" target="_blank"&gt;https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/564455&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 24 Feb 2016 15:44:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/PCM-Memory-traffic-and-Flops/m-p/1085381#M5561</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2016-02-24T15:44:50Z</dc:date>
    </item>
  </channel>
</rss>

