<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Modifying stream benchmark to report read bandwidth in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/Modifying-stream-benchmark-to-report-read-bandwidth/m-p/1261261#M7829</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;
&lt;P&gt;I'm using Dr. McCalpin's stream to measure bandwidth on various servers. It works great.&lt;/P&gt;
&lt;P&gt;I know that we can compile it with non-temporal stores and get higher reported bandwidth (since the read-for-ownership isn't done). But trying to explain this to other people who use stream and are not familiar with RFOs, non-temporal vs temporal stores, and actual vs reported bw is hard.&lt;/P&gt;
&lt;P&gt;And I know I can multiply the reported bw by a factor (4/3 for triad) to get the peak 'actual' bw (as in "what linux perf would report as seen by dram"), but this starts making some folks' eyes glaze over.&lt;/P&gt;
&lt;P&gt;Other bw benchmarks (such as Intel's mlc) report pure read bw.&lt;/P&gt;
&lt;P&gt;Stream can be easily modified to do a pure read test (I changed STREAM_TYPE from 'double' to 'int'). And yes, I'm checking perf bw to verify that what stream_read reports actually appears as memory traffic. I also use a much larger array size to ensure I'm not hitting cache (I get a 99.5%+ L3 miss rate).&lt;/P&gt;
&lt;PRE&gt;for (j = 0; j &amp;lt; STREAM_ARRAY_SIZE; j++)
    iaccum += a[j];&lt;/PRE&gt;
&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;Is adding a stream_read subtest something Dr. McCalpin might consider for stream?&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;In general (for the servers I'm profiling anyway) about 10% of the theoretical bw is lost off the top (due to memory refresh, memory scrubbing and other stuff that Dr. Bandwidth knows much better than I).&lt;BR /&gt;Next, my stream_read (and Intel mlc read bw) bandwidth can hit about 87-92% of theoretical bw.&lt;BR /&gt;Non-temporal store stream_triad can hit about the same 87-90% levels.&lt;BR /&gt;Then temporal store stream_triad reports a value of 55%-62% of theoretical bw. The actual bw is 4/3 * reported bw, so actual (as measured by perf dram bw) is about 75%-82% of theoretical.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;It would be just so much easier (for me) if stream reported "Read:" bw in addition to the Triad, etc.&lt;BR /&gt;We report mem bw (via perf) on our cloud servers and we want to make sure users have the correct understanding of where their mem bw usage fits in the %theoretical bw curve.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;I know this might not be the right place for a "mainly stream" question but I wasn't sure how to contact Dr Bandwidth.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;Long time since I've posted here or talked with Dr. McCalpin.&lt;BR /&gt;Patrick Fay&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Thu, 04 Mar 2021 04:15:46 GMT</pubDate>
    <dc:creator>pfay1</dc:creator>
    <dc:date>2021-03-04T04:15:46Z</dc:date>
    <item>
      <title>Modifying stream benchmark to report read bandwidth</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Modifying-stream-benchmark-to-report-read-bandwidth/m-p/1261261#M7829</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;
&lt;P&gt;I'm using Dr. McCalpin's stream to measure bandwidth on various servers. It works great.&lt;/P&gt;
&lt;P&gt;I know that we can compile it with non-temporal stores and get higher reported bandwidth (since the read-for-ownership isn't done). But trying to explain this to other people who use stream and are not familiar with RFOs, non-temporal vs temporal stores, and actual vs reported bw is hard.&lt;/P&gt;
&lt;P&gt;And I know I can multiply the reported bw by a factor (4/3 for triad) to get the peak 'actual' bw (as in "what linux perf would report as seen by dram"), but this starts making some folks' eyes glaze over.&lt;/P&gt;
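&lt;P&gt;For anyone who wants to see where that factor comes from, here is a minimal sketch (not part of stream.c; the function name is made up for illustration). It assumes ordinary temporal stores, where every array that is written is also read once for the RFO:&lt;/P&gt;
&lt;PRE&gt;<![CDATA[
```c
/* Hypothetical helper, not part of stream.c.
 * Ratio of DRAM traffic to the bytes STREAM counts for one kernel,
 * assuming temporal stores (each written cache line is first read: RFO). */
double dram_traffic_factor(int arrays_read, int arrays_written)
{
    /* STREAM counts each array once, whether it is read or written. */
    int counted = arrays_read + arrays_written;
    /* DRAM sees an extra read for every array that is written. */
    int moved = arrays_read + 2 * arrays_written;
    return (double)moved / (double)counted;
}
```
]]>&lt;/PRE&gt;
&lt;P&gt;Triad reads 2 arrays and writes 1, so dram_traffic_factor(2, 1) gives 4/3; Copy and Scale (1 read, 1 written) would give 3/2. With non-temporal stores there is no RFO and the factor is 1.&lt;/P&gt;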
&lt;P&gt;Other bw benchmarks (such as Intel's mlc) report pure read bw.&lt;/P&gt;
&lt;P&gt;Stream can be easily modified to do a pure read test (I changed STREAM_TYPE from 'double' to 'int'). And yes, I'm checking perf bw to verify that what stream_read reports actually appears as memory traffic. I also use a much larger array size to ensure I'm not hitting cache (I get a 99.5%+ L3 miss rate).&lt;/P&gt;
&lt;PRE&gt;for (j = 0; j &amp;lt; STREAM_ARRAY_SIZE; j++)
    iaccum += a[j];&lt;/PRE&gt;
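&lt;P&gt;Wrapped up as a function, the read kernel looks roughly like the sketch below. The function name, the parameter list and the OpenMP reduction clause are assumptions for illustration, not a copy of my modified stream.c. The sum is returned so the loads have a visible use:&lt;/P&gt;
&lt;PRE&gt;<![CDATA[
```c
/* Hypothetical read-only kernel: sums every element of an int array.
 * Built without -fopenmp the pragma is ignored and the loop runs serially. */
long long stream_read_sum(const int *a, long n)
{
    long long iaccum = 0;   /* wide accumulator to avoid int overflow */
    long j;

#pragma omp parallel for reduction(+:iaccum)
    for (j = 0; j < n; j++)
        iaccum += a[j];

    return iaccum;
}
```
]]>&lt;/PRE&gt;
&lt;P&gt;The kernel only loads from the array, so there are no stores and no RFO traffic to account for.&lt;/P&gt;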
&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;Is adding a stream_read subtest something Dr. McCalpin might consider for stream?&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;In general (for the servers I'm profiling anyway) about 10% of the theoretical bw is lost off the top (due to memory refresh, memory scrubbing and other stuff that Dr. Bandwidth knows much better than I).&lt;BR /&gt;Next, my stream_read (and Intel mlc read bw) bandwidth can hit about 87-92% of theoretical bw.&lt;BR /&gt;Non-temporal store stream_triad can hit about the same 87-90% levels.&lt;BR /&gt;Then temporal store stream_triad reports a value of 55%-62% of theoretical bw. The actual bw is 4/3 * reported bw, so actual (as measured by perf dram bw) is about 75%-82% of theoretical.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;It would be just so much easier (for me) if stream reported "Read:" bw in addition to the Triad, etc.&lt;BR /&gt;We report mem bw (via perf) on our cloud servers and we want to make sure users have the correct understanding of where their mem bw usage fits in the %theoretical bw curve.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;I know this might not be the right place for a "mainly stream" question but I wasn't sure how to contact Dr Bandwidth.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;Long time since I've posted here or talked with Dr. McCalpin.&lt;BR /&gt;Patrick Fay&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 04 Mar 2021 04:15:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Modifying-stream-benchmark-to-report-read-bandwidth/m-p/1261261#M7829</guid>
      <dc:creator>pfay1</dc:creator>
      <dc:date>2021-03-04T04:15:46Z</dc:date>
    </item>
    <item>
      <title>Re: Modifying stream benchmark to report read bandwidth</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Modifying-stream-benchmark-to-report-read-bandwidth/m-p/1261658#M7830</link>
      <description>&lt;P&gt;Here is a sample stream_read benchmark. It just adds a Read subtest to the other 4 subtests.&lt;BR /&gt;The read subtest reads just 1 double value from each cacheline (so every cacheline gets loaded from memory, but this is a departure from the other subtests, which process all the doubles in a cacheline).&lt;BR /&gt;I compiled it with the gcc command below (16 GB per array, with OpenMP):&lt;/P&gt;
&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;&lt;FONT face="courier new,courier"&gt;gcc stream_read.c -O3 -march=native -fno-builtin -DSTREAM_ARRAY_SIZE=2147483648 -mcmodel=medium -DNTIMES=20 -DOFFSET=0 -DSTREAM_TYPE=double -fopenmp -o stream_read.x&lt;/FONT&gt;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;Run it with:&lt;BR /&gt;&lt;/SPAN&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN class="s1"&gt;export OMP_NUM_THREADS=64&lt;BR /&gt;&lt;/SPAN&gt;&lt;SPAN class="s1"&gt;export GOMP_CPU_AFFINITY=0-63:1&lt;BR /&gt;./stream_read.x&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;
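&lt;P&gt;The Read subtest described above can be sketched as follows. This is an illustration rather than the attached source: the function names are made up, a 64-byte cacheline is assumed, and the rate helper assumes the whole array is counted as bytes moved even though only one double per line is touched:&lt;/P&gt;
&lt;PRE&gt;<![CDATA[
```c
/* Hypothetical kernel: touch one double per cache line so every line is
 * loaded from memory. Assumes 64-byte cache lines. */
double read_one_per_line(const double *a, long n)
{
    const long stride = 64 / (long)sizeof(double);  /* doubles per line */
    double sum = 0.0;
    long j;

#pragma omp parallel for reduction(+:sum)
    for (j = 0; j < n; j += stride)
        sum += a[j];

    return sum;
}

/* STREAM-style rate in MB/s (10^6 bytes): whole-array bytes over time. */
double read_rate_mbps(long long n, double seconds)
{
    return 1.0e-6 * (double)sizeof(double) * (double)n / seconds;
}
```
]]>&lt;/PRE&gt;
&lt;P&gt;With STREAM_ARRAY_SIZE=2147483648 and a best time of about 0.095 s, read_rate_mbps comes out at roughly 180,000 MB/s.&lt;/P&gt;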
&lt;P&gt;It produces output like the following (the "Read:" line shows the read bandwidth):&lt;/P&gt;
&lt;PRE&gt;Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:          108351.0     0.317234     0.317115     0.317402
Scale:         108182.5     0.317923     0.317609     0.318356
Add:           121554.2     0.424257     0.424005     0.424597
Triad:         121882.0     0.423358     0.422865     0.423755
Read:          180597.7     0.095204     0.095128     0.095340&lt;/PRE&gt;</description>
      <pubDate>Fri, 05 Mar 2021 04:48:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Modifying-stream-benchmark-to-report-read-bandwidth/m-p/1261658#M7830</guid>
      <dc:creator>pfay1</dc:creator>
      <dc:date>2021-03-05T04:48:56Z</dc:date>
    </item>
    <item>
      <title>Re: Modifying stream benchmark to report read bandwidth</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Modifying-stream-benchmark-to-report-read-bandwidth/m-p/1261929#M7832</link>
      <description>&lt;P&gt;I certainly have a number of versions of STREAM that include read-only kernels (typically using either 1 array for DSUM or 2 arrays for DDOT). They have never migrated into the standard version of STREAM because lots of compilers have trouble optimizing the sum reductions. This leads to lower performance than expected, and raises as many questions as it answers.&lt;/P&gt;
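&lt;P&gt;For concreteness, the two read-only kernels look roughly like the sketch below. This is illustrative only and not code from any STREAM release; the function names and the OpenMP reduction clauses are assumptions:&lt;/P&gt;
&lt;PRE&gt;<![CDATA[
```c
/* Hypothetical DSUM: read-only sum over one array. */
double dsum(const double *a, long n)
{
    double sum = 0.0;
    long j;

#pragma omp parallel for reduction(+:sum)
    for (j = 0; j < n; j++)
        sum += a[j];

    return sum;
}

/* Hypothetical DDOT: read-only dot product over two arrays. */
double ddot(const double *a, const double *b, long n)
{
    double sum = 0.0;
    long j;

#pragma omp parallel for reduction(+:sum)
    for (j = 0; j < n; j++)
        sum += a[j] * b[j];

    return sum;
}
```
]]>&lt;/PRE&gt;
&lt;P&gt;Both loops carry a dependence on the accumulator, which is the part compilers have to restructure (multiple partial sums) to reach full bandwidth.&lt;/P&gt;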
&lt;P&gt;It would be easier to add a DAXPY kernel -- like Triad, but updating one of the two input arrays. This would get rid of the write-allocate traffic and perhaps make it clearer to folks what is going on.&lt;/P&gt;</description>
      <pubDate>Fri, 05 Mar 2021 22:56:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Modifying-stream-benchmark-to-report-read-bandwidth/m-p/1261929#M7832</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2021-03-05T22:56:29Z</dc:date>
    </item>
    <item>
      <title>Re: Modifying stream benchmark to report read bandwidth</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Modifying-stream-benchmark-to-report-read-bandwidth/m-p/1261973#M7833</link>
      <description>&lt;P&gt;Thanks John,&lt;BR /&gt;I'd love to see a dsum, daxpy or ddot that does pure read bandwidth.&lt;BR /&gt;That would be more in keeping with your other kernels (which do real-world-type work on the whole arrays).&lt;BR /&gt;I've attached a new version of stream_read.&lt;BR /&gt;The file stream_read_v03.c aliases the double array as an integer array and sums all the integer elements.&lt;BR /&gt;(I had a stream_read_v02.c attached but I've replaced that with stream_read_v03.c, which adds a reduction() clause to the OpenMP parallel for.)&lt;BR /&gt;So the integer operations are not in keeping with the spirit of your floating point kernels, but it does process the whole array.&lt;BR /&gt;And it scales well across 32/48/80/96 cpu systems.&lt;BR /&gt;This version also prints the "based on a variant of the STREAM benchmark code" message as your license requires.&lt;BR /&gt;Pat&lt;/P&gt;</description>
      <pubDate>Sat, 06 Mar 2021 07:56:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Modifying-stream-benchmark-to-report-read-bandwidth/m-p/1261973#M7833</guid>
      <dc:creator>pfay1</dc:creator>
      <dc:date>2021-03-06T07:56:47Z</dc:date>
    </item>
    <item>
      <title>Re: Modifying stream benchmark to report read bandwidth</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/Modifying-stream-benchmark-to-report-read-bandwidth/m-p/1270102#M7852</link>
      <description>&lt;P&gt;I've posted the stream_read.c and a run script to github:&amp;nbsp;&lt;A href="https://github.com/patinnc/stream_read" target="_blank"&gt;https://github.com/patinnc/stream_read&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 01 Apr 2021 21:18:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/Modifying-stream-benchmark-to-report-read-bandwidth/m-p/1270102#M7852</guid>
      <dc:creator>pfay1</dc:creator>
      <dc:date>2021-04-01T21:18:30Z</dc:date>
    </item>
  </channel>
</rss>

