<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Need help: Memory bound analysis in Analyzers</title>
    <link>https://community.intel.com/t5/Analyzers/Need-help-Memory-bound-analysis/m-p/1643867#M25393</link>
    <description>&lt;P&gt;Thanks, yuzhang.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I tried compiling with "-march=native", but it made no difference.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Actually, this hot line accounts for only a small part of the total run time (5.6 seconds out of 186.2 seconds of total CPU time). I am just curious whether it is possible to &lt;SPAN&gt;eliminate&lt;/SPAN&gt; the memory read latency of this line.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;After reducing the struct size to 32 bytes by removing optional data members (which also addresses potential false sharing), the line still shows a high memory bound (address&amp;nbsp;&lt;STRONG&gt;0xa08a3&lt;/STRONG&gt;):&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="未命名.png" style="width: 766px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/60433iE8E161955A99FC3C/image-size/large/is-moderation-mode/true?v=v2&amp;amp;px=999&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="未命名.png" alt="未命名.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
    <pubDate>Mon, 18 Nov 2024 07:01:14 GMT</pubDate>
    <dc:creator>roderickHuang</dc:creator>
    <dc:date>2024-11-18T07:01:14Z</dc:date>
    <item>
      <title>Need help: Memory bound analysis</title>
      <link>https://community.intel.com/t5/Analyzers/Need-help-Memory-bound-analysis/m-p/1642939#M25382</link>
      <description>&lt;P&gt;I've developed a railway detection data playback program and am now analyzing its performance to resolve bottlenecks.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I'm trying to resolve a &lt;STRONG&gt;memory bound&lt;/STRONG&gt; that shows up when traversing the data points, which are stored in an array. The data points are defined as shown below:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="cpp"&gt;struct sample_data_t {
  int32_t cid_{};                 //!&amp;lt; sample data's Channel's ID.
  point_double_t mapped_pos_{};   //!&amp;lt; sample data's mapped position.
  point_double_t disp_pos_{};     //!&amp;lt; sample data's display position.
  point_double_t sample_data_{};  //!&amp;lt; sample data's raw data.
  int32_t dbg_blk_id_{-1}, sid_{};
};

struct sample_dataset_t {
  std::vector&amp;lt;sample_data_t&amp;gt; ds_;  //!&amp;lt; sample dataset's sample data vector.
};&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The traversal code is listed below (cur_ds_slice_ is a sub-range of std::vector&amp;lt;sample_data_t&amp;gt;):&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="cpp"&gt;// translate mapped-xy to display-xy
  using ch_data_t = std::vector&amp;lt;test::gl_painter::point_double_t&amp;gt;;
  std::vector&amp;lt;ch_data_t&amp;gt; disp_data_set(data_paints_.size());
  {
    std::vector&amp;lt;size_t&amp;gt; ch_data_num(data_paints_.size(), 0);
    for (auto&amp;amp; sdata : cur_ds_slice_) {  // std::count_if
      ch_data_num[sdata.cid_]++;
    }
    for (size_t i = 0; i &amp;lt; data_paints_.size(); i++) {
      disp_data_set[i].reserve(ch_data_num[i]);
    }
  }&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The measured memory bound of the traversal loop appears to be 90.1% (&lt;STRONG&gt;the assembly at address 0x9ebf3&lt;/STRONG&gt;):&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="perf_screenshot.png" style="width: 999px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/60288i50A7706A0EE10850/image-size/large/is-moderation-mode/true?v=v2&amp;amp;px=999&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="perf_screenshot.png" alt="perf_screenshot.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I'm confused about two things:&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp;1. Why is the memory bound so high when the traversal is sequential?&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp;2. Why is the loop not unrolled into SIMD instructions?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Note 1: sizeof(sample_data_t) = 0x40&lt;/P&gt;&lt;P&gt;Note 2: The test CPU is an&amp;nbsp;Intel(R) Core(TM) i5-8260U&lt;/P&gt;&lt;P&gt;Note 3: The program is compiled in Release mode, with the additional options listed below:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="bash"&gt;if("${CMAKE_CXX_COMPILER_ID}" STREQUAL "GNU")
  target_link_libraries(${target_name} PUBLIC pthread dl)

  if(ENABLE_PROFILER)
    target_compile_options(${target_name} PRIVATE -fno-omit-frame-pointer -g)
    target_link_options(${target_name} PRIVATE -fno-omit-frame-pointer)
  endif()
endif()&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 14 Nov 2024 00:14:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Analyzers/Need-help-Memory-bound-analysis/m-p/1642939#M25382</guid>
      <dc:creator>roderickHuang</dc:creator>
      <dc:date>2024-11-14T00:14:22Z</dc:date>
    </item>
    <item>
      <title>Re: Need help: Memory bound analysis</title>
      <link>https://community.intel.com/t5/Analyzers/Need-help-Memory-bound-analysis/m-p/1643188#M25384</link>
      <description>&lt;P&gt;If you are using GCC, you can try the -march option to help vectorization, for example&amp;nbsp;-march=native, or you can use the &lt;SPAN&gt;vectorization report option to check why the current code is not vectorized.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
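&lt;P&gt;As a rough sketch of how these options could be passed from CMake, assuming GCC, a Release build, and the same ${target_name} target as in the original post: -fopt-info-vec-missed is the GCC flag that reports loops the vectorizer gave up on, together with the reason for each.&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;if("${CMAKE_CXX_COMPILER_ID}" STREQUAL "GNU")
  # Assumption: ${target_name} is the target from the original post.
  # -march=native: allow the instruction set extensions of the build machine.
  # -fopt-info-vec-missed: print the loops GCC could not vectorize, and why.
  target_compile_options(${target_name} PRIVATE
    -march=native -fopt-info-vec-missed)
endif()&lt;/LI-CODE&gt;
&lt;P&gt;The report is written to the compiler's standard error during the build, so it should appear in the normal build log.&lt;/P&gt;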
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 14 Nov 2024 09:21:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Analyzers/Need-help-Memory-bound-analysis/m-p/1643188#M25384</guid>
      <dc:creator>yuzhang3_intel</dc:creator>
      <dc:date>2024-11-14T09:21:42Z</dc:date>
    </item>
    <item>
      <title>Re: Need help: Memory bound analysis</title>
      <link>https://community.intel.com/t5/Analyzers/Need-help-Memory-bound-analysis/m-p/1643867#M25393</link>
      <description>&lt;P&gt;Thanks, yuzhang.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I tried compiling with "-march=native", but it made no difference.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Actually, this hot line accounts for only a small part of the total run time (5.6 seconds out of 186.2 seconds of total CPU time). I am just curious whether it is possible to &lt;SPAN&gt;eliminate&lt;/SPAN&gt; the memory read latency of this line.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;After reducing the struct size to 32 bytes by removing optional data members (which also addresses potential false sharing), the line still shows a high memory bound (address&amp;nbsp;&lt;STRONG&gt;0xa08a3&lt;/STRONG&gt;):&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="未命名.png" style="width: 766px;"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/60433iE8E161955A99FC3C/image-size/large/is-moderation-mode/true?v=v2&amp;amp;px=999&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="未命名.png" alt="未命名.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 18 Nov 2024 07:01:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/Analyzers/Need-help-Memory-bound-analysis/m-p/1643867#M25393</guid>
      <dc:creator>roderickHuang</dc:creator>
      <dc:date>2024-11-18T07:01:14Z</dc:date>
    </item>
    <item>
      <title>Re: Need help: Memory bound analysis</title>
      <link>https://community.intel.com/t5/Analyzers/Need-help-Memory-bound-analysis/m-p/1643899#M25394</link>
      <description>&lt;P&gt;I think you need to review your source code to see whether there is an opportunity to vectorize the add operations. In theory, add operations in a loop can be vectorized when there are no dependencies between iterations (for example, when the compiler can assume no aliasing, as with the Intel compiler's -fno-alias option). You can also try using &lt;STRONG&gt;__builtin_prefetch() &lt;/STRONG&gt;to prefetch data into the cache in advance and reduce memory access latency.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 18 Nov 2024 09:19:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Analyzers/Need-help-Memory-bound-analysis/m-p/1643899#M25394</guid>
      <dc:creator>yuzhang3_intel</dc:creator>
      <dc:date>2024-11-18T09:19:27Z</dc:date>
    </item>
  </channel>
</rss>

