I've developed a railway-detection data playback program and am now analyzing its performance to resolve bottlenecks.
I'm trying to resolve a memory-bound issue when traversing data points stored in an array; the data points are defined as shown below:
struct sample_data_t {
    int32_t cid_{};                //!< sample data's channel ID.
    point_double_t mapped_pos_{};  //!< sample data's mapped position.
    point_double_t disp_pos_{};    //!< sample data's display position.
    point_double_t sample_data_{}; //!< sample data's raw data.
    int32_t dbg_blk_id_{-1}, sid_{};
};

struct sample_dataset_t {
    std::vector<sample_data_t> ds_; //!< sample dataset's sample data vector.
};
The traversal code is listed below (cur_ds_slice_ is a sub-range of a std::vector<sample_data_t>):
// translate mapped-xy to display-xy
using ch_data_t = std::vector<test::gl_painter::point_double_t>;
std::vector<ch_data_t> disp_data_set(data_paints_.size());
{
    std::vector<size_t> ch_data_num(data_paints_.size(), 0);
    for (auto& sdata : cur_ds_slice_) { // std::count_if
        ch_data_num[sdata.cid_]++;
    }
    for (size_t i = 0; i < data_paints_.size(); i++) {
        disp_data_set[i].reserve(ch_data_num[i]);
    }
}
The measured memory bound for the traversal loop is 90.1% (assembly at address 0x9ebf3).
I'm confused about two things:
1. Why is the memory bound so high when the traversal is sequential?
2. Why is the loop not unrolled into SIMD instructions?
Note1: sizeof(sample_data_t) = 0x40
Note2: The test CPU is Intel(R) Core(TM) i5-8260U
Note3: The program is compiled as Release with the additional options listed below:
if("${CMAKE_CXX_COMPILER_ID}" STREQUAL "GNU")
    target_link_libraries(${target_name} PUBLIC pthread dl)
    if(ENABLE_PROFILER)
        target_compile_options(${target_name} PRIVATE -fno-omit-frame-pointer -g)
        target_link_options(${target_name} PRIVATE -fno-omit-frame-pointer)
    endif()
endif()
If you use GCC, you can try the -march option for vectorization (e.g. -march=native), or use the vectorization report options to check why the current code is not vectorized.
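For reference, GCC's vectorization report can be requested with -fopt-info-vec (GCC 5+). A self-contained sketch, using a throwaway placeholder file (traverse.cpp and its contents are invented for illustration, not the poster's actual code):

```shell
# Create a tiny translation unit with an obviously vectorizable loop.
cat > traverse.cpp <<'EOF'
#include <cstddef>
void scale(float* a, const float* b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) a[i] = b[i] * 2.0f;
}
EOF
# Report both vectorized and missed loops; the report goes to stderr.
g++ -O3 -march=native -fopt-info-vec-all -c traverse.cpp 2> vec_report.txt
# Inspect the report for "vectorized" / "missed" lines.
cat vec_report.txt
```

-fopt-info-vec-missed alone is often the most useful variant, since it explains why a given loop was skipped (e.g. unsupported data access pattern, possible aliasing).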
Thanks yuzhang,
I tried compiling with "-march=native"; there was no difference.
Actually, this hot line accounts for little of the total run time (5.6 seconds out of 186.2 seconds of CPU time). I'm just curious whether the memory read latency of this line can be eliminated.
After reducing the struct to 32 bytes by removing optional data members (which also helps with potential false sharing), it still shows a high memory bound (address 0xa08a3).
I think you need to review your source code to see whether there is an opportunity to vectorize the add operations. In theory, the add operations in a loop can be vectorized when there are no dependencies (e.g. compiling with -fno-alias). You can also try __builtin_prefetch() to prefetch data into the cache ahead of time and reduce memory access latency.