topic Re: Need help: Memory bound analysis in Analyzers

Need help: Memory bound analysis

roderickHuang — Thu, 14 Nov 2024 00:14:22 GMT

I've developed a playback program for railway detection data, and I'm now analyzing its performance and trying to resolve the bottlenecks.

I'm trying to resolve a memory-bound hotspot that shows up when traversing data points stored in an array. The data points are defined as shown below:

struct sample_data_t {
    int32_t cid_{};                 //!< sample data's Channel's ID.
    point_double_t mapped_pos_{};   //!< sample data's mapped position.
    point_double_t disp_pos_{};     //!< sample data's display position.
    point_double_t sample_data_{};  //!< sample data's raw data.
    int32_t dbg_blk_id_{-1}, sid_{};
};

struct sample_dataset_t {
    std::vector<sample_data_t> ds_; //!< sample dataset's sample data vector.
};

The traversal code is listed below (cur_ds_slice_ is a sub-range of a std::vector<sample_data_t>):

// translate mapped-xy to display-xy
using ch_data_t = std::vector<test::gl_painter::point_double_t>;
std::vector<ch_data_t> disp_data_set(data_paints_.size());
{
    std::vector<size_t> ch_data_num(data_paints_.size(), 0);
    for (auto& sdata : cur_ds_slice_) { // std::count_if
        ch_data_num[sdata.cid_]++;
    }
    for (size_t i = 0; i < data_paints_.size(); i++) {
        disp_data_set[i].reserve(ch_data_num[i]);
    }
}

The measured Memory Bound value for the traversal loop appears to be 90.1% (the assembly at address 0x9ebf3):

I'm confused about two things:

1. Why is the Memory Bound value so high when the traversal is sequential?

2. Why is the loop not unrolled and vectorized with SIMD instructions?

Note1: sizeof(sample_data_t) = 0x40

Note2: The test CPU is Intel(R) Core(TM) i5-8260U

Note3: The program is compiled in Release mode, with the additional options listed below:

if("${CMAKE_CXX_COMPILER_ID}" STREQUAL "GNU")
    target_link_libraries(${target_name} PUBLIC pthread dl)
    if(ENABLE_PROFILER)
        target_compile_options(${target_name} PRIVATE -fno-omit-frame-pointer -g)
        target_link_options(${target_name} PRIVATE -fno-omit-frame-pointer)
    endif()
endif()
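For reference, the counting pass from the post can be reproduced in isolation roughly as follows. This is only a sketch: point_double_t is not shown in the post and is assumed here to be a pair of doubles (which gives the 0x40-byte size from Note1 on x86-64), and count_per_channel is a hypothetical helper name, not part of the original program. With this layout each element occupies 64 bytes, the size of one cache line on this CPU, while the loop reads only the 4-byte cid_, so most of the bytes fetched from memory may go unused.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: point_double_t is a plain pair of doubles (not shown in the post).
struct point_double_t {
    double x{};
    double y{};
};

struct sample_data_t {
    int32_t cid_{};                 // channel ID, the only field the loop reads
    point_double_t mapped_pos_{};
    point_double_t disp_pos_{};
    point_double_t sample_data_{};
    int32_t dbg_blk_id_{-1}, sid_{};
};

// Holds on x86-64 with the assumed point_double_t; matches Note1.
static_assert(sizeof(sample_data_t) == 0x40, "expected a 64-byte element");

// Hypothetical helper: counts how many samples belong to each channel.
std::vector<std::size_t> count_per_channel(const std::vector<sample_data_t>& slice,
                                           std::size_t channel_count) {
    std::vector<std::size_t> counts(channel_count, 0);
    for (const auto& s : slice) {
        // The index depends on loaded data, so neighbouring iterations may
        // increment the same counter.
        assert(s.cid_ >= 0 && static_cast<std::size_t>(s.cid_) < channel_count);
        ++counts[static_cast<std::size_t>(s.cid_)];
    }
    return counts;
}
```

The resulting counts can then be passed to reserve() for each channel, as in the original snippet.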

Re: Need help: Memory bound analysis

yuzhang3_intel — Thu, 14 Nov 2024 09:21:42 GMT

If you use GCC, you can try the -march option to enable vectorization for your CPU, for example -march=native, or you can use a vectorization report option such as -fopt-info-vec-missed to check why the current code is not vectorized.

Re: Need help: Memory bound analysis

roderickHuang — Mon, 18 Nov 2024 07:01:14 GMT

Thanks yuzhang,

I tried compiling with "-march=native". It made no difference.

Actually, this hot line accounts for only a small part of the total run time (5.6 seconds out of 186.2 seconds of total CPU time). I'm just curious whether it is possible to eliminate the memory read latency of this line.

After reducing the struct size to 32 bytes by removing optional data members (also with potential false sharing in mind), it still shows a high Memory Bound value (address 0xa08a3):
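One direction that might cut the memory traffic further than shrinking the struct is to keep the channel IDs in their own contiguous array, so that the counting pass reads 4 bytes per sample instead of a whole record. This is a different layout from the one used in the post (a hot/cold, structure-of-arrays style split), shown only as a sketch. The names sample_payload_t, sample_columns_t, push_sample and count_channel_ids are hypothetical, and point_double_t is assumed to be a pair of doubles.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: point_double_t is a plain pair of doubles (not shown in the post).
struct point_double_t {
    double x{};
    double y{};
};

// Hypothetical "cold" part of a sample: everything the counting pass ignores.
struct sample_payload_t {
    point_double_t mapped_pos_{};
    point_double_t disp_pos_{};
    point_double_t sample_data_{};
};

// Hypothetical container: channel IDs and payloads share the same index.
struct sample_columns_t {
    std::vector<int32_t> cid_;               // hot: read by the counting pass
    std::vector<sample_payload_t> payload_;  // cold: read only when drawing

    void push_sample(int32_t cid, const sample_payload_t& p) {
        cid_.push_back(cid);
        payload_.push_back(p);
    }
};

// Counts samples per channel while touching only the dense ID array.
std::vector<std::size_t> count_channel_ids(const std::vector<int32_t>& cids,
                                           std::size_t channel_count) {
    std::vector<std::size_t> counts(channel_count, 0);
    for (const int32_t cid : cids) {
        assert(cid >= 0 && static_cast<std::size_t>(cid) < channel_count);
        ++counts[static_cast<std::size_t>(cid)];
    }
    return counts;
}
```

With 4-byte IDs, sixteen samples fit in one 64-byte cache line. Whether this helps in practice depends on how the rest of the program accesses the data, so it would need to be measured.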

Re: Need help: Memory bound analysis

yuzhang3_intel — Mon, 18 Nov 2024 09:19:27 GMT

I think you need to review your source code to see whether there is an opportunity to vectorize the add operations. In theory, add operations in a loop can be vectorized when there are no dependencies between iterations (for example, with the Intel compiler's -fno-alias option, which asserts that pointers do not alias). You can also try using __builtin_prefetch() to prefetch data into the cache in advance to reduce memory access latency.
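A prefetch along those lines might look like the sketch below. It assumes GCC or Clang (for __builtin_prefetch). The prefetch distance kPrefetchAhead, the helper name count_with_prefetch, and the layout of point_double_t are illustrative assumptions, not values from this thread. For a purely sequential walk the hardware prefetcher usually already covers this access pattern, so any gain would have to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: point_double_t is a plain pair of doubles (not shown in the post).
struct point_double_t {
    double x{};
    double y{};
};

struct sample_data_t {
    int32_t cid_{};
    point_double_t mapped_pos_{};
    point_double_t disp_pos_{};
    point_double_t sample_data_{};
    int32_t dbg_blk_id_{-1}, sid_{};
};

// Assumed tuning knob: how many elements ahead of the current one to prefetch.
constexpr std::size_t kPrefetchAhead = 8;

// Hypothetical helper: per-channel counting with an explicit software prefetch.
std::vector<std::size_t> count_with_prefetch(const std::vector<sample_data_t>& slice,
                                             std::size_t channel_count) {
    std::vector<std::size_t> counts(channel_count, 0);
    const std::size_t n = slice.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (i + kPrefetchAhead < n) {
            // Second argument 0 = prefetch for reading;
            // third argument 0 = no temporal locality expected.
            __builtin_prefetch(&slice[i + kPrefetchAhead], 0, 0);
        }
        const int32_t cid = slice[i].cid_;
        assert(cid >= 0 && static_cast<std::size_t>(cid) < channel_count);
        ++counts[static_cast<std::size_t>(cid)];
    }
    return counts;
}
```

The function returns the same counts as the plain loop. Only the timing of the memory requests is intended to change.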