I've developed a railway-detection data playback program and am now analyzing its performance to resolve bottlenecks.
I'm trying to resolve a memory-bound issue when traversing data points stored in an array; the data points are defined as shown below:
struct sample_data_t {
    int32_t cid_{};                //!< sample data's channel ID.
    point_double_t mapped_pos_{};  //!< sample data's mapped position.
    point_double_t disp_pos_{};    //!< sample data's display position.
    point_double_t sample_data_{}; //!< sample data's raw data.
    int32_t dbg_blk_id_{-1}, sid_{};
};

struct sample_dataset_t {
    std::vector<sample_data_t> ds_; //!< sample dataset's sample data vector.
};
The traversal code is listed below (cur_ds_slice_ is a sub-range of a std::vector<sample_data_t>):
// translate mapped-xy to display-xy
using ch_data_t = std::vector<test::gl_painter::point_double_t>;
std::vector<ch_data_t> disp_data_set(data_paints_.size());
{
    std::vector<size_t> ch_data_num(data_paints_.size(), 0);
    for (auto& sdata : cur_ds_slice_) { // std::count_if
        ch_data_num[sdata.cid_]++;
    }
    for (size_t i = 0; i < data_paints_.size(); i++) {
        disp_data_set[i].reserve(ch_data_num[i]);
    }
}
The measured memory bound for the traversal loop is 90.1% (assembly at address 0x9ebf3).
I'm confused about two things:
1. Why is the memory bound so high when the traversal is sequential?
2. Why is the loop not unrolled into SIMD instructions?
Note1: sizeof(sample_data_t) = 0x40
Note2: The test CPU is Intel(R) Core(TM) i5-8260U
Note3: The program is compiled as Release with the additional options listed below:
if("${CMAKE_CXX_COMPILER_ID}" STREQUAL "GNU")
    target_link_libraries(${target_name} PUBLIC pthread dl)
    if(ENABLE_PROFILER)
        target_compile_options(${target_name} PRIVATE -fno-omit-frame-pointer -g)
        target_link_options(${target_name} PRIVATE -fno-omit-frame-pointer)
    endif()
endif()
If you use GCC, you can try the -march option for vectorization (e.g. -march=native), or use the vectorization report options to check why the current code is not vectorized.
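For reference, GCC's vectorization report can be requested with -fopt-info-vec (GCC 5+). A self-contained sketch, using a throwaway placeholder file (traverse.cpp and its contents are invented for illustration, not the poster's actual code):

```shell
# Create a tiny translation unit with an obviously vectorizable loop.
cat > traverse.cpp <<'EOF'
#include <cstddef>
void scale(float* a, const float* b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) a[i] = b[i] * 2.0f;
}
EOF
# Report both vectorized and missed loops; the report goes to stderr.
g++ -O3 -march=native -fopt-info-vec-all -c traverse.cpp 2> vec_report.txt
# Inspect the report for "vectorized" / "missed" lines.
cat vec_report.txt
```

-fopt-info-vec-missed alone is often the most useful variant, since it explains why a given loop was skipped (e.g. unsupported data access pattern, possible aliasing).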
Thanks yuzhang,
I tried compiling with "-march=native"; there was no difference.
Actually, this hot line accounts for little of the total run time (5.6 seconds out of 186.2 seconds of CPU time). I'm just curious whether the memory read latency of this line can be eliminated.
After reducing the struct to 32 bytes by removing optional data members (which also helps with potential false sharing), it still shows a high memory bound (address 0xa08a3).
I think you need to review your source code to see whether there is an opportunity to vectorize the add operations. In theory, the add operations in a loop can be vectorized when there are no dependencies (e.g. compiling with -fno-alias). You can also try __builtin_prefetch() to prefetch data into the cache ahead of time and reduce memory access latency.