I posted an issue about unstable performance on the Intel Community.
VTune shows that the pipeline is Back-End Bound in the runs that randomly perform poorly.
Can VTune help me find the cause of this behavior?
If so, what should I do next in VTune?
Regarding line 22 of the test program ippsAddProduct_32fc.c from the original post, I believe it would be beneficial to add the following alignment directive for GCC 7:
__attribute__ ((aligned (64))) Ipp32fc des[32] = {0};
Since this destination accumulator vector is used for both reading (load) and writing (store) operations, alignment-related issues could potentially occur at runtime if the alignment is not explicitly specified.
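To confirm that the directive took effect, the address of the array can be checked at run time. The following is only a minimal sketch: `Cplx32f` is a hypothetical stand-in for `Ipp32fc` so that it builds without the IPP headers, and `is_aligned` and `accumulator` are illustrative helper names, not part of IPP or of the original test program.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* size_t */
#include <stdint.h>   /* uintptr_t */

/* Hypothetical stand-in for Ipp32fc (assumption: two 32-bit floats, re and im). */
typedef struct {
    float re;
    float im;
} Cplx32f;

static_assert(sizeof(Cplx32f) == 8, "expected two packed 32-bit floats");

#define CACHE_LINE 64  /* assumed cache-line size in bytes */

/* GCC attribute syntax as in the post: the array starts on a 64-byte boundary. */
static __attribute__((aligned(CACHE_LINE))) Cplx32f des[32] = {0};

/* Returns 1 if p is a multiple of alignment; alignment must be a power of two. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* Hands out the accumulator so a caller can inspect where it was placed. */
Cplx32f *accumulator(void)
{
    return des;
}
```

Calling `is_aligned(accumulator(), 64)` at the start of the test program and printing the result on every run would show whether the slow runs line up with an unaligned buffer when the attribute is removed.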
Thanks for your help.
The performance is stable with the addition of the alignment directive.
Why does the alignment directive have such a big impact on performance?
If a store and a load to the same data are close enough together in the pipeline, the CPU can forward the data straight from the store buffer to the load instead of making the load wait until the store has been written to cache (store forwarding). This can save several CPU cycles. There are different issues with alignment, but basically store forwarding may not work if the data isn't aligned to the right memory address. The array might end up aligned without the attribute, but it might not, and I think that's why you see different performance with the same code. The attribute guarantees that it will be aligned every time.
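One concrete way the placement of the array can change from run to run is whether vector accesses straddle a 64-byte cache line. The sketch below only does the address arithmetic; it does not model store forwarding itself. The 32-byte access width is an assumption (I don't know which instructions the IPP function actually uses), and both function names are made up for illustration.

```c
#include <assert.h>   /* assert */
#include <stddef.h>   /* size_t */
#include <stdint.h>   /* uintptr_t */

#define CACHE_LINE 64  /* assumed cache-line size in bytes */

/* Returns 1 if an access of `size` bytes starting at `addr` touches two cache lines. */
int crosses_cache_line(uintptr_t addr, size_t size)
{
    assert(size > 0 && size <= CACHE_LINE);
    uintptr_t first_line = addr / CACHE_LINE;
    uintptr_t last_line = (addr + size - 1) / CACHE_LINE;
    return first_line != last_line;
}

/* Counts how many back-to-back `size`-byte accesses over a `len`-byte buffer
 * starting at `base` straddle a cache-line boundary. */
size_t count_split_accesses(uintptr_t base, size_t len, size_t size)
{
    size_t splits = 0;
    for (size_t off = 0; off + size <= len; off += size)
        splits += (size_t)crosses_cache_line(base + off, size);
    return splits;
}
```

For a 256-byte buffer such as `des[32]` read in 32-byte pieces, `count_split_accesses(0, 256, 32)` gives 0, while a start address that is only 16-byte aligned, `count_split_accesses(16, 256, 32)`, gives 4. So the same code can have no split accesses or several of them, depending on where the array happens to land.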
As for VTune, you can go to the Bottom-up tab and see the list of functions along with the TMA metrics for each function. If you find one with a high Loads Blocked by Store Forwarding value, you can double-click the function to bring up the source view. Depending on certain configurations, you can see the line of code affected by the stall, and then check whether the variable is misaligned or whether the store is smaller than the load that reads it back.