How to use Intel VTune to analyze the cause of unstable performance

Xiaoqiang
New Contributor I
2,534 Views

I previously posted about unstable performance on the Intel Community:

Intel IPP library performance is unstable, 2x performance difference between 2 runs. - Intel Community

VTune did show that the pipeline was Back-End Bound in the runs that randomly performed poorly.

Can VTune help me find the root cause of this behavior?

If so, what should I do next in VTune?

[Attached screenshot: Xiaoqiang_0-1755249438609.png]

0 Kudos
4 Replies
e87tn95h
New Contributor I
2,293 Views

Regarding line 22 of the test program ippsAddProduct_32fc.c from the original post, I believe it would be beneficial to add the following alignment directive for GCC 7:

__attribute__ ((aligned (64))) Ipp32fc des[32] = {0};

Since this destination accumulator vector is used for both reading (load) and writing (store) operations, alignment-related issues could potentially occur at runtime if the alignment is not explicitly specified.

Xiaoqiang
New Contributor I
2,194 Views

Thanks for your help.

The performance is stable with the addition of the alignment directive.

Why does the alignment directive have such a big impact on performance?

0 Kudos
optimizergal
Novice
2,161 Views

If a store and a load to the same data are close enough together in the pipeline, the CPU can forward the data directly from the store buffer to the load instead of making the load wait for the store to reach the cache (store forwarding). This can save several CPU cycles. There are several distinct alignment issues, but basically store forwarding may not work if the data isn't suitably aligned, for example when an access straddles a cache-line boundary. Without the attribute, the array might happen to end up aligned, or it might not, and I think that's why you see different performance with the same code. The attribute guarantees that it will be aligned every time.

As for VTune, you can go to the Bottom-up tab and see the list of functions along with the TMA (Top-down Microarchitecture Analysis) metrics for each function. If you find a function with a high Loads Blocked by Store Forwarding value, you can double-click it to bring up the source view. Depending on your configuration, you can see the line of code affected by the stall, and then check whether the variable is misaligned or whether the store is smaller than the load that reads it.

0 Kudos
e87tn95h
New Contributor I
2,144 Views

Kudos to you for such a great explanation!

0 Kudos