Analyzers
Talk to fellow users of Intel Analyzer tools (Intel VTune™ Profiler, Intel Advisor)

How to use Intel VTune to analyze the cause of unstable performance

Xiaoqiang
New Contributor I
2,711 Views

I posted an issue with unstable performance on the Intel Community.

Intel IPP library performance is unstable, 2x performance difference between 2 runs. - Intel Community

VTune did show that the pipeline was Back-End Bound in the runs that randomly performed poorly.

Can VTune help me find the cause of this behavior?

If so, what should I do next in VTune?

Xiaoqiang_0-1755249438609.png

4 Replies
e87tn95h
New Contributor I
2,470 Views

Regarding line 22 of the test program ippsAddProduct_32fc.c from the original post, I believe it would be beneficial to add the following alignment directive for GCC 7:

__attribute__ ((aligned (64))) Ipp32fc des[32] = {0};

Since this destination accumulator vector is used for both reading (load) and writing (store) operations, alignment-related issues could potentially occur at runtime if the alignment is not explicitly specified.

Xiaoqiang
New Contributor I
2,370 Views

Thanks for your help.

The performance is stable with the addition of the alignment directive.

Why does the alignment directive have such a big impact on performance?

optimizergal
Beginner
2,337 Views

If a store and a load to the same data are close enough together in the pipeline, the CPU can forward the stored value directly to the load instead of making the load wait until the store has reached the cache (store forwarding). This can save several CPU cycles. There are different issues with alignment, but basically store forwarding may not work if the data isn't aligned to the right memory address. Without the attribute the data might end up aligned, or it might not, and I think that's why you see different performance with the same code. The attribute guarantees that it will be aligned every time.

As for VTune, you can go to the Bottom-up tab and see the list of functions along with the TMA (Top-down Microarchitecture Analysis) metrics for each function. If you find a function with a high Loads Blocked by Store Forwarding value, you can double-click it to bring up the source view. Depending on your configuration, you can see the line of code affected by the stall, and then check whether the variable is misaligned or was written with a smaller size than the subsequent read.

e87tn95h
New Contributor I
2,321 Views

Kudos to you for such a great explanation!
