Solved: Re: How to use Intel VTune to analyze the cause of unstable performance

Xiaoqiang · ‎08-15-2025

I previously posted about an unstable performance issue on the Intel Community:

Intel IPP library performance is unstable, 2x performance difference between 2 runs. - Intel Community

VTune did show that the pipeline was Back-End Bound in the runs that randomly performed poorly.

Can VTune help me find the root cause of this behavior?

If so, what should I do next in VTune?

e87tn95h · ‎08-17-2025

Regarding line 22 of the test program ippsAddProduct_32fc.c from the original post, I believe it would be beneficial to add the following alignment directive for GCC 7:

__attribute__ ((aligned (64))) Ipp32fc des[32] = {0};

Since this destination accumulator vector is used for both reading (load) and writing (store) operations, alignment-related issues could potentially occur at runtime if the alignment is not explicitly specified.
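To confirm whether a buffer actually lands on a 64-byte boundary, its address can be checked at runtime. The following is only a minimal sketch: Cplx32f is a hypothetical stand-in for Ipp32fc (a pair of 32-bit floats) so that it builds without the IPP headers, and is_aligned and report_alignment are illustrative helper names, not IPP functions. The arrays here are file-scope statics, whereas the array in the test program may be a local variable, so the addresses will not match the original program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for Ipp32fc, so the sketch builds without IPP. */
typedef struct { float re; float im; } Cplx32f;

/* Returns 1 if the address p is a multiple of alignment, 0 otherwise. */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0);
    return ((uintptr_t)p % alignment) == 0;
}

/* Same GCC attribute as in the suggestion: request 64-byte alignment. */
static __attribute__ ((aligned (64))) Cplx32f des_aligned[32] = {0};

/* No attribute: only the natural alignment of the type is guaranteed. */
static Cplx32f des_default[32] = {0};

/* Prints both addresses so that different builds or runs can be compared. */
static void report_alignment(void)
{
    printf("with attribute:    %p, 64-byte aligned: %d\n",
           (void *)des_aligned, is_aligned(des_aligned, 64));
    printf("without attribute: %p, 64-byte aligned: %d\n",
           (void *)des_default, is_aligned(des_default, 64));
}
```

Calling report_alignment() from main and comparing the output across builds shows whether the unattributed array happens to be aligned. Only the array declared with the attribute is guaranteed to report 1.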

Xiaoqiang · ‎08-19-2025

Thanks for your help.

The performance is stable with the addition of the alignment directive.

Why does the alignment directive have such a big impact on performance?

optimizergal · ‎08-19-2025

If a store and a load for the same data exist close enough together in the pipeline, the CPU does not have to wait for the store to reach the cache; it can forward the data directly from the store to the load (store forwarding). This can save several CPU cycles. There are different issues with alignment, but basically store forwarding may not work if the data isn't aligned to the right memory address. The data might end up aligned without the attribute, but it might not, and I think that's why you see different performance with the same code. The attribute guarantees that it will be aligned every time.
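One of the alignment issues is an access that straddles two cache lines. The sketch below does not measure store forwarding; it only does the address arithmetic that shows why the same code can behave differently depending on where the buffer lands. It assumes 64-byte cache lines and 32-byte vector accesses (I don't know which width IPP actually uses here), and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, as on typical current x86 CPUs. */
#define CACHE_LINE 64u

/* Returns 1 if an access of size bytes starting at addr touches more
   than one cache line, 0 otherwise. */
static int crosses_cache_line(uintptr_t addr, size_t size)
{
    assert(size > 0);
    uintptr_t first = addr / CACHE_LINE;
    uintptr_t last = (addr + size - 1) / CACHE_LINE;
    return first != last;
}

/* Counts how many of count back-to-back accesses of size bytes,
   starting at base, are split across two cache lines. */
static size_t count_split_accesses(uintptr_t base, size_t size, size_t count)
{
    size_t splits = 0;
    for (size_t i = 0; i < count; ++i)
        splits += (size_t)crosses_cache_line(base + i * size, size);
    return splits;
}
```

For a 256-byte buffer (32 complex floats) walked in eight 32-byte accesses, count_split_accesses(0, 32, 8) is 0 when the buffer starts on a 64-byte boundary, while count_split_accesses(8, 32, 8) is 4 when it starts 8 bytes past one. Without the attribute, either placement can occur.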

As for VTune, you can go to the Bottom-up tab and see the list of functions along with the TMA metrics for each function. If you find a function with a high Loads Blocked by Store Forwarding value, you can double-click on it to bring up the source view. Depending on certain configurations, you can see the line of code affected by the stall, and then check whether the variable is misaligned or was written with a store smaller than the load that reads it.

e87tn95h · ‎08-19-2025

Kudos to you for such a great explanation!
