I'm using VTune to spot bottlenecks in a piece of code that looks like the following:
for (int i = 0; i < X; i++)
    for (int j = 0; j < Y; j++)
        for (int k = 0; k < Z; k++)
            /* loop body elided in the original post: a load-sum-store
               update of A plus other products/sums on the right-hand side */
The code is automatically generated by a high-level tool, which is why it looks "weird". I'm using the most recent Intel suite (compiler and tools).
In a specific run (there is no significant variation in results among different runs), VTune says loads blocked by store forwarding constitute a significant proportion of the execution time, roughly 0.160. If I look at the assembly/source view, it appears that the load-sum-store on A is mainly responsible for that:
1) I can't understand why. A is *never* used in the right-hand-side computation, so why should the problem occur?
In addition, most of the instructions (even, for example, those computing the other products/sums on the right-hand side) seem to be affected by "loads blocked by store forwarding", which is actually weird: although they just use registers to perform the product/sum itself, the VTune analysis says they are "blocked due to store forwarding" (i.e. the column "loads blocked by store forwarding" is non-zero).
2) Could you explain this?
Thank you very much for considering my request.
Completely forgot about that, sorry:
the optimisation report confirms the vectorisation of the loop and says that some loops have been unrolled. What exactly would you like to know from the report?
Another thing I was forgetting: the store of A seems to take a huge fraction of the execution time. I guess it's because it needs to wait for the previous load-sum to complete, but is it possible the compiler can't optimise that? Theoretically that store should be issued and execution should proceed without particular problems (i.e. overlapping memory traffic with CPU computation).
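If the body really is a reduction over k into A (an assumption on my side, since the original body isn't shown), one thing worth trying is keeping the running sum in a local scalar, so the code does a single load and a single store of A per (i,j) instead of a load-sum-store on every k iteration. A minimal sketch with made-up array shapes and indexing (B indexed by (i,k), C by (k,j)):

```c
/* Hypothetical reconstruction of the generated kernel: A[i][j] is
   updated by a load-sum-store on every k iteration. */
static void kernel_naive(double *A, const double *B, const double *C,
                         int X, int Y, int Z)
{
    for (int i = 0; i < X; i++)
        for (int j = 0; j < Y; j++)
            for (int k = 0; k < Z; k++)
                A[i * Y + j] += B[i * Z + k] * C[k * Y + j];
}

/* Same computation, but the running sum lives in a register:
   one load and one store of A per (i,j) pair. */
static void kernel_accum(double *A, const double *B, const double *C,
                         int X, int Y, int Z)
{
    for (int i = 0; i < X; i++)
        for (int j = 0; j < Y; j++) {
            double acc = A[i * Y + j];          /* single load  */
            for (int k = 0; k < Z; k++)
                acc += B[i * Z + k] * C[k * Y + j];
            A[i * Y + j] = acc;                 /* single store */
        }
}
```

Good compilers usually do this scalar replacement themselves, but possible aliasing between A, B and C (or the way the tool generates the code) can prevent it; `restrict` qualifiers on the pointers may help.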
I was curious about whether the compiler might have interchanged or fused loops, as well as the points you reported. If the bottleneck is in the cache hierarchy, AVX-256 may not be advantageous. Stabbing in the dark, it might be interesting to test the effect of #pragma vector unaligned or #pragma vector nontemporal on the inner loop.
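For reference, the pragma would sit directly above the inner loop. A sketch (Intel-compiler-specific; the kernel body and array names here are placeholders, not the original generated code):

```c
/* Sketch of where an Intel vectorisation pragma would go.
   #pragma vector is icc/icx-specific; other compilers ignore it
   with an unknown-pragma warning. */
static void kernel(double *A, const double *B, int n, int Z)
{
    for (int j = 0; j < n; j++) {
        /* nontemporal: stream the stores of A past the cache;
           only worth trying if A is not re-read soon after. */
#pragma vector nontemporal
        for (int k = 0; k < Z; k++)
            A[j * Z + k] = 2.0 * B[j * Z + k];
    }
}
```

Since the data reportedly fits in L1, nontemporal stores may well hurt rather than help, but it's cheap to measure both pragmas and compare.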
I think that there are a lot of data dependencies in the code. Part of the innermost loop's arrays needs to wait for completion of the other part of the calculation. Unfortunately those array invariants could not be moved outside of the loop.
@illyapolak I'm not following you. At the source-code level there are no data dependencies: all j,k iterations are independent.
In particular, I didn't quite get what you mean by "part of the innermost loop arrays needs to wait for completion of the other part of the calculation".
There are a lot of loads going through the two load/store ports, and unless all of those arrays are prefetched and present in the L1 cache there will be some delay.
I still keep forgetting important info: the data usually fit in the L1 cache, and in the case I described they definitely fit.
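As a sanity check on "fits in L1", the working-set arithmetic is simple. A sketch, where the 32 KiB L1d size, the trip counts and the number of arrays are all assumptions of mine (the real figures depend on the CPU and the generated kernel):

```c
#include <stddef.h>

/* Back-of-the-envelope working set touched by the inner loops:
   num_arrays arrays of y*z doubles each. */
static long working_set_bytes(long y, long z, int num_arrays)
{
    return num_arrays * y * z * (long)sizeof(double);
}
```

For example, 4 arrays of 16x32 doubles give 4 * 16 * 32 * 8 = 16384 bytes, which fits an assumed 32 KiB L1d with room to spare.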
Despite being in the L1 cache, I guess the data are affected anyway by the dependence you're mentioning (it's actually something I thought about), but I wouldn't classify it as "loads blocked due to store forwarding", which according to VTune is the main cause of the performance degradation.
I also do not know why the VTune logic decided to call it "loads blocked due to store forwarding". If I understood it correctly, such behaviour can be related to the processor not being able to resolve (calculate) store addresses when fetching the next load.
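For what it's worth, the textbook case where forwarding fails is a narrow store followed by a wider, overlapping load: the store buffer can't supply all the bytes, so the load stalls until the stores reach the cache. A contrived sketch (assumes a little-endian machine; a real compiler may keep these temporaries in registers, so this is illustrative, not a benchmark):

```c
#include <stdint.h>
#include <string.h>

/* Narrow store followed by a wider overlapping load: the classic
   pattern reported as "loads blocked by store forwarding".
   memcpy is used for well-defined type punning. */
static uint64_t narrow_store_wide_load(uint64_t x)
{
    unsigned char buf[8];
    memcpy(buf, &x, 8);           /* 64-bit store                    */
    uint16_t lo = 0x1234;
    memcpy(buf, &lo, 2);          /* 16-bit store into the low bytes */
    uint64_t y;
    memcpy(&y, buf, 8);           /* 64-bit load overlapping both
                                     stores: forwarding can fail     */
    return y;
}
```

In vectorised/unrolled code a similar mismatch can appear when a wide vector store partially overlaps a later narrower or differently aligned load, which might explain why so many instructions in the unrolled loop get charged with the event; this is speculation on my part, not something the profile confirms.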