Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Loads blocked due to store forwarding

FabioL_
Beginner

Hi all

I'm using VTune to spot bottlenecks in a piece of code that looks like the following:

for (int i = 0; i < X; i++) 
  for (int j = 0; j < Y; j++) 
    for (int k = 0; k < Z; k++) 
      A += (FE0*FE0*I[0] + B1*C1*I[1] + B2*C2*I[2] + ...);

The code is automatically generated by a high-level tool, which is why it looks "weird". I'm using the most recent Intel suite (compiler and tools).

In a specific run (there is no significant variation in results across runs), VTune says that loads blocked by store forwarding account for a significant fraction of the execution time, with a metric value of roughly 0.160. If I look at the assembly/source view, it appears that the load-sum-store on A is mainly responsible for that.
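For clarity, this is roughly the per-iteration pattern I mean by "load-sum-store on A" (my own sketch of what the += expands to, not the actual generated assembly; kernel_term() is just a placeholder name for the long sum of products):

/* My own sketch of what "A += (...)" expands to per innermost iteration;
 * this is not the actual compiler output. kernel_term() is a placeholder
 * for the long FE0*FE0*I[0] + B1*C1*I[1] + ... expression. */
double kernel_term(void);

void accumulate(double *A, int Z)
{
    for (int k = 0; k < Z; k++) {
        double rhs = kernel_term();  /* compute the right-hand side        */
        double tmp = *A;             /* load A                             */
        tmp += rhs;                  /* sum                                */
        *A = tmp;                    /* store A back; it is loaded again   */
                                     /* on the next iteration              */
    }
}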

1) I can't understand why. A is *never* used for the right-hand-side computation, so why should the problem occur?

In addition, most of the instructions (even, for example, those computing the other products/sums in the right-hand side) seem to be affected by "loads blocked by store forwarding", which is actually weird. Although they just use registers to perform the product/sum itself, the VTune analysis says they are "blocked due to store forwarding" (i.e. the column "loads blocked by store forwarding" is non-zero).

2) Could you explain this?

Thank you very much for considering my request

TimP
Honored Contributor III

Which compile options do you use?  What do the opt-reports say?

FabioL_
Beginner

Completely forgot about that, sorry:

-O3, -xAVX 

The opt-report confirms the vectorisation of the loop and says that some loops have been unrolled - what exactly would you like to know from the report?

Another thing I forgot to mention is that the store of A seems to take a huge fraction of the execution time. I guess that's because it needs to wait for the previous load-sum to complete, but is it possible the compiler can't optimise that? Theoretically that store should be issued and execution should proceed without particular problems (i.e. memory traffic overlapping with CPU computation).
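Just to illustrate what I would naively expect (a hand-written sketch, not the generated code; kernel_term() is again a placeholder for the sum of products): accumulate into a local temporary and store A only once after the loop nest.

/* hand-written sketch: keep the running sum in a register-resident
 * temporary so A is loaded once before and stored once after the loops */
double kernel_term(void);

void accumulate_once(double *A, int X, int Y, int Z)
{
    double acc = *A;                         /* single load of A  */
    for (int i = 0; i < X; i++)
        for (int j = 0; j < Y; j++)
            for (int k = 0; k < Z; k++)
                acc += kernel_term();
    *A = acc;                                /* single store of A */
}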

Thanks 

TimP
Honored Contributor III

I was curious about whether the compiler might have interchanged or fused loops, as well as the points on which you reported.  If the bottleneck is in the cache hierarchy, AVX-256 may not be advantageous.  Stabbing in the dark, it might be interesting to test the effect of #pragma vector unaligned or #pragma vector nontemporal on the inner loop.
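For example, placed directly on the innermost loop (a sketch only; the variable names are the ones from the snippet in the first post, and the elided terms are left elided):

/* sketch of pragma placement for the Intel C/C++ compiler */
void inner_loop(double *A, const double *I, double FE0,
                double B1, double C1, double B2, double C2, int Z)
{
#pragma vector unaligned          /* or: #pragma vector nontemporal */
    for (int k = 0; k < Z; k++)
        *A += FE0*FE0*I[0] + B1*C1*I[1] + B2*C2*I[2] /* + ... */;
}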

Bernard
Valued Contributor I

I think there are a lot of data dependencies in the code. Part of the innermost loop's calculation on the arrays needs to wait for the completion of the other part of the calculation. Unfortunately those I[] array invariants cannot be hoisted outside of the loop.

FabioL_
Beginner

@illyapolak I'm not following you. At the source-code level, there are no data dependencies - all j,k iterations are independent.

In particular, I didn't quite get what you mean by "Part of the innermost loop arrays needs to wait for completion of the the other part of the calculation"

Thanks

Bernard
Valued Contributor I

There are a lot of loads going through the two load/store ports, and unless all of those arrays are prefetched and present in the L1 cache there will be some delay.

(FE0*FE0*I[0] + B1*C1*I[1] + B2*C2*I[2] + ...); the terms of this long array-based chain of sums/products depend on each other and cannot effectively exploit instruction-level parallelism.
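To illustrate the point (my own sketch, not the generated code; B3 and C3 are hypothetical names standing in for one of the elided "..." terms): written as one expression the additions form a single serial chain, whereas independent partial sums give the out-of-order core more to work on.

/* one serial chain of additions vs. two independent partial sums      */
/* B3, C3: hypothetical names standing in for one of the elided terms  */
double chained(const double *I, double FE0, double B1, double C1,
               double B2, double C2, double B3, double C3)
{
    /* evaluated left to right: each + waits for the previous one */
    return FE0*FE0*I[0] + B1*C1*I[1] + B2*C2*I[2] + B3*C3*I[3];
}

double split(const double *I, double FE0, double B1, double C1,
             double B2, double C2, double B3, double C3)
{
    /* the two partial sums have no dependence on each other */
    double s0 = FE0*FE0*I[0] + B2*C2*I[2];
    double s1 = B1*C1*I[1]   + B3*C3*I[3];
    return s0 + s1;          /* combine at the end */
}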

Bernard
Valued Contributor I

I meant at the lowest level, in terms of the utilization of the execution ports.

FabioL_
Beginner

Hi

I keep forgetting important information: the data usually fit in the L1 cache - in the case I described, they definitely fit in the L1 cache.

Even with the data in the L1 cache, I guess they are still affected by the dependence you're mentioning - it's actually something I had thought about - but I wouldn't classify that as "loads blocked due to store forwarding", which according to VTune is the main cause of the performance degradation.

Thank you 

-- Fabio

Bernard
Valued Contributor I

I also do not know why the VTune logic decided to call it "loads blocked due to store forwarding". If I understood it correctly, such behaviour can be related to the processor not being able to resolve (calculate) the addresses of earlier stores when fetching the next load.
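One classic case described in Intel's optimization guidance (a generic illustration, not an analysis of the generated code here) is a load that is wider than, or not fully contained in, a preceding store to the same location - then the stored data cannot be forwarded from the store buffer and the load has to wait.

/* generic illustration of a blocked store-to-load forward: the 8-byte
 * load overlaps the preceding 4-byte store but is wider than it, so the
 * data cannot be forwarded and the load waits for the store to complete */
#include <stdint.h>
#include <string.h>

double widened_reload(double *p, int32_t lo)
{
    memcpy(p, &lo, sizeof lo);   /* 4-byte store into the low bytes of *p */
    return *p;                   /* 8-byte load covering that store       */
}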
