Marcin_K_ · ‎03-05-2014

Hello,

In Agner Fog's excellent microarchitecture.pdf (section 9.14) I read that:

Store forwarding works in the following cases: [...] When a write of 128 or 256 bits is followed by a read of the same size and the same address, aligned by 16.

On the other hand, in Intel's Architecture Optimization Reference Manual (2.2.5.2 Intel Sandy Bridge, L1 DCache) I read that:

Stores cannot forward to loads in the following cases: [...] Any load that crosses a 16-byte boundary of a 32-byte store.

It would seem that a 32-byte load does cross a 16-byte boundary, so it should not be forwarded. However, table 2.16 in section 2.2.5.2 does indicate that forwarding takes place for a 32-byte store/load pair when the load is from the same address.
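To make the apparent contradiction concrete, here is a minimal sketch of the two quoted rules as address predicates. The function names are hypothetical, and the predicates only encode the wording quoted above; they are not a model of what the hardware actually does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if the access [addr, addr + size) spans a 16-byte boundary
   (the condition named in the quoted Intel rule). */
bool crosses_16b_boundary(uintptr_t addr, size_t size) {
    assert(size > 0);
    return (addr & 15u) + size > 16u;
}

/* The quoted Agner Fog rule: a 128- or 256-bit write followed by a read
   of the same size at the same address, aligned by 16. */
bool same_size_same_address_aligned16(uintptr_t st_addr, size_t st_size,
                                      uintptr_t ld_addr, size_t ld_size) {
    return st_addr == ld_addr && st_size == ld_size &&
           (st_size == 16 || st_size == 32) && (st_addr % 16 == 0);
}
```

For a 32-byte access at a 4096-byte-aligned address, both predicates are true: the load satisfies the Agner Fog condition while also spanning a 16-byte boundary. A 16-byte access at the same address satisfies the first rule without crossing a boundary.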

I wrote the following simple code to test this, and it seems that 32 byte stores have a small penalty when forwarded to subsequent 32 byte loads on the Sandy Bridge (and Ivy Bridge) architecture. Here is the code:

#include <stdlib.h>
#include <malloc.h>

int main(){

  long i;

  // aligned memory address
  double *tempa = (double*)memalign(4096, sizeof(double)*4);
  for(i=0; i<4; i++) tempa[i] = 1.0;

  for(i=0; i<1000000000; i++){ // 1e9 iterations

#ifdef TEST_AVX
    __asm__("vmovapd    %%ymm12, (%0)\n\t"
            "vmovapd    (%0), %%ymm13\n\t"
        :
        :"r"(tempa));
#else
    __asm__("movapd %%xmm12, (%0)\n\t"
            "movapd (%0), %%xmm13\n\t"
            :
            :"r"(tempa));
#endif
  }
}

Compiled with gcc -O3 (adding -DTEST_AVX for the AVX variant). Analyzing the loop body using IACA (throughput, -arch SNB) I get the following results for the SSE2 and AVX cases, respectively:

SSE2 case

| Num Of |              Ports pressure in cycles               |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
---------------------------------------------------------------------
|   2^   |           |     | 0.5       | 0.5       | 1.0 |     | CP | vmovapd xmmword ptr [rax], xmm12
|   1    |           |     | 0.5   0.5 | 0.5   0.5 |     |     | CP | vmovapd xmm13, xmmword ptr [rax]
|   1    |           |     |           |           |     | 1.0 | CP | sub r14d, 0x1
|   0F   |           |     |           |           |     |     |    | jnz 0xfffffffffffffff4

AVX case

| Num Of |              Ports pressure in cycles               |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
---------------------------------------------------------------------
|   2^   |           |     | 0.5       | 0.5       | 2.0 |     | CP | vmovapd ymmword ptr [rax], ymm12
|   1    |           |     | 0.5   1.0 | 0.5   1.0 |     |     |    | vmovapd ymm13, ymmword ptr [rax]
|   1    |           |     |           |           |     | 1.0 |    | sub r14d, 0x1
|   0F   |           |     |           |           |     |     |    | jnz 0xfffffffffffffff4

The first observation is that the two cases differ on port 4: according to IACA, the first instruction of the AVX code dispatches 2 uops there. When running the code with performance counters, the number of uops dispatched to the individual ports agrees with IACA for the SSE code, but DOES NOT for the AVX case. During execution only one uop, NOT TWO, is dispatched to port 4 (or the counter only reports one), just as in the SSE2 scenario. On the other hand, the measured number of cycles per iteration is 1 in the SSE scenario and 2 in the AVX scenario.

Moreover, in the AVX case I count one store-forwarding block event per iteration (event 03H, umask 02H, LD_BLOCKS.STORE_FORWARD). The counter reads 0 for the SSE2 case. So it seems that there are a number of problems here:

1. IACA reports a higher number of uops being executed than the hardware actually executes, or the hardware counter does not report all uops dispatched to port 4.
2. IACA correctly predicts that the loop body takes 2 cycles for the AVX case and 1 cycle for the SSE case, but possibly for the wrong reasons.
3. For 32-byte loads/stores LD_BLOCKS.STORE_FORWARD is incremented once per iteration, although according to the documentation this is a successful store-forwarding scenario.
4. The apparent penalty for this store-forward block is only 1 cycle, not the ~12 cycles reported by Agner (and, I think, also by the Intel documentation).
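For anyone who wants to reproduce the LD_BLOCKS.STORE_FORWARD reading, here is a minimal sketch, assuming Linux and its perf_event_open interface (the post does not say which tool was used to read the counters). The helper names raw_config, open_raw_counter and count_events are hypothetical. The raw encoding places the umask in bits 15:8 and the event select in bits 7:0, so event 03H with umask 02H becomes 0x0203.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Raw config for a core PMU event: umask in bits 15:8, event in bits 7:0. */
uint64_t raw_config(unsigned event, unsigned umask) {
    assert(event <= 0xff && umask <= 0xff);
    return ((uint64_t)umask << 8) | event;
}

/* Hypothetical helper: open a user-space-only raw counter on the calling
   thread. Returns a file descriptor, or -1 if the kernel refuses. */
int open_raw_counter(uint64_t config) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof attr;
    attr.config = config;
    attr.disabled = 1;        /* start stopped; enabled explicitly below */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
}

/* Hypothetical helper: count events while fn() runs.
   Returns -1 if the counter cannot be opened or read. */
long long count_events(uint64_t config, void (*fn)(void)) {
    uint64_t count = 0;
    int fd = open_raw_counter(config);
    if (fd < 0) return -1;
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    if (read(fd, &count, sizeof count) != (ssize_t)sizeof count) {
        close(fd);
        return -1;
    }
    close(fd);
    return (long long)count;
}
```

Wrapping the test loop in a function and calling count_events(raw_config(0x03, 0x02), loop_fn) should, on a Sandy Bridge or Ivy Bridge machine, give a count to compare against the iteration count. The encoding is only meaningful on CPUs that define this event with these codes.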

Now, in a real application, spilling ymm registers to memory can happen quite often. This produces a lot of 'spurious' LD_BLOCKS.STORE_FORWARD events, which is certainly worrying. Is this a real problem, though? The 1-clock penalty measured in the above test may not be much, but is it always 1 clock?

Marcin_K_ · ‎03-05-2014

Of course, it has now occurred to me that on SNB/IVB reading or writing a 32-byte register from/to memory takes 2 cycles, so I guess that explains the performance difference, and there is no penalty for blocked store forwarding. However, the question about the performance counters and IACA still remains.
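The 2-cycle explanation can be checked against the IACA tables with a small back-of-the-envelope model. This is only a sketch under stated assumptions: 16-byte-wide data paths on the load and store-data ports, one store-address uop shared evenly between ports 2 and 3, and the loop counter on port 5, as the port columns in the tables suggest. The function names are hypothetical.

```c
#include <assert.h>

/* Cycles a port is busy moving `bytes` over an assumed 16-byte data path. */
double data_cycles(unsigned bytes) {
    assert(bytes > 0);
    return (double)((bytes + 15u) / 16u);
}

/* Estimated cycles per iteration for one store plus one load of `bytes`,
   followed by the loop counter update: the busiest port sets the pace. */
double loop_cycles(unsigned bytes) {
    double p4  = data_cycles(bytes);               /* store data on port 4 */
    double p23 = (1.0 + data_cycles(bytes)) / 2.0; /* store address + load, split over ports 2 and 3 */
    double p5  = 1.0;                              /* sub on port 5 */
    double max = p4;
    if (p23 > max) max = p23;
    if (p5 > max) max = p5;
    return max;
}
```

With these assumptions loop_cycles(16) gives 1 and loop_cycles(32) gives 2, matching the measured cycles per iteration. Note that the IACA column header reads "ports pressure in cycles", so the 2.0 on port 4 may describe a single uop occupying the port for two cycles rather than two uops, which would be consistent with the counter reporting one uop.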

Richard_Nutman · ‎03-10-2014

Hi Marcin,

I experienced similar results a while back on Sandy Bridge when working with AVX code. VTune reported loads blocked due to store-forwarding issues, but there were no places where that should have occurred. Perhaps the false triggering of those performance counters is what I was seeing as well.

Bernard · ‎03-11-2014

>>> Perhaps the false triggering of those performance counters is what I was seeing as well.>>>

Could be related to some issue either with the microcode or with the hardware (the counter logic) itself.

Marcin_K_ · ‎03-11-2014

iliyapolak wrote:

>>> Perhaps the false triggering of those performance counters is what I was seeing as well.>>>

Could be related to some issue either with the microcode or with the hardware (the counter logic) itself.

Seems like a problem with the counters. The code works with the expected performance, so I gather there should be no blocks reported.

Bernard · ‎03-11-2014

>>>Seems like a problem with the counters>>>

Probably yes. In the case you describe, my educated guess is that the counter logic (a physical register) is incremented on every load uop, or on some of the load uops, present in the scheduler. I do not know whether the microcode and hardware logic evaluate the LD_BLOCKS.STORE_FORWARD events correctly.

32 byte store to load forwarding on Sandy Bridge