Sustained 2r1w/cycle GPR code on Skylake architecture

Jens_N_ · 02-28-2017

Hello all,

Consider the following code (symbolic asm) on the Intel Skylake architecture:

align 16
.Loop1:

    mov     Limb0, [Op1]
    adc     Limb0, [Op2]
    mov     [Op3], Limb0

    mov     Limb0, [Op1+8]
    adc     Limb0, [Op2+8]
    mov     [Op3+8], Limb0
    mov     Limb0, [Op1+16]
    adc     Limb0, [Op2+16]
    mov     [Op3+16], Limb0
    mov     Limb0, [Op1+24]
    adc     Limb0, [Op2+24]
    mov     [Op3+24], Limb0
    mov     Limb0, [Op1+32]
    adc     Limb0, [Op2+32]
    mov     [Op3+32], Limb0
    mov     Limb0, [Op1+40]
    adc     Limb0, [Op2+40]
    mov     [Op3+40], Limb0
    mov     Limb0, [Op1+48]
    adc     Limb0, [Op2+48]
    mov     [Op3+48], Limb0
    mov     Limb0, [Op1+56]
    adc     Limb0, [Op2+56]
    mov     [Op3+56], Limb0

    lea     Op1, [Op1+64]
    lea     Op2, [Op2+64]
    lea     Op3, [Op3+64]

    dec     Size                ; DEC does not modify CF, so the carry chain survives
    jne     .Loop1
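
For readers less familiar with ADC chains, here is a rough C sketch of what the loop computes: a multi-limb addition with carry propagation, least significant limb first. The name add_limbs and its parameters are my own for illustration and are not part of the asm above; one call with n = 8 * Size corresponds to the whole loop, and carry_in plays the role of the carry flag on entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model of the asm loop: dst = a + b over n
   64-bit limbs. Returns the final carry (0 or 1). */
static uint64_t add_limbs(uint64_t *dst, const uint64_t *a,
                          const uint64_t *b, size_t n, uint64_t carry_in)
{
    assert(carry_in <= 1);
    uint64_t carry = carry_in;
    for (size_t i = 0; i < n; i++) {
        uint64_t s  = a[i] + carry;   /* wraps only if a[i] is all ones and carry is 1 */
        uint64_t c1 = s < carry;      /* carry out of the first addition */
        uint64_t t  = s + b[i];
        uint64_t c2 = t < s;          /* carry out of the second addition */
        dst[i] = t;
        carry = c1 | c2;              /* both can never be set at the same time */
    }
    return carry;
}
```

This is only meant to pin down the semantics; a compiler will not necessarily turn it into the same mov/adc/mov sequence.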

I have made the following observations:

* for low values of Size (e.g. <16) the core is capable of delivering 2r1w/cycle, that is, the execution time in cycles is in line with the number of additions

* for large values of Size (e.g. >=256) the average execution time goes up to 1.21 cycles per addition

* I found no way to improve on this by shuffling operation sequences, changing unroll sizes, etc.

* further analysis showed that the write does not always end up on issue port 7, as it should in the optimal case

* when inserting a "VPOR YMM0, YMM0, YMM0" between the first and second addition, the execution time reliably goes to 1 cycle per addition for any value of Size; this only works with AVX opcodes, not with SSE

* it looks like the micro-op dispatcher for GPR code gets "confused" on longer sequences and is reset by activating the AVX dispatcher once in a while
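
To put the two throughput figures into absolute terms, here is a small back-of-the-envelope helper in C. The names estimated_cycles and ADDS_PER_ITERATION are made up for this post; the only inputs are the unroll factor of the loop above (8 additions per iteration) and the cycles-per-addition values I measured.

```c
#include <assert.h>
#include <stdint.h>

enum { ADDS_PER_ITERATION = 8 };  /* the loop above is unrolled 8 times */

/* Estimated cycles for `size` loop iterations at a given cost per addition.
   Illustrative arithmetic only, not a measurement. */
static double estimated_cycles(uint64_t size, double cycles_per_add)
{
    assert(cycles_per_add > 0.0);
    return (double)(size * ADDS_PER_ITERATION) * cycles_per_add;
}
```

For Size = 256 this gives 2048 cycles at 1 cycle per addition versus roughly 2478 cycles at 1.21, so about 430 cycles are lost per pass over the operands.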

Question:

* Is this known behaviour or just a curious side effect?

Kind regards,
MathMan