Sustained 2r1w/cycle GPR code on Skylake architecture

Jens_N_

Hello all,

Consider the following code (symbolic asm) on the Intel Skylake architecture:

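    align   16
  .Loop1:
    ; limb 0: load from Op1, add the Op2 limb plus the incoming carry, store to Op3
    mov     Limb0, [Op1]
    adc     Limb0, [Op2]
    mov     [Op3], Limb0
    ; limbs 1..7: same pattern, the carry flag chains through every adc
    mov     Limb0, [Op1+8]
    adc     Limb0, [Op2+8]
    mov     [Op3+8], Limb0
    mov     Limb0, [Op1+16]
    adc     Limb0, [Op2+16]
    mov     [Op3+16], Limb0
    mov     Limb0, [Op1+24]
    adc     Limb0, [Op2+24]
    mov     [Op3+24], Limb0
    mov     Limb0, [Op1+32]
    adc     Limb0, [Op2+32]
    mov     [Op3+32], Limb0
    mov     Limb0, [Op1+40]
    adc     Limb0, [Op2+40]
    mov     [Op3+40], Limb0
    mov     Limb0, [Op1+48]
    adc     Limb0, [Op2+48]
    mov     [Op3+48], Limb0
    mov     Limb0, [Op1+56]
    adc     Limb0, [Op2+56]
    mov     [Op3+56], Limb0

    ; advance all three pointers by 8 limbs (64 bytes); lea leaves the flags untouched
    lea     Op1, [Op1+64]
    lea     Op2, [Op2+64]
    lea     Op3, [Op3+64]

    ; dec does not modify CF, so the carry survives into the next iteration
    dec     Size
    jne     .Loop1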
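
For readers who prefer C, the loop above can be read as a multi-limb addition with carry propagation. The following is only a reference sketch of what the loop computes, not a performance equivalent; the function name `add_limbs` and its signature are hypothetical and do not appear in the post, and the sketch assumes 64-bit limbs stored least significant first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the post): op3 = op1 + op2 + carry over
   n limbs, least significant limb first. Returns the final carry (0 or 1). */
uint64_t add_limbs(uint64_t *op3, const uint64_t *op1,
                   const uint64_t *op2, size_t n, uint64_t carry)
{
    assert(n == 0 || (op1 && op2 && op3));
    assert(carry <= 1);
    for (size_t i = 0; i < n; i++) {
        uint64_t a  = op1[i];      /* mov Limb0, [Op1+8*i]            */
        uint64_t s  = a + op2[i];  /* adc, part 1: add the Op2 limb   */
        uint64_t c1 = s < a;       /* carry out of a + b              */
        uint64_t t  = s + carry;   /* adc, part 2: add incoming carry */
        uint64_t c2 = t < s;       /* carry out of adding the carry   */
        op3[i] = t;                /* mov [Op3+8*i], Limb0            */
        carry = c1 | c2;           /* both can never be set at once   */
    }
    return carry;
}
```

One iteration of the assembly loop corresponds to eight iterations of this C loop, so `Size` counts blocks of eight limbs.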

I have made the following observations:

* for low values of Size (e.g. <16) the core is capable of delivering two reads and one write per cycle (2r1w/cycle), that is, the execution time in cycles is in line with the number of additions

* for large values of Size (e.g. >=256) the average execution time goes up to 1.21 cycles per addition

* I found no way to improve on this by reordering instructions, changing the unroll factor, etc.

* further analysis showed that the write does not always end up on issue port 7, as it should in the optimal case

* when inserting a "VPOR YMM0, YMM0, YMM0" between the first and second additions, the execution time goes to 1 cycle per addition reliably for any value of Size; this only works with AVX opcodes, not with SSE

* it looks as if the micro-op dispatcher for GPR code gets "confused" on longer sequences and is reset when the AVX dispatcher is activated once in a while
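
The port 7 observation and the 1.21 figure can be related with a rough back-of-the-envelope model. The sketch below is an assumption, not a measurement: it supposes that each addition issues two loads that must use port 2 or 3, plus one store-address micro-op that can use port 7 or fall back to port 2 or 3, and that ports 2 and 3 are the only bottleneck. The function names are hypothetical.

```c
#include <assert.h>

/* Assumed simplified model (not from the post): per addition, two loads
   occupy ports 2/3, and a fraction f of the store-address micro-ops also
   lands on ports 2/3 instead of port 7. */

/* Cycles per addition when ports 2 and 3 together must serve
   2 + f micro-ops per addition at two micro-ops per cycle. */
double cycles_per_addition(double f)
{
    assert(f >= 0.0 && f <= 1.0);
    return (2.0 + f) / 2.0;
}

/* Inverse: fraction of stores that would have to miss port 7 to explain
   a measured cost between 1.0 and 1.5 cycles per addition. */
double fraction_off_port7(double cycles_per_add)
{
    assert(cycles_per_add >= 1.0 && cycles_per_add <= 1.5);
    return 2.0 * cycles_per_add - 2.0;
}
```

Under these assumptions, the measured 1.21 cycles per addition would correspond to roughly 42% of the stores having their address generated on port 2 or 3, while the 1 cycle per addition seen with the VPOR inserted would correspond to every store reaching port 7.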

Question:

* is this known behaviour or just a curious side effect?

Kind regards,
MathMan
