Hello all,
Consider the following code (symbolic asm) on the Intel Skylake architecture:
align 16
.Loop1:
mov Limb0, [Op1]
adc Limb0, [Op2]
mov [Op3], Limb0
mov Limb0, [Op1+8]
adc Limb0, [Op2+8]
mov [Op3+8], Limb0
mov Limb0, [Op1+16]
adc Limb0, [Op2+16]
mov [Op3+16], Limb0
mov Limb0, [Op1+24]
adc Limb0, [Op2+24]
mov [Op3+24], Limb0
mov Limb0, [Op1+32]
adc Limb0, [Op2+32]
mov [Op3+32], Limb0
mov Limb0, [Op1+40]
adc Limb0, [Op2+40]
mov [Op3+40], Limb0
mov Limb0, [Op1+48]
adc Limb0, [Op2+48]
mov [Op3+48], Limb0
mov Limb0, [Op1+56]
adc Limb0, [Op2+56]
mov [Op3+56], Limb0
lea Op1, [Op1+64]
lea Op2, [Op2+64]
lea Op3, [Op3+64]
dec Size
jne .Loop1
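For readers who prefer a higher-level view, here is a minimal C sketch of what the loop computes: a multi-limb addition Op3 = Op1 + Op2 with the carry rippling through every limb. It assumes 64-bit limbs, that Size counts blocks of 8 limbs (so n = 8 * Size), and that the carry flag on loop entry is passed in as `carry`. The name `add_limbs` and its signature are hypothetical, chosen only for illustration. This is a functional reference, not a statement about what a compiler would generate.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference for the adc chain: op3[i] = op1[i] + op2[i] + carry.
 * 'carry' plays the role of CF on loop entry (0 or 1).
 * Returns the carry out of the most significant limb. */
uint64_t add_limbs(uint64_t *op3, const uint64_t *op1,
                   const uint64_t *op2, size_t n, uint64_t carry)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t s  = op1[i] + carry;   /* add incoming carry first */
        uint64_t c1 = s < carry;        /* wrapped only if op1[i] was all ones */
        uint64_t t  = s + op2[i];
        uint64_t c2 = t < s;            /* wrapped on the second addition */
        op3[i] = t;
        carry = c1 | c2;                /* at most one of c1, c2 can be set */
    }
    return carry;
}
```

Each iteration corresponds to one mov/adc/mov triple in the asm: two loads, one add with carry, one store.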
I have made the following observations:
* For low values of Size (e.g. <16) the core sustains 2 reads and 1 write per cycle (2r1w/cycle), i.e. the execution time in cycles matches the number of additions.
* For large values of Size (e.g. >=256) the average execution time rises to 1.21 cycles per addition.
* I found no way to improve on this by reordering the instruction sequence, changing the unroll factor, etc.
* Further analysis showed that the write does not always end up on issue port 7, as it should in the optimal case.
* When inserting a "VPOR YMM0, YMM0, YMM0" between the first and second addition, the execution time reliably drops to 1 cycle per addition for any value of Size. This only works with AVX opcodes, not with SSE.
* It looks as if the micro-op dispatcher for GPR instructions gets "confused" on longer sequences and is reset by activating the AVX dispatcher once in a while.
Question:
* Is this known behaviour, or just a curious side effect?
Kind regards,
MathMan