
SSE vs AVX, kinda disappointed

gol
Beginner
(not sure it's the best place to post this? plz tell me if it should be somewhere else)

I have a programmer's dream in a project:
-most of the CPU is sucked in one small & simple loop
-that loop hardly has mem access (writing only, & adding to a buffer that's already in the cache)
-all the data is in 32-byte aligned 32-byte packs

So it was perfect for a change from SSE1 to AVX. I had a new i7 and was on Win7; everything was perfect except for one thing: I'm writing in Delphi, which doesn't support AVX (& no intrinsics either, sigh..)
So I painfully compiled instructions using Flat Assembler, & copied machine code back into Delphi.
Sadly.. the code was hardly faster, less than 10%. On one hand, I could benefit from the 3-operand instructions, and since I replaced two 16-byte SSE registers with one 32-byte AVX register, I freed up registers (it's 32-bit code btw), & thus needed even fewer memory accesses. On the other hand, I probably lost some multiple-instructions-per-cycle execution, because the code is less parallel. But I can't believe it's still less than 10% faster..
Could it be the extra byte per instruction in the AVX version? (quite ironic since one of the key advantages that's listed for AVX is to be "compact").

My loop looked like this:

MOVAPS xmm3,DQWORD PTR TPckTone(EAX).cb
MOVAPS xmm7,DQWORD PTR TPckTone(EAX+16).cb
MULPS xmm3,xmm1
SUBPS xmm3,xmm2 // cos:=cb*snm1-snm2
MULPS xmm7,xmm5
SUBPS xmm7,xmm6 // cos:=cb*snm1-snm2
MOVAPS xmm2,xmm1 // snm2:=snm1
MOVAPS xmm6,xmm5 // snm2:=snm1
MOVAPS xmm1,xmm3 // snm1:=cos
MULPS xmm3,xmm0 // amp
MOVAPS xmm5,xmm7 // snm1:=cos
MULPS xmm7,xmm4 // amp

ADDPS xmm3,[EDX]
ADDPS xmm3,xmm7 // sum outputs

SUBPS xmm0,[EBX] // DQWORD PTR TPckTone(EAX).GoalVol
MULPS xmm0,[EDI] // DQWORD PTR TPckTone(EAX).VolSpeed
ADDPS xmm0,[EBX] // DQWORD PTR TPckTone(EAX).GoalVol
SUBPS xmm4,[ESI] // DQWORD PTR TPckTone(EAX+16).GoalVol
MULPS xmm4,[EBP] // DQWORD PTR TPckTone(EAX+16).VolSpeed
MOVAPS [EDX],xmm3 // output
ADDPS xmm4,[ESI] // DQWORD PTR TPckTone(EAX+16).GoalVol

ADD EDX,16
SUB ECX,1
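
For readers who find the register shuffling hard to follow, here is a scalar C model of what one iteration of the loop appears to compute, reconstructed from the asm comments. It is only a sketch: the struct `PackState`, its field names, and the function `run_pack` are hypothetical, and it assumes the low and high 4-float halves of a pack are summed into 4 output floats per step, as the ADDPS/[EDX] lines suggest.

```c
#include <stddef.h>

#define LANES 8  /* one 32-byte pack of 32-bit floats */

/* Hypothetical per-pack state; names follow the asm comments. */
typedef struct {
    float cb[LANES];        /* recurrence coefficient        */
    float snm1[LANES];      /* previous oscillator value     */
    float snm2[LANES];      /* oscillator value before that  */
    float amp[LANES];       /* current volume                */
    float goal_vol[LANES];  /* GoalVol                       */
    float vol_speed[LANES]; /* VolSpeed                      */
} PackState;

/* Runs n_steps iterations, adding 4 floats per step into out. */
void run_pack(PackState *s, float *out, size_t n_steps)
{
    for (size_t i = 0; i < n_steps; ++i) {
        float scaled[LANES];
        for (int l = 0; l < LANES; ++l) {
            /* cos := cb*snm1 - snm2 */
            float c = s->cb[l] * s->snm1[l] - s->snm2[l];
            s->snm2[l] = s->snm1[l];   /* snm2 := snm1 */
            s->snm1[l] = c;            /* snm1 := cos  */
            scaled[l] = c * s->amp[l]; /* amp          */
            /* volume moves towards GoalVol at rate VolSpeed */
            s->amp[l] = (s->amp[l] - s->goal_vol[l]) * s->vol_speed[l]
                        + s->goal_vol[l];
        }
        /* sum low & high halves, then accumulate into the buffer */
        for (int l = 0; l < 4; ++l)
            out[4 * i + l] += scaled[l] + scaled[l + 4];
    }
}
```

Both listings (the SSE one above and the AVX one below) should match this model up to floating-point rounding order; it is meant as a reference to check the asm against, not as a fast implementation.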



& became this (I skipped the surrounding code so it's not quite the same as above, but it does the same thing):


DB $C5,$FC,$28,$FA //VMOVAPS ymm7,ymm2
DB $C5,$DC,$59,$D9 //VMULPS ymm3,ymm4,ymm1
DB $C5,$FC,$28,$D1 //VMOVAPS ymm2,ymm1 // snm2:=snm1
DB $C5,$E4,$5C,$CF //VSUBPS ymm1,ymm3,ymm7 // cos:=cb*snm1-snm2
DB $C5,$F4,$59,$D8 //VMULPS ymm3,ymm1,ymm0 // amp

// add low & high, then output
DB $C4,$E3,$7D,$19,$DF,$01 //VEXTRACTF128 xmm7,ymm3,1
DB $C5,$FC,$5C,$C5 //VSUBPS ymm0,ymm0,ymm5 // TPckTone(EAX).GoalVol
DB $C5,$E0,$58,$DF //VADDPS xmm3,xmm3,xmm7
DB $C5,$FC,$59,$C6 //VMULPS ymm0,ymm0,ymm6 // TPckTone(EAX).VolSpeed
DB $C5,$E0,$58,$1A //VADDPS xmm3,xmm3,[EDX]
DB $C5,$FC,$58,$C5 //VADDPS ymm0,ymm0,ymm5 // TPckTone(EAX).GoalVol
DB $C5,$F8,$29,$1A //VMOVAPS [EDX],xmm3

ADD EDX,16
SUB ECX,1

(attempts at unrolling & reordering lines didn't really succeed. The 3 lines I interleaved at the end don't even make a difference actually, while I was expecting them to hide the latencies)

Now, instead of processing 8 32bit floats at once, I could try to make bigger changes in my code & process 16 at once, thus I'd have exactly the same as the original code, but using ymm registers. But I wonder if it would really improve the speed.. it's quite tedious & it'd be sad if it didn't.


Also, the more I try to understand the rules for getting multiple instructions per cycle, the less I do. In the past it seemed pretty simple: you could pair operations that were totally independent. But I don't understand how it works these days. For example, at the top you see
MULPS xmm3,xmm1
SUBPS xmm3,xmm2 // cos:=cb*snm1-snm2
MULPS xmm7,xmm5
SUBPS xmm7,xmm6 // cos:=cb*snm1-snm2
which I would normally write like this
MULPS xmm3,xmm1
MULPS xmm7,xmm5
SUBPS xmm3,xmm2 // cos:=cb*snm1-snm2
SUBPS xmm7,xmm6 // cos:=cb*snm1-snm2
..but if I left it as the version above, it's because I reordered stuff at random (what I always do these days) & kept the fastest. But I don't understand it. Likewise, the first SSE1 code was totally different & should have been faster: it had no memory reads & used only registers, but no, it was slower.
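
One way to reason about this, offered as a rough sketch and not as a measurement: with out-of-order execution, the order of independent instructions inside a short loop tends to matter less than the longest loop-carried dependency chain. The C snippet below estimates that bound for the loop in this post. The latencies (3 cycles for ADDPS/SUBPS, 5 cycles for MULPS) are assumptions taken from commonly published figures for the first AVX-capable Core CPUs, and all function names are hypothetical.

```c
/* Assumed instruction latencies in cycles; not measured here. */
enum { LAT_ADD = 3, LAT_MUL = 5 };

/* Chain through snm1: cos := cb*snm1 - snm2 (MULPS, then SUBPS). */
static int osc_chain_cycles(void)
{
    return LAT_MUL + LAT_ADD;
}

/* Chain through the volume: SUBPS GoalVol, MULPS VolSpeed, ADDPS GoalVol. */
static int amp_chain_cycles(void)
{
    return LAT_ADD + LAT_MUL + LAT_ADD;
}

/* An iteration cannot finish faster than its longest loop-carried chain. */
int min_cycles_per_iteration(void)
{
    int a = osc_chain_cycles();
    int b = amp_chain_cycles();
    return a > b ? a : b;
}

/* Best-case floats per cycle when independent chains overlap perfectly. */
double floats_per_cycle(int lanes_per_chain, int independent_chains)
{
    return (double)(lanes_per_chain * independent_chains)
           / min_cycles_per_iteration();
}
```

Under these assumptions, `floats_per_cycle(4, 2)` (two xmm chains) and `floats_per_cycle(8, 1)` (one ymm chain) give the same bound, which would be consistent with the small gain observed, while `floats_per_cycle(8, 2)` (two independent ymm chains, i.e. 16 floats per iteration) doubles it. Real throughput also depends on port pressure and loads, which this sketch ignores.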


(Edit: thanks to the 3-operand instructions, I could simplify these 3, and it's faster, but for "other reasons")
DB $C5,$FC,$5C,$C5 //VSUBPS ymm0,ymm0,ymm5 // TPckTone(EAX).GoalVol
DB $C5,$FC,$59,$C6 //VMULPS ymm0,ymm0,ymm6 // TPckTone(EAX).VolSpeed
DB $C5,$FC,$58,$C5 //VADDPS ymm0,ymm0,ymm5 // TPckTone(EAX).GoalVol
2 Replies
TimP
Honored Contributor III
I suppose you'd have to count micro-ops to compare effective code size at execution. If you got more fusion in the SSE code, it could contribute to minimizing the difference in run time; as you've already pointed out, multiple issue also contributes.
You don't show what you do about loop body code alignment. I've noticed that Intel compilers take more care about it when the AVX option is set. Without specified alignments, you may get full performance, but more luck is involved.
If your data miss L1, you get no advantage for attempting to access 256 bits on one cycle, regardless of whether you do it with 1 or 2 instructions.
gol
Beginner
>>You don't show what you do about loop body code alignment

I align the loop to 16 bytes, but it makes no difference.
(actually I can only align in Delphi XE2, and I still mostly use Delphi 2007; can you believe that code alignment is a "new feature" in Delphi?)


>>If your data miss L1, you get no advantage for attempting to access 256 bits on one cycle, regardless of whether you do it with 1 or 2 instructions.

Well, in this loop I only have one read/write, and weirdly it's not what's eating the most time (at all). I also prefetch it, so that it's ready when I need to read it (but it's normally in the cache already).
Prefetching, too, is something I understand in theory but not in practice. I've never gotten any boost from prefetching.

So.. most likely it really is that the SSE code can sustain more instructions per cycle. Sadly, in order to port the same structure to AVX, I'd have to reorganize all my data so that it works in packs of 16 floats instead of 8.


(is it me or quoting doesn't work in the toolbar?)

