(I'm not sure this is the best place to post this; please tell me if it should go somewhere else.)
I have a programmer's dream in a project:
- most of the CPU time is spent in one small & simple loop
- that loop has hardly any memory access (writes only, appending to a buffer that's already in the cache)
- all the data is in 32-byte aligned 32-byte packs
So it was perfect for a change from SSE1 to AVX. I had a new i7 and was on Win7; everything was perfect except for one thing: I write in Delphi, which doesn't support AVX (& no intrinsics either, sigh...). So I painfully assembled the instructions with Flat Assembler & copied the machine code back into Delphi as DB bytes. Sadly, the code was hardly faster: less than 10%. On one hand, I could benefit from the 3-register form, and since I replaced 2-register 16-byte SSE accesses with 1-register 32-byte AVX accesses, I freed up registers (it's 32-bit code, btw) & thus had even less memory access. On the other hand, I probably lost some multiple-instructions-per-cycle execution, because the code is less parallel. But I can't believe it's still less than 10% faster. Could it be the extra byte per instruction in the AVX version? (Quite ironic, since one of the key advantages listed for AVX is that it's "compact".)
& it became this (I skipped the surrounding code, so it's not quite the same as above, but it does the same thing):
DB $C5,$FC,$28,$FA //VMOVAPS ymm7,ymm2
DB $C5,$DC,$59,$D9 //VMULPS ymm3,ymm4,ymm1
DB $C5,$FC,$28,$D1 //VMOVAPS ymm2,ymm1 // snm2:=snm1
DB $C5,$E4,$5C,$CF //VSUBPS ymm1,ymm3,ymm7 // cos:=cb*snm1-snm2
DB $C5,$F4,$59,$D8 //VMULPS ymm3,ymm1,ymm0 // amp
// add low & high halves, then output
DB $C4,$E3,$7D,$19,$DF,$01 //VEXTRACTF128 xmm7,ymm3,1
DB $C5,$FC,$5C,$C5 //VSUBPS ymm0,ymm0,ymm5 // TPckTone(EAX).GoalVol
DB $C5,$E0,$58,$DF //VADDPS xmm3,xmm3,xmm7
DB $C5,$FC,$59,$C6 //VMULPS ymm0,ymm0,ymm6 // TPckTone(EAX).VolSpeed
DB $C5,$E0,$58,$1A //VADDPS xmm3,xmm3,[EDX]
DB $C5,$FC,$58,$C5 //VADDPS ymm0,ymm0,ymm5 // TPckTone(EAX).GoalVol
DB $C5,$F8,$29,$1A //VMOVAPS [EDX],xmm3
ADD EDX,16
SUB ECX,1
(Attempts at unrolling & reordering lines didn't really succeed. The 3 lines I interleaved at the end don't actually make any difference, while I was expecting them to hide some latency.)
Now, instead of processing 8 32-bit floats at once, I could make bigger changes to my code & process 16 at once; then I'd have exactly the same structure as the original code, but using ymm registers. But I wonder if it would really improve the speed... it's quite tedious, & it'd be sad if it didn't.
Also, the more I try to understand the rules for getting multiple instructions per cycle, the less I do. In the past it seemed pretty simple: you could pair operations that were totally independent. But I don't understand how it works these days. For example, at the top you see

MULPS xmm3,xmm1
SUBPS xmm3,xmm2 // cos:=cb*snm1-snm2
MULPS xmm7,xmm5
SUBPS xmm7,xmm6 // cos:=cb*snm1-snm2

which I would normally write like this:

MULPS xmm3,xmm1
MULPS xmm7,xmm5
SUBPS xmm3,xmm2 // cos:=cb*snm1-snm2
SUBPS xmm7,xmm6 // cos:=cb*snm1-snm2

If I left it as the first version, it's because I reordered things at random (what I always do these days) & kept the fastest, but I don't understand it. In the same way, the first SSE1 code was totally different & should have been faster (no memory reads, everything in registers), but no, it was slower.
(Edit: thanks to the 3 operands per instruction, I could simplify these down to 3 instructions, and it's faster, but for "other reasons")

DB $C5,$FC,$5C,$C5 //VSUBPS ymm0,ymm0,ymm5 // TPckTone(EAX).GoalVol
DB $C5,$FC,$59,$C6 //VMULPS ymm0,ymm0,ymm6 // TPckTone(EAX).VolSpeed
DB $C5,$FC,$58,$C5 //VADDPS ymm0,ymm0,ymm5 // TPckTone(EAX).GoalVol
I suppose you'd have to count micro-ops to compare the effective code size at execution. If you got more fusion in the SSE code, that could help minimize the difference in run time; as you've already pointed out, multiple issue also contributes. You don't show what you do about loop body code alignment. I've noticed that Intel compilers take more care about it when the AVX option is set. Without specified alignment, you may still get full performance, but more luck is involved. And if your data miss L1, you get no advantage from attempting to access 256 bits on one cycle, regardless of whether you do it with 1 or 2 instructions.
>>You don't show what you do about loop body code alignment
I align the loop to 16 bytes, but it makes no difference. (Actually I can only align in Delphi XE2, but I still use Delphi 2007; can you believe that code alignment is a "new feature" in Delphi?)
>>If your data miss L1, you get no advantage for attempting to access 256 bits on one cycle, regardless of whether you do it with 1 or 2 instructions
Well, in this loop I only have one read/write, and weirdly it's not what's eating the most time (at all). I also prefetch it, so that it's ready when I need to read it (but it's normally in the cache already). Prefetching is also something I understand in theory but not in practice: I have never, ever gotten any boost from prefetching.
So... most likely it really is that the SSE code allows more instructions per cycle. Sadly, to port that same scheme to AVX, I'd have to reorganize all my data to work in packs of 16 instead of 8.
(Is it just me, or does quoting from the toolbar not work?)