Intel® ISA Extensions

SSE2 vectorized code seems to run slower than non-vectorized code

acctpurge_a_1
Novice

Hi everyone:

This is my first time posting to the forum. I have a lot of experience designing and optimizing assembly language routines for signal processing. Until recently, this was on what you might call predictable architectures (Moto 56k DSP and PowerPC). I am now doing this on an x86, and having difficulty understanding where timing changes are occurring.

I'm only interested in optimizing one loop: a second-order section (a basic building block of digital filters). The loop processes one sample, reading it in from memory, doing the math and updating the internal states, and writing the sample out again (on top of where it was read from). The loop executes N times, where N is the number of samples in the buffer.

I began by writing the loop in C++, then turning on the SSE2 optimizations in the compiler (Visual Studio) and grabbing the disassembly. I then manually removed all the unnecessary re-loads of registers that never changed and whatnot, and ended up with something like a 15% improvement in speed. So far so good.
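
For reference, the C++ loop is essentially the following (a simplified sketch with illustrative names, not our production code):

 // Transposed direct-form II second-order section, in place.
 // z1/z2 are the internal states carried from sample to sample.
 void biquad(double* samples, int n,
             double b0, double b1, double b2,
             double a1, double a2)
 {
     double z1 = 0.0, z2 = 0.0;
     for (int i = 0; i < n; ++i) {
         double x  = samples[i];
         double v0 = b0 * x + z1;    // output sample
         z1 = b1 * x - a1 * v0 + z2; // update first state
         z2 = b2 * x - a2 * v0;      // update second state
         samples[i] = v0;            // write back over the input
     }
 }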

All the arithmetic is double precision, with the core being five multiplies and four adds. The compiler did not vectorize. I combined four multiplies into two vectorized multiplies, and three adds into two vectorized adds (there is a redundant add to zero). Using both halves of XMM registers also freed up space so all the constants could be loaded into registers outside the loop. The total code length is shorter. And yet, this version runs slightly *slower* than the previous version (by maybe 5%). I've verified that the code is in fact correct and the numbers are not becoming denormalized.

I've read a good chunk of Agner Fog's work on this topic, and I believe I've not done anything too foolish. I've vainly tried moving instructions around in the loop. The loop code is aligned on a 16-byte boundary. None of this has made any difference. (This is all on a Core i7-860.) So at this point I'm pretty confused. If anyone can help me out with some understanding or things to try, I'd be very grateful. With that, here's the non-vectorized loop (I've omitted the preamble in which some registers are loaded up):

_loop:
 movsd xmm7,mmword ptr [eax] // x = samples
 movapd xmm6,xmm7 // x
 mulsd xmm6,xmm1 // b0*x
 addsd xmm6,xmm4 // v0 = b0*x + z1
 movsd xmm4,mmword ptr [b1]
 mulsd xmm4,xmm7 // b1*x
 movsd xmm0,mmword ptr [a1]
 mulsd xmm0,xmm6 // v0*a1
 subsd xmm4,xmm0 // x*b1-v0*a1
 addsd xmm4,xmm5 // x*b1-v0*a1+z2
 movsd mmword ptr [eax],xmm6 // samples = v0
 add eax, 8 // i++
 movsd xmm5,mmword ptr [b2]
 mulsd xmm5,xmm7 // x*b2
 mulsd xmm6,mmword ptr [a2] // v0*a2
 subsd xmm5,xmm6 // x*b2-v0*a2
 sub ecx, 1
 jg _loop

And here is the vectorized version (again omitting the preamble):

_loop:
 movsd xmm7,mmword ptr [eax] // x = samples
 movapd xmm6,xmm7 // x
 mulsd xmm6,xmm1 // x*b0
 addsd xmm6,xmm4 // v0 = x*b0 + z1
 movsd mmword ptr [eax],xmm6 // samples = v0
 add eax, 8 // i++
 shufpd xmm7, xmm7, 0 // x|x
 mulpd xmm7, xmm2 // x*b2|x*b1
 shufpd xmm6, xmm6, 0 // v0|v0
 mulpd xmm6, xmm3 // v0*a2|v0*a1
 subpd xmm7, xmm6 // x*b2-v0*a2|x*b1-v0*a1
 psrldq xmm4, 8 // 0|z2
 addpd xmm4, xmm7 // x*b2-v0*a2|x*b1-v0*a1+z2
 sub ecx, 1
 jg _loop

Thanks in advance for any help you can give.


Best wishes,

Tom Kite

SergeyKostrov
Valued Contributor II
>>...I began by writing the loop in C++, then turning on the SSE2 optimizations in the compiler (Visual Studio)...

Which compiler, Intel or Microsoft, and which options do you use? If there is hard-coded unrolling in your code it could "conflict" with the compiler's attempts to optimize (vectorize and/or unroll), especially when /O2 or /O3 is used. By removing hard-coded unrolling and letting the compiler process a rolled loop, it is possible to boost the performance of Intel C++ generated code by more than ~1.75x in some cases (based on my recent verification in my own codes). Note: the processing I'm talking about is very memory bound and GBs of data are processed; there, Intel C++ generated code outperforms code from the Microsoft, MinGW and Borland C++ compilers.
acctpurge_a_1
Novice

Hi Sergey:

All our coding is done in Visual Studio. The optimizations are /O2 and /Ot, plus of course SSE2 is enabled. As I said, I took the code generated with these optimizations and then improved it by removing unnecessary loads and stores (ones that accomplished nothing), reducing execution time by 15%. After that I vectorized the code in a couple of places, combining pairs of mulsd into mulpd. The loop structure was otherwise unchanged; I wouldn't consider this unrolling. Also, I'm sure this code is FPU-bound, not memory-bound.

Tom

bronxzv
New Contributor II

tomk@ap.com wrote:

And here is the vectorized version (again omitting the preamble):

[vectorized loop snipped - quoted in full above]

I wouldn't call it "vectorized", since it actually processes one element per loop iteration; it's just using a few packed instructions to reduce the total instruction count a bit.

There are long dependency chains in the critical path and, worse, the loop iterations aren't independent, since the output in xmm4 is used as input in the next iteration. I'd say this code is clearly execution-latency bound (*) and can't possibly use all the available throughput. Only truly vectorized code with independent iterations gets a good speedup from the packed instructions; unfortunately, not all algorithms are easily amenable to a vectorized form.

(*) Have a try with 2 threads and Hyper-Threading; it should provide a good speedup and help the shorter "vectorized" version scale better than the other one (provided you are not memory-bandwidth bound).
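
To put rough numbers on it (taking Agner Fog's latency tables for your CPU as an assumption: ~3 clocks for addsd/subpd/addpd, ~5 for mulpd, ~1 for shufpd), the loop-carried chain through xmm4 is addsd -> shufpd -> mulpd -> subpd -> addpd, so roughly 3 + 1 + 5 + 3 + 3 = 15 clocks per sample no matter how few instructions the loop contains. That is why trimming the instruction count doesn't show up in the timing.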

acctpurge_a_1
Novice

Hi Bronxzv:

You are right - this algorithm maintains state from one iteration to the next, and there is no way around it. I understand your distinction between true vectorization and simple instruction packing. I'm still a bit disappointed that the shorter code doesn't execute faster (actually, a little slower); I'm guessing that is more bad luck in how the latencies happen to line up than anything else.

Anyway, thanks for your comments. You've convinced me that vectorization in SSE2 is not quite as simple as finding multiplies that can be done at the same time.

Best wishes,

Tom

bronxzv
New Contributor II

tomk@ap.com wrote:
Anyway, thanks for your comments. You've convinced me that vectorization in SSE2 is not quite as simple as finding multiplies that can be done at the same time.

I'm wrestling with exactly this these days with AVX2 FMA (5 clocks of latency and 16 DP flops per clock of available throughput; that's 80 flops in flight for the duration of a single instruction). There it's even more important to study the total latency of your critical path (try to split it into several independent paths, each with roughly the same total latency) and to avoid dependencies between loop iterations like the plague.
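
As an illustration of what I mean by several independent paths, here is a sketch (not my actual code; it assumes FMA3 support and n a multiple of 16) of a reduction split into four chains so the ~5-clock FMA latency can be hidden:

 #include <immintrin.h>

 // Dot product with four independent accumulators: each fmadd chain
 // is latency bound on its own, but the four chains overlap and keep
 // the FMA units fed; the chains are merged only once, at the end.
 double dot(const double* a, const double* b, int n)
 {
     __m256d acc0 = _mm256_setzero_pd(), acc1 = _mm256_setzero_pd();
     __m256d acc2 = _mm256_setzero_pd(), acc3 = _mm256_setzero_pd();
     for (int i = 0; i < n; i += 16) {
         acc0 = _mm256_fmadd_pd(_mm256_loadu_pd(a + i),      _mm256_loadu_pd(b + i),      acc0);
         acc1 = _mm256_fmadd_pd(_mm256_loadu_pd(a + i + 4),  _mm256_loadu_pd(b + i + 4),  acc1);
         acc2 = _mm256_fmadd_pd(_mm256_loadu_pd(a + i + 8),  _mm256_loadu_pd(b + i + 8),  acc2);
         acc3 = _mm256_fmadd_pd(_mm256_loadu_pd(a + i + 12), _mm256_loadu_pd(b + i + 12), acc3);
     }
     __m256d s = _mm256_add_pd(_mm256_add_pd(acc0, acc1),
                               _mm256_add_pd(acc2, acc3));
     double t[4];
     _mm256_storeu_pd(t, s);
     return t[0] + t[1] + t[2] + t[3];
 }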

jimdempseyatthecove
Honored Contributor III

Looking at your assembly code, it appears you have two xmm registers to spare. If this is worth your time, I would suggest investigating whether you can fetch samples[i+1] | samples[i] with one movapd, defer the v0 = and the store of the samples (actually samples[i+1] | samples[i]), unroll the "pd" part of the loop twice, compute each half, and then store samples[i+1] | samples[i] with a single write before looping back. An off-the-cuff guess is that you might be able to recover 5 clock ticks per sample.
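
Roughly along these lines (an untested sketch of my reading of the idea, using intrinsics for brevity; the recurrence itself stays serial, only the memory traffic is paired, and n is assumed even):

 #include <emmintrin.h>

 void biquad_paired(double* samples, int n,
                    double b0, double b1, double b2,
                    double a1, double a2)
 {
     double z1 = 0.0, z2 = 0.0;
     for (int i = 0; i < n; i += 2) {
         __m128d x01 = _mm_loadu_pd(samples + i);        // samples[i+1] | samples[i]
         double x0 = _mm_cvtsd_f64(x01);
         double x1 = _mm_cvtsd_f64(_mm_unpackhi_pd(x01, x01));
         double v0 = b0 * x0 + z1;                       // sample i
         z1 = b1 * x0 - a1 * v0 + z2;
         z2 = b2 * x0 - a2 * v0;
         double v1 = b0 * x1 + z1;                       // sample i+1
         z1 = b1 * x1 - a1 * v1 + z2;
         z2 = b2 * x1 - a2 * v1;
         _mm_storeu_pd(samples + i, _mm_set_pd(v1, v0)); // one paired store
     }
 }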

 Jim Dempsey

acctpurge_a_1
Novice

@Jim:

Thanks for the suggestion - are you thinking that the paired loads/stores alone will save the time? I can definitely look into that. (The computation of v0 can't be deferred, because v0 is needed to compute everything else.)

Bronxzv's point still holds that there is a dependency chain from one loop iteration to the next, and it can't be broken if the filter is implemented in this way.

Best wishes, Tom

bronxzv
New Contributor II

tomk@ap.com wrote:
Bronxzv's point still holds that there is a dependency chain from one loop iteration to the next, and it can't be broken if the filter is implemented in this way.

As is the case with all IIR filters, AFAIK, the best way to maximize the speedup will be to process several channels in parallel, for example the left channel in element 1 and the right channel in element 0, with more opportunities in 5.1 audio, etc. It can work well even if the filter configuration parameters aren't the same, but obviously you have to apply the very same algorithm to all channels.
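
In sketch form with SSE2 intrinsics (illustrative and untested; interleaved L/R frames, with independent states per channel):

 #include <emmintrin.h>

 // One biquad over interleaved stereo: lane 0 = left, lane 1 = right.
 // Both channels advance in lockstep, so every operation is genuinely
 // two wide and the per-channel dependency chains overlap.
 void biquad_stereo(double* lr, int nFrames,
                    double b0, double b1, double b2,
                    double a1, double a2)
 {
     const __m128d vb0 = _mm_set1_pd(b0), vb1 = _mm_set1_pd(b1),
                   vb2 = _mm_set1_pd(b2), va1 = _mm_set1_pd(a1),
                   va2 = _mm_set1_pd(a2);
     __m128d z1 = _mm_setzero_pd(), z2 = _mm_setzero_pd();
     for (int i = 0; i < nFrames; ++i) {
         __m128d x  = _mm_loadu_pd(lr + 2 * i);
         __m128d v0 = _mm_add_pd(_mm_mul_pd(vb0, x), z1);
         z1 = _mm_add_pd(_mm_sub_pd(_mm_mul_pd(vb1, x),
                                    _mm_mul_pd(va1, v0)), z2);
         z2 = _mm_sub_pd(_mm_mul_pd(vb2, x), _mm_mul_pd(va2, v0));
         _mm_storeu_pd(lr + 2 * i, v0);
     }
 }

If the two channels need different coefficients, replace the _mm_set1_pd broadcasts with _mm_set_pd(right, left); the loop body stays the same.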

acctpurge_a_1
Novice

Hi Bronxzv:

You are right that the best way to exploit parallelism is to process several channels at once, either with SSE2 vectorization or by distributing channels over different cores. However, we were also looking to get a speedup on single channels.

One way to do that is to refactor two second-order sections in series into two parallel sections. The two sections can then run independently up to the point where their outputs have to be summed to form the output sample. A simple transformation of the filter coefficients achieves an identical response. I'm looking into this now and the results look very promising.
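
In sketch form (the coefficient sets c*/p* and d*/q* come from that transformation, which I've omitted here; names are illustrative):

 // Two parallel second-order sections fed the same input; their
 // recurrences are independent until the final sum.
 void biquad_parallel(double* samples, int n,
                      double c0, double c1, double c2, double p1, double p2,
                      double d0, double d1, double d2, double q1, double q2)
 {
     double y1 = 0.0, y2 = 0.0; // states of section 1
     double w1 = 0.0, w2 = 0.0; // states of section 2
     for (int i = 0; i < n; ++i) {
         double x = samples[i];
         double u = c0 * x + y1;      // section 1
         y1 = c1 * x - p1 * u + y2;
         y2 = c2 * x - p2 * u;
         double v = d0 * x + w1;      // section 2: an independent chain
         w1 = d1 * x - q1 * v + w2;
         w2 = d2 * x - q2 * v;
         samples[i] = u + v;          // merge the two outputs
     }
 }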

Tom

bronxzv
New Contributor II

tomk@ap.com wrote:

[previous post snipped - quoted in full above]

Interesting, sounds good. As an added bonus you may well enjoy better numerical accuracy, as is typical when processing long summations in several independent chunks with a final merge. On top of that, AVX2 FMA may provide extra accuracy, so maybe you will be able to use floats with an S/N profile similar to the previous algorithm with doubles, but with much higher throughput? (My rough guesses.)

Bernard
Valued Contributor I

Coming late to this discussion:)

First, @Sergey, following your sentence about the unrolling: I cannot see any unrolling in the presented assembly code. That code processes one double-precision element per loop cycle. If the presented code has any invariants, they can be moved outside of the loop. By crude estimation, the first two lines of the vectorized code cannot be executed in parallel, and the next two lines are dependent on each other, effectively stalling two execution ports.

Bernard
Valued Contributor I

I wonder how the LSD (loop stream detector) mechanism could help in this case? It seems that the main loop has fewer than 28 micro-ops.
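
Counting it: the "vectorized" loop is 15 instructions, nearly all single-µop (and sub/jg may macro-fuse), so both loops should fit in the 28-µop loop buffer with room to spare. That removes fetch and decode from the picture, but with a loop-carried dependency chain of roughly 15 clocks per sample, the front end is unlikely to be the bottleneck here anyway.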
