Intel® ISA Extensions

Weeks of optimizations and still can't beat the intel compiler

mr_nuke
Beginner

I have some FLOP-intensive code that only gets around 25%-30% of the theoretical peak performance on a Core i7-920.

I'm attaching the code as a Visual Studio project. In electro_SSE.cpp, you will find three implementations of int CalcField_CPU_T_Curvature. The first one is a generic template, completely unaware of vectorization. The second is my attempt (quite awful, I would say) at manually vectorizing everything with inline assembly, and the third is a more elegant implementation using intrinsics, which also reduces the number of memory loads by a factor of 4 versus the second version.

All three execute at exactly the same speed of 22-23 GFLOP/s. The only difference between the first implementation and the other two is that the first causes the compiler to generate rsqrt at one point, while the other two use the more precise sqrt.

This limitation in performance makes me believe that the CPU may not be capable of more for this specific dataset, as I have already expressed here:

http://software.intel.com/en-us/forums/showthread.php?t=71615

Some have requested that I post a simplified version of the source code, so here is an isolated test case that will work on its own.

==Alex

TimP
Honored Contributor III

I had plenty of trouble with the project file, but I couldn't see that you set /Qansi-alias, which I needed in order to beat your quoted performance. /QxSSE4.2 -O3 helped a little as well.

The run appears to be too short to get consistent timings from QueryPerformanceCounter. Perhaps _rdtsc would do better.
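Something along these lines would do for a rough cycle count (just a sketch; the timed function here is a dummy stand-in, not your kernel):

[plain]
#include <intrin.h>   // __rdtsc() on MSVC/ICL
#include <cstdio>

// Dummy stand-in for the kernel being timed.
static float dummy_kernel()
{
    float sum = 0.0f;
    for (int i = 0; i < 1000000; ++i)
        sum += i * 0.5f;
    return sum;
}

int main()
{
    unsigned __int64 t0 = __rdtsc();
    float result = dummy_kernel();
    unsigned __int64 cycles = __rdtsc() - t0;
    printf("%llu cycles (result %f)\n", (unsigned long long)cycles, (double)result);
    return 0;
}
[/plain]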

VTune tells me you are running 7 threads when I don't set -Qip-. When I do set that, it spends most of the time in 3 threads. But VTune never sees the faster cases.

I'm having trouble understanding your organization, so I can't tell whether you gave each thread a private copy of the variables which are written, particularly the 4-element Accum[] struct. Maybe these issues have an influence on the erratic timing. Unfortunately, I'm running Win7, so there is no Thread Checker.

Did you make a reference version in C so we could check what the vectorizing compiler does with it?

It looks like removing the parallel sections enables it to perform consistently.

mr_nuke
Beginner

Thanks for looking over the code.

What trouble did you have with the project file?

Here are the compiler options (I actually use the Intel C++ compiler):

[plain]/c /O3 /Ob2 /Oi /Ot /Qipo /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /EHsc /MT /GS /arch:SSE3 /fp:fast=2 /Fo"x64Release/" /W3 /nologo /Qopenmp /QaxSSE4.2 /Qparallel[/plain]

/Qansi-alias and /QxSSE4.2 don't seem to affect performance in any way.

I think the inconsistent timing comes from thread creation overhead. I forgot to put "normal" in the command line arguments. If you set that, it should use a large enough dataset (33 second runtime) that the thread creation overhead is negligible.

Also, regarding compiler performance, cl reaches 23 GFLOP/s, while ICL is struggling to get over 22. I think this is due to some unaligned 'movups' loads ICL is generating on aligned data.

The prevPoint[], Accum[], and prevAccum[] elements are local to each 'line', and thus have to be thread local in order for the results to be correct.
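Concretely, that just means something like this (a standalone sketch, not the real code; the names simply mirror the ones above):

[plain]
void calc_lines(float *out, int nLines)
{
    #pragma omp parallel for
    for (int line = 0; line < nLines; ++line)
    {
        // Declared inside the loop body, so each thread works on its
        // own private copies of the per-line accumulators.
        float prevPoint[4] = {0};
        float Accum[4]     = {0};
        float prevAccum[4] = {0};

        for (int i = 0; i < 4; ++i)
            Accum[i] = prevPoint[i] + prevAccum[i] + (float)line;

        out[line] = Accum[0] + Accum[1] + Accum[2] + Accum[3];
    }
}
[/plain]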

I'm also pretty certain that, on the large dataset, all 8 threads are fully utilized. I've checked this over hundreds of runs with Task Manager, and by watching the core temperature rise from 60C to 87-89C (the same as what several hours of Prime95 would do). I might be wrong, of course, but removing the parallel sections doesn't affect performance on my system (for the large dataset) one bit. That's because I specifically enable dynamic and nested OpenMP.
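For reference, this is the kind of setup I mean (a minimal standalone sketch, not the project code):

[plain]
#include <omp.h>
#include <cstdio>

int main()
{
    // Let the runtime adjust thread counts dynamically and
    // allow nested parallel regions.
    omp_set_dynamic(1);
    omp_set_nested(1);

    #pragma omp parallel num_threads(2)
    {
        int outer = omp_get_thread_num();
        #pragma omp parallel num_threads(4)
        {
            #pragma omp critical
            printf("outer thread %d, inner thread %d\n", outer, omp_get_thread_num());
        }
    }
    return 0;
}
[/plain]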

I don't have a C variant of the code. What I can say is that the first function (the template one) got 'rsqrtps' from the compiler, instead of 'sqrtps' (which is what I use). I can pass 30 GFLOP/s by using rsqrtps in the hot (innermost) loop, but the precision is too low for my needs.

On a side note, what platform are you running this on? I'm using a Core i7-920, HT enabled, Turbo Boost enabled, with 3-channel DDR3-1600 and QPI @ 6.4 GT/s, on a 64-bit OS.

TimP
Honored Contributor III

movups is documented as having the same performance as movaps on Core i7, provided that store to memory is aligned. On recent AMD CPUs, supposedly even the latter requirement may be avoided.

If the compiler uses rsqrtps, it follows up with a Newton iteration to produce results which should not vary more than 1.5 ULP from the IEEE accurate one. I set /Qprec-div /Qprec-sqrt so it will not use rsqrtps.
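In intrinsics, that sequence amounts to roughly the following (my sketch of the usual rsqrt refinement, not the compiler's literal output):

[plain]
#include <xmmintrin.h>

// rsqrtps followed by one Newton-Raphson step:
// x1 = 0.5 * x0 * (3 - a * x0 * x0), where x0 ~ 1/sqrt(a)
static inline __m128 rsqrt_nr(__m128 a)
{
    __m128 x   = _mm_rsqrt_ps(a);                       // ~12-bit approximation
    __m128 axx = _mm_mul_ps(a, _mm_mul_ps(x, x));       // a * x0^2
    return _mm_mul_ps(_mm_mul_ps(_mm_set1_ps(0.5f), x),
                      _mm_sub_ps(_mm_set1_ps(3.0f), axx));
}
[/plain]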

I'm running on a Core i7 with HT disabled. On my version, Turbo isn't active when more than one core is in use. I have the 4-slot model with a DIMM in each slot, which, according to the documentation, cuts memory performance back to 1066. I don't feel like looking it up, but I don't believe any Core i7 is rated as high as 6.4 GT/s.

My Core I7, as originally configured, overheated chronically, due to the exhaust from the Nvidia card being directed into the CPU. I haven't had any issue after changing air flow path.

Your 'normal' run option gives me a stable report of 26.7 GFLOP/s. VTune event sampling does show clearly that the pipeline is stalled during the sqrtps and divps operations, which is expected.

tthsqe
Beginner

tim18,

What can the pipeline stalls on sqrtps and divps be filled with? More sqrtps and divps? Just mulps and addps?

TimP
Honored Contributor III

Replacing sqrtps by rsqrtps (and iterative improvement) nearly eliminates the pipeline stall. No other operations can proceed in the FPU during the sqrt and div pipeline stalls; only operations assigned to other ports (in real applications, most likely would be reads from memory).

tthsqe
Beginner

Ok. If I'm understanding you correctly, the approximate square root and reciprocal functions allow for concurrent multiplications and additions. Then, given this, how can we maximize throughput on the calculation of x = a^(-3/2)? Here is what I am currently thinking:

x = RCPPS(a)*RSQRTPS(a) ; accurate to about 12 bits

x = 0.5*x*(3.0 - a*(a*x)^2) ; accurate to about 24 bits

maximum latency = 6 (RCPPS) + 6 (RSQRTPS) + 6*7 (MULPS) + 5 (SUBPS) = 59

and this way seems better than

x = 1.0 / SQRTPS(a*a*a)

minimum latency = 40 (SQRT) + 16 (DIV) + 2*7 (MULPS) = 70
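In intrinsics, the first variant would look something like this (just a sketch of what I mean):

[plain]
#include <xmmintrin.h>

// x0 = rcpps(a) * rsqrtps(a) ~ a^(-3/2), then one refinement step:
// x = 0.5 * x0 * (3 - a*(a*x0)^2)
static inline __m128 pow_minus_3_2(__m128 a)
{
    __m128 x  = _mm_mul_ps(_mm_rcp_ps(a), _mm_rsqrt_ps(a));      // ~12 bits
    __m128 ax = _mm_mul_ps(a, x);                                // a * x0
    __m128 t  = _mm_sub_ps(_mm_set1_ps(3.0f),
                           _mm_mul_ps(a, _mm_mul_ps(ax, ax)));   // 3 - a^3 * x0^2
    return _mm_mul_ps(_mm_mul_ps(_mm_set1_ps(0.5f), x), t);
}
[/plain]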

TimP
Honored Contributor III

Yes, you have several options, based on several associations of

1/a/sqrt(a)

and cubing first is most likely to increase round-off error or run into over/underflow.

Combinations involving the IEEE accurate sqrtps and divps, unfortunately, will be slower than other vectorized choices.

As you point out, by writing iterative improvement out yourself, you might avoid extra code and reduce round-off error incurred when a compiler iterates on both 1/a and 1/sqrt(a).

Latency of sqrtps was smaller on the Penryn architecture; I don't know if the faster sqrt may reappear on a future CPU.

mr_nuke
Beginner

The way I usually do it is x = 1/(a*sqrt(a)), as I'm not a big fan of approximations. That, and I don't quite understand how those approximations can produce more accurate results than what rcpps and rsqrt give.

I also noticed that the compiler likes to replace divps with rcpps, 2 mulps, an addps, and a subps, which, at least in my test case, is slower. On the first function in my code, the template one, the compiler does this approximation without /Qprec-div and gets to the 22-23 GFLOP/s I originally quoted. With /Qprec-div, divps is used, and that brings things to a nicer 25 GFLOP/s. I believe that's how you got 26.7 GFLOP/s, but rcpps is still used, and the results are a little too far off.
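(For the record, that rcpps expansion amounts to roughly the following; this is my sketch, not the compiler's exact instruction sequence:)

[plain]
#include <xmmintrin.h>

// Approximate 1/a: rcpps plus one Newton-Raphson step, x1 = x0 * (2 - a*x0)
static inline __m128 rcp_nr(__m128 a)
{
    __m128 x = _mm_rcp_ps(a);        // ~12-bit approximation
    return _mm_mul_ps(x, _mm_sub_ps(_mm_set1_ps(2.0f), _mm_mul_ps(a, x)));
}
[/plain]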

Anyway, considering that I'm using one sqrtps for every 18 FLOPs, could that explain the crappy performance I'm seeing?

Replacing a combination of sqrt/div/mul with a combination of rsqrt/mul/div, I can get to 30 GFLOP/s (in the inline assembly function), but at a huge accuracy cost. I still think I may be bound by something other than compute power or efficiency.

== Alex

tthsqe
Beginner

As a quick test I ran

[plain]
movaps  xmm0, [a]        ; xmm0 = a
rcpps   xmm1, xmm0       ; xmm1 ~= 1/a
rsqrtps xmm2, xmm0       ; xmm2 ~= 1/sqrt(a)
mulps   xmm1, xmm2       ; x0 ~= a^(-3/2)
movaps  xmm2, xmm1
mulps   xmm2, [mhalf]    ; xmm2 = -0.5*x0
mulps   xmm1, [a]        ; a*x0
mulps   xmm1, xmm1       ; (a*x0)^2
mulps   xmm1, [a]        ; a^3*x0^2
addps   xmm1, [mthree]   ; a^3*x0^2 - 3
mulps   xmm1, xmm2       ; xmm1 = -0.5*x0*(a^3*x0^2 - 3)
[/plain]

with the data:

[plain]
a      dd 0.1, 0.5, 2.0, 10.0
mhalf  dd -0.5, -0.5, -0.5, -0.5
mthree dd -3.0, -3.0, -3.0, -3.0
[/plain]

and got in xmm1 the results:

3.1622774*10^1

2.8284261*10^0,

3.5355327*10^-1,

3.1622764*10^-2,

compare them with the more accurate

3.16227766016838*10^1

2.82842712474619*10^0,

3.53553390593274*10^-1,

3.16227766016838*10^-2,

As Tim said, the main problem is that div and sqrt completely stall the pipeline. Maybe Intel should design a CPU that doesn't do this?? On the plus side, if you do use Newton's method, you can count a^(-3/2) as 8 flops instead of one. :) The only time you can think about approaching the maximum of 4 flops per clock is with muls and adds/subs (preferably in equal numbers).
