Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

exp() blocking vectorization

jeff_keasler
Beginner

Hi,

When I compile the following function with the 11.1.046 compiler and the options "-O3 -ip -ansi-alias -restrict -msse3 -unroll-aggressive -mkl=sequential -fp-model precise -fp-model source", I'm definitely not getting any vectorization.

void vecExp(double * __restrict in, double * __restrict out, int len)
{
  __assume_aligned(in, 16);
  __assume_aligned(out, 16);
  for (int i = 0; i < len; i++) {
    out[i] = exp(in[i]);
  }
}

The loop above is actually much simpler than the loop in my real code. In my actual loop, when I remove the exp() call, the loop contains all packed SSE and movaps instructions. The addition of exp() results in switching to all movsd and scalar SSE instructions. This is very bad because my real loop accounts for at least 5% of total memory traffic in my code, and maybe (???) 10% of all my math.

Is there something that can be done to help? It seems that Intel could implement a special version of exp() that operates on two doubles at a time and gets inlined. The exp() function is a very frequent operation in the High Performance Computing community.

Also, am I getting the Intel Math Library version (libimf.a) of exp() by default? I'm afraid to include the mathimf.h header because of this post saying that mathimf.h results in 2x slower code: http://software.intel.com/en-us/forums/showthread.php?t=55915 . If that's true, do I need to direct my compiler to use libmmd.lib, which the article says is faster?

Separately, am I getting automatic MKL VML support by using the "-mkl=sequential" option? I tried the -opt-report option, but it doesn't say whether the compiler is substituting VML calls.
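
For reference, here is roughly what I would write if I called VML explicitly myself rather than relying on the compiler to substitute it (just a sketch, assuming vdExp from mkl.h is the right entry point; I haven't benchmarked it):

#include <mkl.h>   /* assuming this pulls in the VML prototypes */

void vecExpVML(const double *in, double *out, MKL_INT len)
{
    /* one VML call replaces the whole loop; accuracy could presumably
       be tuned with vmlSetMode() (VML_HA vs. VML_LA) if that matters */
    vdExp(len, in, out);
}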

Thanks,
-Jeff
8 Replies
mtlroom
Beginner
Did you check if it's ok to mix fpu and sse code? If you can't mix, try to move exp stuff to a separate loop if possible.
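
Something like this is what I mean (just a sketch; scratch and your_scalar_math() are placeholders for whatever your real loop does):

// first loop: the vectorizable math, written to a scratch array
for (int i = 0; i < len; i++)
  scratch[i] = your_scalar_math(i);   // packed SSE friendly part

// second loop: nothing but the exp() calls
for (int i = 0; i < len; i++)
  out[i] = exp(scratch[i]);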
jeff_keasler
Beginner
Quoting - mtlroom
Did you check if it's ok to mix fpu and sse code? If you can't mix, try to move exp stuff to a separate loop if possible.
I'm not sure what the guts of the exp function look like. Unfortunately, my code loads from some arrays at the top, then does a lot of scalar math based on those loads, then does the exp (dependent on the scalar math), then does just a few more scalar math operations, then stores to an array. Basically, the worst-case problem.
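
Roughly the shape of the real loop, with made-up names and constants just to show the dependence structure:

for (int i = 0; i < len; i++) {
  double a = A[i], b = B[i];        // loads at the top
  double t = a*b + c1*a - c2*b;     // stand-in for the scalar math
  double e = exp(t);                // exp depends on that math
  out[i] = c3*e + a;                // a little more math, then the store
}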
jimdempseyatthecove
Honored Contributor III

Jeff,

Can you multi-thread the routine?
One thread, all vectorized, producing the non-exp() portion of the results, and a second thread using the FPU producing the exp() portion of the results.

Using a buffer initialized to NaNs, the calculation portion (the part producing in in your code) overstrikes the NaNs as it progresses. The exp thread monitors the buffer (in) for the transition from NaN to a valid number. When the transition occurs, produce out[i] = exp(in[i]).

for (i = 0; i < count; i++) {
  while (isnan(in[i])) _mm_pause();
  out[i] = exp(in[i]);
}

Something like the above for the exp thread. Count could be the pre-known number of results, or you could use a volatile shared variable count, initialized to a large number, then reduced when the actual count is known.

for (i = 0; i < count; i++) {
  while (isnan(in[i]))
  {
    if (i >= count) break; // count volatile, reduced in input construction
    _mm_pause();
  }
  if (i >= count) break; // count volatile, reduced in input construction
  out[i] = exp(in[i]);
}

You can experiment as to whether you want to use SwitchToThread(), _mm_pause(), Sleep(0), or...
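
For reference, the producer side might look something like this sketch (compute_value() stands in for your scalar math, and the in buffer is pre-filled with NaN before the threads start):

// pre-fill in[] with NaN once, before launching the two threads
// (std::numeric_limits needs #include <limits>)
for (i = 0; i < count; i++)
  in[i] = std::numeric_limits<double>::quiet_NaN();

// producer thread: overstrike the NaNs with real values as they are computed
for (i = 0; i < count; i++)
  in[i] = compute_value(i);  // assumed never to produce a NaN itself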

Jim Dempsey
jeff_keasler
Beginner

Quoting - jimdempseyatthecove
Thanks Jim. What you suggest would probably work for me, but I think it would ultimately drag down my performance. This loop is within the context of a parallel code, where each task will already be assigned to an available thread. I'm optimizing the serial case right now, even though I realize I may need to back off on some of these optimizations to get better overall throughput.

I think the preferred solution would be to have a small core set of math functions available as intrinsics that can take advantage of the packed SSE case. I know that financial applications can also be heavily dependent on exp(), for instance. I think if someone were able to do a serious analysis of the transcendental functions used in High Performance Computing apps, they would find there is a handful of operations that are used very heavily. It's well worth it to the customer to have those available as intrinsics.
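
To illustrate the kind of thing I mean, if a packed two-double exp intrinsic were exposed (call it _mm_exp_pd; I don't know whether 11.1 actually provides such a thing), the loop could be written roughly like this:

#include <emmintrin.h>  // SSE2
#include <math.h>

void vecExp(double * __restrict in, double * __restrict out, int len)
{
    int i;
    for (i = 0; i + 2 <= len; i += 2) {
        __m128d v = _mm_load_pd(&in[i]);       // two doubles per iteration (16-byte aligned)
        _mm_store_pd(&out[i], _mm_exp_pd(v));  // hypothetical packed exp
    }
    for (; i < len; i++)                       // remainder
        out[i] = exp(in[i]);
}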
mtlroom
Beginner
Can anybody comment on the mixing of SSE2 and FPU code? I tried to search the web, but can't get a definite answer. Unlike MMX, the SSE2 registers aren't shared with the FPU, but some state or something like that is shared. Can anybody say for sure whether mixing SSE2 and FPU code may be the reason for slowdowns?
jimdempseyatthecove
Honored Contributor III

When using multiple threads to produce the "in" (different portions of the array) you might be able to get by with n-1 threads producing the in and one thread producing the out=exp(in); (written to run through the separate portions of the output as produced by each of the other threads).

You could also consider breaking your one loop into two loops; however, the out=exp(in); loop will be memory-write intensive. Using a separate thread early (during production of in) would permit the computation portion of producing the in values to overlap with the memory stalls on the write side of the out=exp(in); loop.

As an optimization, you could delay producing the out= until after a complete cache line of in is written. This way the over-striking of the NaNs will not encounter cache evictions. The loop performing the out=exp(in); will have some seemingly unnecessary branching, but your interest is in fast code, not elegant code.

As an alternative, you could perform the out=exp(in); loop after every cache-line's worth of in is filled. I.e., if in is float (4 bytes), then compute 16 values of in, followed by 16 conversions of out=exp(in);, then loop back for the next 16 (the last iteration may have fewer than 16 results). You may have to finesse the compiler to get it to vectorize due to the apparently short loop lengths, but that should not be too hard to do.
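
A rough sketch of that blocking for your double case (8 values per 64-byte cache line; compute_in() stands in for the scalar math that produces in):

enum { BLOCK = 8 };  // one 64-byte cache line of doubles

for (int base = 0; base < len; base += BLOCK) {
  int n = (len - base < BLOCK) ? (len - base) : BLOCK;
  for (int i = base; i < base + n; i++)   // fill one cache line of in
    in[i] = compute_in(i);
  for (int i = base; i < base + n; i++)   // then convert that line
    out[i] = exp(in[i]);
}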

Jim Dempsey


jeff_keasler
Beginner
Quoting - mtlroom
Can anybody comment on the mixing of SSE2 and FPU code? I tried to search the web, but can't get a definite answer. Unlike MMX, the SSE2 registers aren't shared with the FPU, but some state or something like that is shared. Can anybody say for sure whether mixing SSE2 and FPU code may be the reason for slowdowns?

Hi, I'm not sure that the fast version of exp() in the Intel Math Library (or MKL VML, for that matter) even uses the x87. If it does, I could see how that could complicate things.
mtlroom
Beginner
Quoting - jeff_keasler

Hi, I'm not sure that the fast version of exp() in the Intel Math Library (or MKL VML, for that matter) even uses the x87. If it does, I could see how that could complicate things.

Well, just try to step through it with the debugger in asm mode to see what it uses. Maybe it also uses the same registers as your code?