Hi,
When I compile the following function using the 11.1.046 compiler and the options "-O3 -ip -ansi-alias -restrict -msse3 -unroll-aggressive -mkl=sequential -fp-model precise -fp-model source", I'm definitely not getting any vectorization.
void vecExp(double * __restrict in, double * __restrict out, int len)
{
__assume_aligned(in, 16);
__assume_aligned(out, 16);
for (int i = 0; i < len; i++) {
out[i] = exp(in[i]);
}
}
The loop above is actually much simpler than the loop in my real code. In my actual loop, when I remove the exp() call, the loop contains all packed SSE and movaps instructions. The addition of exp() results in switching to all movsd and scalar SSE instructions. This is very bad because my real loop accounts for at least 5% of total memory traffic in my code, and maybe (???) 10% of all my math.
Is there something that can be done to help? It seems that Intel could implement a special version of exp() that operates on two doubles at a time and gets inlined. The exp() function is a very frequent operation in the High Performance Computing community.
Also, am I getting the Intel Math Library version (libimf.a) of exp() by default? I'm afraid to include the mathimf.h header because of this post saying that mathimf.h results in 2x slower code: http://software.intel.com/en-us/forums/showthread.php?t=55915 . If that's true, do I need to direct my compiler to use libmmd.lib, which the article says is faster?
Separately, am I getting automatic MKL VML support by using the "-mkl=sequential" option? I tried the -opt-report option, but it doesn't say whether the compiler is substituting VML calls.
Thanks,
-Jeff
8 Replies
Did you check whether it's OK to mix x87 FPU and SSE code? If you can't mix them, try moving the exp() work into a separate loop if possible.
Quoting - mtlroom
Did you check whether it's OK to mix x87 FPU and SSE code? If you can't mix them, try moving the exp() work into a separate loop if possible.
Jeff,
Can you multi-thread the routine?
One thread, all vectorized, producing the non-exp() portion of the results, and a second thread using the FPU producing the exp() portion of the results.
Using a buffer initialized to NaN's, the calculation portion (the one producing in in your code) overstrikes the NaN's as it progresses. The exp thread monitors the buffer (in) for the transition from NaN to a valid number. When the transition occurs, it produces out = exp(in).
for(i=0; i<count; i++) {
while(isnan(in[i])) _mm_pause();
out[i] = exp(in[i]);
}
Something like the above for the exp thread. Count could be the pre-known number of results, or you could use a volatile shared variable count, initialized to a large number and then reduced once the actual count is known.
for(i=0; i<count; i++) {
while(isnan(in[i]))
{
if(i>=count) break; // count volatile, reduced during input construction
_mm_pause();
}
if(i>=count) break; // count volatile, reduced during input construction
out[i] = exp(in[i]);
}
You can experiment with whether to use SwitchToThread(), _mm_pause(), Sleep(0), or...
Jim Dempsey
Quoting - jimdempseyatthecove
Jeff,
Can you multi-thread the routine?
One thread, all vectorized, producing the non-exp() portion of the results, and a second thread using the FPU producing the exp() portion of the results.
Using a buffer initialized to NaN's, the calculation portion (the one producing in in your code) overstrikes the NaN's as it progresses. The exp thread monitors the buffer (in) for the transition from NaN to a valid number. When the transition occurs, it produces out = exp(in).
for(i=0; i<count; i++) {
while(isnan(in[i])) _mm_pause();
out[i] = exp(in[i]);
}
Something like the above for the exp thread. Count could be the pre-known number of results, or you could use a volatile shared variable count, initialized to a large number and then reduced once the actual count is known.
for(i=0; i<count; i++) {
while(isnan(in[i]))
{
if(i>=count) break; // count volatile, reduced during input construction
_mm_pause();
}
if(i>=count) break; // count volatile, reduced during input construction
out[i] = exp(in[i]);
}
You can experiment with whether to use SwitchToThread(), _mm_pause(), Sleep(0), or...
Jim Dempsey
I think the preferred solution would be to have a small core set of math functions available as intrinsics that can take advantage of the SSE cases. I know that financial applications can also be heavily dependent on exp(), for instance. I think if someone were able to do a serious analysis of the transcendental functions used in High Performance Computing apps, they would find there is a handful of operations that are used very heavily. It's well worth it to the customer to have those available as intrinsics.
Can anybody comment on mixing SSE2 and FPU code? I tried to search the web but can't get a definite answer. Unlike MMX, SSE2 registers aren't shared with the FPU, but some state or something like that is shared. Can anybody say for sure whether mixing SSE2 and FPU code may be the reason for slowdowns?
When using multiple threads to produce the "in" (different portions of the array) you might be able to get by with n-1 threads producing the in and one thread producing the out=exp(in); (written to run through the separate portions of the output as produced by each of the other threads).
You could also consider breaking your one loop into two loops; however, the out=exp(in); loop will be memory-write intensive. Using a separate thread early (during production of in) would let the computation portion of producing the in values overlap with the memory stalls on the write side of out=exp(in);.
As an optimization, you could delay producing the out= until a complete cache line of in has been written. That way the over-striking of the NaN's will not encounter cache evictions. The loop performing out=exp(in); will have some seemingly unnecessary branching, but your interest is in fast code, not elegant code.
As an alternative, you could perform the out=exp(in); loop on every cache-line filling of in. I.e., if in is float (4 bytes), then compute 16 values of in, followed by 16 conversions out=exp(in);, then loop back for the next 16 (the last iteration may have fewer than 16 results). You may have to finesse the compiler to get it to vectorize due to the apparently short loop lengths, but that should not be too hard to do.
Jim Dempsey
Quoting - mtlroom
Can anybody comment on the mixing of sse2 and fpu code?.. I tried to search the web, but can't get definite answer. Unlike mmx, sse2 registers aren't shared with fpu, however some state or something like that is shared. Can anybody say for sure if mixing sse2 and fpu code may be the reason for slowdowns
Hi, I'm not sure that the fast version of exp() in the Intel Math Library (or MKL VML, for that matter) even uses the x87. If it does, I could see how that could complicate things.
Quoting - jeff_keasler
Hi, I'm not sure that the fast version of exp() in the Intel Math Library (or MKL VML, for that matter) even uses the x87. If it does, I could see how that could complicate things.
Well, just try to step through with the debugger in asm mode to see what it uses. Maybe it also uses the same registers as your code?