I expect 9.1 to improve on some of the issues you mention, particularly for 64-bit mode. It does make a difference which platforms you are interested in.
Did you try -O1, to reduce aggressiveness of unrolling in vectorized loops? Vectorized dot products are batched into 8 parallel sums at -O2.
Can you ensure that your vectors are 16-byte aligned, and that the compiler knows this?
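For readers unfamiliar with the batching mentioned above, here is a minimal hand-written sketch of a dot product that keeps 8 independent partial sums. It only illustrates the idea; it is not the code the compiler actually generates, and the name dot8 is made up for this example.

```c
#include <stddef.h>

/* Dot product with 8 independent partial sums, written out by hand.
   Summation order differs from a plain sequential loop, so results
   may differ in the last bits for general data. */
float dot8(const float *a, const float *b, size_t n)
{
    float acc[8] = {0};
    size_t i = 0;

    /* Main loop: 8 independent accumulators per trip. */
    for (; i + 8 <= n; i += 8)
        for (size_t k = 0; k < 8; ++k)
            acc[k] += a[i + k] * b[i + k];

    /* Remainder loop for the last n % 8 elements. */
    float tail = 0.0f;
    for (; i < n; ++i)
        tail += a[i] * b[i];

    /* Final reduction of the partial sums. */
    float sum = tail;
    for (size_t k = 0; k < 8; ++k)
        sum += acc[k];
    return sum;
}
```

The remainder loop and the final reduction are fixed overhead, which is one reason less aggressive unrolling can pay off on short vectors.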
Thanks for the reply. I am using the IA32 compiler on XP. Largest vectors that I have tested my system with are about 50K long, and are single precision floats. My development machine is a dual core PentiumD.
I take care to keep things aligned (I am assuming that the standard C++ new operator dishes out aligned arrays). I use option -Zp16 and take care when accessing memory allocated on the heap, so everything should be aligned to 16-byte boundaries. I don't tell the compiler about it though, because the code would become very messy if I started putting __decl... debris everywhere. I didn't find a compiler option to assume everything is aligned.
I was parallelizing almost all the loops with OpenMP, but after your advice, I only parallelize the loops that suck in data (i.e. the outermost ones). These typically have an iteration count of approx. 100K.
Most of the stuff works at a decent speed, now that I take care to replace scalars in the inner loops myself. I was just curious about when to prefer vectorization over OpenMP.
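A minimal sketch of that split, assuming a row-major matrix and a made-up function name scale_rows: the outer loop is handed to OpenMP, and the unit-stride inner loop is left for the vectorizer, with the scalar hoisted by hand. The data layout and the operation are placeholders, not the code discussed in this thread.

```c
#include <stddef.h>

/* Scale every row of a rows x cols matrix (row-major) in place.
   The outer loop is shared among threads when OpenMP is enabled;
   the inner loop is unit-stride and left for the vectorizer. */
void scale_rows(float *m, size_t rows, size_t cols, const float *factor)
{
    long r; /* signed loop index, which older OpenMP versions require */
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (r = 0; r < (long)rows; ++r) {
        float *row = m + (size_t)r * cols;
        float f = factor[r]; /* scalar hoisted out of the inner loop */
        for (size_t c = 0; c < cols; ++c)
            row[c] *= f;
    }
}
```

Without OpenMP enabled at compile time the pragma is skipped and the function runs serially with the same result.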
One thing that I would like clarified is why the vectorizer does not like this: for(i = 0;...) x[i] = tanhf(y[i])
where tanhf() is coming from the Intel math library? I remember reading somewhere that Intel have vector forms of these simple math functions. What do I need to do to access them?
Discovered another OpenMP bug today: it doesn't compile with shared clause on orphaned 'for' directives, though there is example code in help that does this.
Thanks
#define tanhf(x) (1 - 2/(expf((x)*2) + 1))
to see if it's worth anything in performance.
I agree about the dilemma posed by the performance advantage of alignment directives, and the fact that the gcc version of them isn't accepted by ICL. We persuaded the compiler team that Fortran should align all potentially vectorizable arrays by default, rather than supporting such directives.
I think it's unfortunate that 32-bit tradition apparently dictates that standard new() and malloc() don't assure alignment. Thus the provision of aligned_malloc() and the like.
If your loops have few enough operands, and are long enough, the overhead for taking care of various cases of alignment is tolerable.
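As a portable stand-in for the aligned allocators mentioned above, here is a minimal sketch using the C11 function aligned_alloc(). It is not the Intel or Microsoft routine, and the helper names alloc_floats_16 and is_aligned_16 are made up for this example.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate n floats on a 16-byte boundary using C11 aligned_alloc().
   The byte count is rounded up to a multiple of 16, as C11 requires. */
float *alloc_floats_16(size_t n)
{
    size_t bytes = n * sizeof(float);
    bytes = (bytes + 15u) & ~(size_t)15u;
    if (bytes == 0)
        bytes = 16;
    return (float *)aligned_alloc(16, bytes);
}

/* Nonzero when p sits on a 16-byte boundary. */
int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

Memory obtained this way is released with free(). The allocation alone does not tell the compiler anything; it still has to be informed of the alignment separately.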
Dear Perceptron,
Could you please elaborate on the problems you encountered with vectorizing the tanh() function? I see straightforward vectorization of this code (with the 7.x, 8.x, and 9.x compilers).
float x[100], y[100];
.
for (i = 0; i < 100; i++)
  x[i] = tanhf(y[i]);
joho.c(7) : (col. 3) remark: LOOP WAS VECTORIZED.
Aart Bik
http://www.aartbik.com/
Tim,
No such function! The appropriate function is called _vmlsTanh4(), with an "s" for single-precision. The function _vmldTanh2() would be used to vectorize a double-precision version of the loop. In any case, I thought I had already made it clear in my previous message that this loop should vectorize, so I would like to know what problems the customer encountered.
Aart Bik
http://www.aartbik.com/
Message Edited by abik on 01-25-2006 09:22 AM
Sorry for the delay - I had not seen the board for a while. Here is the compiler report (/O3 /Qip /Qvec-report3) on IA32:
ActivationFunctions.inl(61) : (col. 3) remark: loop was not vectorized: contains unvectorizable statement at line 62
I get this error for all of the loops where the code (containing tanh, exp, 1/(1 + exp())) is in either of these 2 forms:
//in-place
void ApplyForward(size_t I, real *z) const
{
  for (size_t i = 0; i < I; ++i)
    z[i] = tanh(z[i]);
}
// propagators
void ApplyForward(size_t I, const real *a, real *z) const
{
  for (size_t i = 0; i < I; ++i)
    z[i] = tanh(a[i]);
}
real is a typedef for either float or double and depending on what real is, tanh is #defined to be tanhf or tanh (see below). I have only worked with single precision stuff so far (USE_DOUBLE not defined).
Thanks
P
#ifdef __INTEL_COMPILER
#include <mathimf.h>
#ifndef USE_DOUBLE
#define tanh(x) tanhf(x)
#define sinh(x) sinhf(x)
#define cosh(x) coshf(x)
#define log(x) logf(x)
#define exp(x) expf(x)
.
.
.
#endif
#else // __INTEL_COMPILER
#include
#define isnan(x) _isnan(x)
#endif
In your case
z[i] = tanhf(a[i]);
you may also need to declare the arguments with restrict:
float *restrict z, float *restrict a
(with icpc option -restrict, which allows C99 restrict compatibility)
That's another can of worms, as to how various compilers use const and restrict to facilitate optimizations such as vectorization.
Dear Perceptron,
Your code was very hard to read and was missing a lot of essential parts, but I suspect you have something like this:
#include
#ifndef USE_DOUBLE
typedef float real;
#define tanh(x) tanhf(x)
#endif
//in-place
void ApplyForward(size_t I, real *z){
  for (size_t i = 0; i < I; ++i)
    z[i] = tanh(z[i]);
}
// propagators
void ApplyForward(size_t I, const real *a, real *z) {
  for (size_t i = 0; i < I; ++i)
    z[i] = tanh(a[i]);
}
But even then, I see no problems with vectorization whatsoever (going back all the way to version 7.1 of our compilers):
joho.cpp(12) : (col. 3) remark: LOOP WAS VECTORIZED.
joho.cpp(18) : (col. 3) remark: LOOP WAS VECTORIZED.
A few comments:
(1) Tim is right, even though const restricts certain forms of assignments, it has no impact on data dependence analysis (I just wrote a few paragraphs on that in the upcoming second edition of the Software Optimization Cookbook, see http://www.intel.com/intelpress/sum_swcb2.htm). Both loops vectorize by default, but you can avoid a runtime overlap test between a and z for the second using restrict or #pragma ivdep before the loop, making the second function slightly more efficient.
(2) Since the alignment of the data pointed to is not known in this context, adding a __assume_aligned() or #pragma vector aligned before both loops avoids runtime peeling for alignment, making both functions slightly more efficient.
(3) Obviously, we still don't know why your code does not vectorize. Feel free to email me the source directly (aart.bik@intel.com) if you want me to investigate this further. You may also want to read online vectorization guidelines at http://www.intel.com/cd/ids/developer/asmo-na/eng/65774.htm or a more detailed description in the Software Vectorization Handbook at http://www.intel.com/intelpress/sum_vmmx.htm.
Hope this helps.
Aart Bik
http://www.aartbik.com/
Sorry about the formatting - the website screwed up my copy-pasted code.
I am including mathimf because the documentation states "To use the Intel math library, include the header file, mathimf.h, in your program".
I don't think that using restrict will help, since the vectorizer reports that the statement containing the tanhf() call is not vectorizable. Also, it would pollute the code with arcana.
Maybe this can shed some light: I am linking my executable with a 3rd party static library, which has been compiled with the Intel compiler, but I don't know if they used mathimf. Do you think this may confuse the compiler? It is a possibility if the vectorization happens at link time, as I have ensured that the headers that I include don't make any reference to standard headers.
Thanks