OpenMP guidance

perceptron · ‎01-17-2006

Hi,

I am developing a neural network package using Intel C++ compiler 9.0. The code is so parallel that it is a no brainer to use OpenMP. The problem is to know when not to use it.

Most of my code is vector operations (dot product, vector add, scaling etc.). What I am looking for is some guidance as to when it becomes detrimental to parallelize - for example, it is probably worth parallelizing A.B if dimensionality of vectors A & B is 10^6. But should I parallelize such loops when I expect the typical dimensionality to be 100 or 1000 or 10000?

I would appreciate it if anyone can provide guidance on this.

Btw, I noticed (after spending 2 days tearing my hair out :) that Intel optimizer (/O3 /Qip) does not perform scalar replacement in loops nested inside parallelized loops - it had really slowed my application down.

Also, is it to be expected that parallelized loops are not vectorized? I would have thought it should be possible with static scheduling at least.

Many thanks

P

TimP · ‎01-17-2006

I spent half an hour trying to answer part of your questions, then the site discarded my answer, which was no doubt longer than you wanted to read.
I expect 9.1 to improve on some of the issues you mention, particularly for 64-bit mode. It does make a difference which platforms you are interested in.
Did you try -O1, to reduce aggressiveness of unrolling in vectorized loops? Vectorized dot products are batched into 8 parallel sums at -O2.
Can you assure that your vectors are 16-byte aligned, and that the compiler knows this?

TimP · ‎01-17-2006

I suppose OpenMP parallelization is most efficient when the data sets allocated to the threads approach a full page size apart (4KB for Xeon). Parallel vectorized code is likely to use the full capacity of the memory system, and you would like to avoid duplicate DTLB and cache filling between the threads.
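
One possible reading of the page-size remark is that a loop is worth splitting only when each thread's contiguous share of the data covers at least one page. The sketch below encodes that reading; `share_spans_page`, `scale` and the 4 KB constant are hypothetical names and values for illustration, and `threads` is assumed to be at least 1.

```cpp
#include <cstddef>

// Assumption: 4 KB pages, the figure quoted above for Xeon.
const std::size_t kPageBytes = 4096;

// True when each thread's contiguous share covers at least one page.
bool share_spans_page(std::size_t n, std::size_t elem_bytes, int threads) {
    if (threads < 1) return false;
    return (n / static_cast<std::size_t>(threads)) * elem_bytes >= kPageBytes;
}

// Scales x in place; goes parallel only when the per-thread share is page-sized.
void scale(float* x, long n, float s, int threads) {
    #pragma omp parallel for schedule(static) num_threads(threads) \
        if(share_spans_page(static_cast<std::size_t>(n), sizeof(float), threads))
    for (long i = 0; i < n; ++i)
        x[i] *= s;
}
```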

perceptron · ‎01-17-2006

Hi Tim,

Thanks for the reply. I am using the IA32 compiler on XP. Largest vectors that I have tested my system with are about 50K long, and are single precision floats. My development machine is a dual core PentiumD.
I take care to keep things aligned (I am assuming that the standard C++ new operator dishes out aligned arrays). I use option -Zp16 and take care while accessing memory allocated on the heap, so everything should be aligned to 16-byte boundaries. I don't tell the compiler about it though, because the code would become very messy if I started putting __decl... debris everywhere. I didn't find a compiler option to assume everything is aligned.
I was parallelizing almost all the loops with OpenMP, but after your advice, I only parallelize the loops that suck in data (i.e. the outermost ones). These typically have an iteration count of approx. 100K.
Most of the stuff works at a decent speed, now that I take care to replace scalars in the inner loops myself. I was just curious about when to prefer vectorization over OpenMP.
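
The arrangement described here, with OpenMP on the outermost loop and the inner loop left to the vectorizer with scalars replaced by hand, might look roughly like the sketch below. The function `matvec` and its row-major layout are assumptions for illustration, not code from the package under discussion.

```cpp
#include <cstddef>

// y = W * x, where W is rows x cols in row-major order (hypothetical kernel).
void matvec(const float* W, const float* x, float* y, long rows, long cols) {
    // Threads split the outer loop; each row is handled by exactly one thread.
    #pragma omp parallel for schedule(static)
    for (long r = 0; r < rows; ++r) {
        const float* row = W + r * cols;
        float acc = 0.0f;                // manual scalar replacement: local accumulator
        for (long c = 0; c < cols; ++c)  // inner loop is a candidate for vectorization
            acc += row[c] * x[c];
        y[r] = acc;                      // one store per row instead of one per iteration
    }
}
```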

One thing that I would like clarified is why the vectorizer does not like this: for(i = 0;...) x[i] = tanhf(y[i])
where tanhf() is coming from the Intel math library? I remember reading somewhere that Intel have vector forms of these simple math functions. What do I need to do to access them?

Discovered another OpenMP bug today: it doesn't compile with shared clause on orphaned 'for' directives, though there is example code in help that does this.
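
For context, the OpenMP specification's clause list for the `for` worksharing directive (private, firstprivate, lastprivate, reduction, ordered, schedule, nowait) does not appear to include `shared`, so a compiler rejecting it there may be following the specification. A minimal sketch of the usual alternative, with the sharing stated on the enclosing `parallel` region, is shown below; the function names are hypothetical.

```cpp
// Orphaned worksharing loop: it binds to whichever parallel region calls it.
void add_one(float* v, int n) {
    #pragma omp for
    for (int i = 0; i < n; ++i)
        v[i] += 1.0f;
}

// The data-sharing attributes are given on the parallel directive instead.
void add_one_parallel(float* v, int n) {
    #pragma omp parallel shared(v, n)
    {
        add_one(v, n);
    }
}
```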

Thanks

TimP · ‎01-17-2006

You can check the contents of libsvml yourself to see which functions are there. Apparently, it's more difficult to write vector versions of sinh() and tanh() than it is for the inverse functions. Among the more evident difficulties is the need to use a method to preserve accuracy for small arguments. If you don't care about that, you can use macro replacement:
#define tanhf(x) (1 - 2/(expf((x)*2) + 1))
to see if it's worth anything in performance.
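
To see how much accuracy the macro gives up, the same formula can be written as a function and compared against the library tanh. The helper names below are hypothetical. For very small arguments the exponential rounds to 1 in single precision, so the formula can lose most or all significant digits, which is the small-argument difficulty mentioned above.

```cpp
#include <cmath>

// Same formula as the macro above, written as a function.
inline float tanh_via_exp(float x) {
    return 1.0f - 2.0f / (std::exp(2.0f * x) + 1.0f);
}

// Absolute difference between the formula and the library tanh.
inline float tanh_abs_error(float x) {
    return std::fabs(tanh_via_exp(x) - std::tanh(x));
}
```

Comparing `tanh_via_exp(x)` with `std::tanh(x)` over the range of activations the network actually produces shows whether the loss matters in practice.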

I agree about the dilemma posed by the performance advantage of alignment directives, and the fact that the gcc version of them isn't accepted by ICL. We persuaded the compiler team that Fortran should align all potentially vectorizable arrays by default, rather than supporting such directives.
I think it's unfortunate that 32-bit tradition apparently dictates that standard new() and malloc() don't assure alignment. Thus the provision of aligned_malloc() and the like.
If your loops have few enough operands, and are long enough, the overhead for taking care of various cases of alignment is tolerable.
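
Where the allocator's alignment cannot be relied on, one portable option is to over-allocate and align inside the buffer. The sketch below uses only the standard library (`std::align`, C++11); `AlignedFloats` and `is_aligned16` are hypothetical helpers for illustration, and compiler-provided aligned allocators would serve the same purpose.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Owns raw storage and exposes a 16-byte-aligned view of n floats.
class AlignedFloats {
public:
    explicit AlignedFloats(std::size_t n)
        : storage_(n * sizeof(float) + 15), data_(nullptr) {
        void* p = storage_.data();
        std::size_t space = storage_.size();
        // std::align advances p to the next 16-byte boundary that still fits.
        data_ = static_cast<float*>(std::align(16, n * sizeof(float), p, space));
    }
    // Copying would leave data_ pointing into the other object's storage.
    AlignedFloats(const AlignedFloats&) = delete;
    AlignedFloats& operator=(const AlignedFloats&) = delete;

    float* data() { return data_; }

private:
    std::vector<unsigned char> storage_;  // 15 spare bytes leave room to align
    float* data_;
};

// Checks whether a pointer sits on a 16-byte boundary.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

A check such as `is_aligned16(ptr)` on pointers returned by plain `new` is also a quick way to test the assumption that the default allocator hands out aligned arrays.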

Intel_C_Intel · ‎01-25-2006

Dear Perceptron,
Could you please elaborate on the problems you encountered with vectorizing the tanh() function? I see straightforward vectorization of this code (for the 7.x, 8.x and 9.x compilers).

float x[100], y[100];
.
for (i = 0; i < 100; i++)
  x[i] = tanhf(y[i]);

joho.c(7) : (col. 3) remark: LOOP WAS VECTORIZED.

Aart Bik
http://www.aartbik.com/

TimP · ‎01-25-2006

libsvml does include the function vmldTanh4, so the compiler vectorizer should invoke that automatically in appropriate situations.

Intel_C_Intel · ‎01-25-2006

Tim,

No such function! The appropriate function is called _vmlsTanh4(), with an "s" for single-precision. The function _vmldTanh2() would be used to vectorize a double-precision version of the loop. In any case, I thought I had already made it clear in my previous message that this loop should vectorize, so I would like to know what problems the customer encountered.

Aart Bik
http://www.aartbik.com/

Message Edited by abik on 01-25-2006 09:22 AM

perceptron · ‎01-26-2006

Hi tim & abik,

Sorry for the delay - I had not seen the board for a while. Here is the compiler report (/O3 /Qip /Qvec-report3) on IA32:

ActivationFunctions.inl(61) : (col. 3) remark: loop was not vectorized: contains unvectorizable statement at line 62

I get this remark for all of the loops where the code (containing tanh, exp, 1/(1 + exp())) is in either of these 2 forms:

//in-place
void ApplyForward(size_t I, real *z) const
{
    for (size_t i = 0; i < I; ++i)
        z[i] = tanh(z[i]);
}

// propagators
void ApplyForward(size_t I, const real *a, real *z) const
{
    for (size_t i = 0; i < I; ++i)
        z[i] = tanh(a[i]);
}

real is a typedef for either float or double and depending on what real is, tanh is #defined to be tanhf or tanh (see below). I have only worked with single precision stuff so far (USE_DOUBLE not defined).

Thanks

P

#ifdef __INTEL_COMPILER
#include <mathimf.h>
#ifndef USE_DOUBLE
#define tanh(x) tanhf(x)
#define sinh(x) sinhf(x)
#define cosh(x) coshf(x)
#define log(x) logf(x)
#define exp(x) expf(x)
.
.
.
#endif
#else // __INTEL_COMPILER
#include
#define isnan(x) _isnan(x)
#endif

TimP · ‎01-26-2006

As a general rule, you should check the pre-processed code yourself, if you don't want to show it in a readable form. The general rule for bug reports (for gcc as well as icc) is to show pre-processed source code. mathimf.h may be intended to prevent vectorization; you will notice that Aart didn't put it in his example.
In your case
z[i] = tanhf(a[i]);
you may also need to declare the arguments with restrict:
float *restrict z, float *restrict a
(with icpc option -restrict, which allows C99 restrict compatibility)

That's another can of worms, as to how various compilers use const and restrict to facilitate optimizations such as vectorization.
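
As a rough sketch of what the restrict-qualified signature could look like in C++: the version below uses `__restrict`, a compiler extension accepted by GCC, Clang, MSVC and the Intel compilers without an extra option, rather than the C99 `restrict` spelling enabled by -restrict. The function name is hypothetical, and whether a given compiler version then vectorizes the loop is not something this sketch can promise.

```cpp
#include <cmath>
#include <cstddef>

// __restrict tells the compiler that a and z never overlap, so it does not
// need to assume a store to z[i] can change a later a[j].
// Standard C++ has no restrict keyword; this spelling is an extension.
void apply_forward(std::size_t n, const float* __restrict a, float* __restrict z) {
    for (std::size_t i = 0; i < n; ++i)
        z[i] = std::tanh(a[i]);  // float overload from <cmath>
}
```

Calling this with overlapping `a` and `z` breaks the promise made by `__restrict`, so the in-place variant needs its own function.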

Intel_C_Intel · ‎01-26-2006

Dear Perceptron,

Your code was very hard to read and was missing a lot of essential parts, but I suspect you have something like this:

cat joho.cpp

#include
#include

#ifndef USE_DOUBLE
typedef float real;
#define tanh(x) tanhf(x)
#endif

//in-place
void ApplyForward(size_t I, real *z){
  for (size_t i = 0; i < I; ++i)
    z[i] = tanh(z[i]);
}

// propagators
void ApplyForward(size_t I, const real *a, real *z) {
  for (size_t i = 0; i < I; ++i)
    z[i] = tanh(a[i]);
}

But even then, I see no problems with vectorization whatsoever (going back all the way to version 7.1 of our compilers):

icl -QxP joho.cpp

joho.cpp(12) : (col. 3) remark: LOOP WAS VECTORIZED.
joho.cpp(18) : (col. 3) remark: LOOP WAS VECTORIZED.

A few comments:

(1) Tim is right, even though const restricts certain forms of assignments, it has no impact on data dependence analysis (I just wrote a few paragraphs on that in the upcoming second edition of the Software Optimization Cookbook, see http://www.intel.com/intelpress/sum_swcb2.htm). Both loops vectorize by default, but you can avoid a runtime overlap test between a and z for the second using restrict or #pragma ivdep before the loop, making the second function slightly more efficient.

(2) Since the alignment of the data pointed to is not known in this context, adding __assume_aligned() or #pragma vector aligned before both loops avoids runtime peeling for alignment, making both functions slightly more efficient.

(3) Obviously, we still don't know why your code does not vectorize. Feel free to email me the source directly (aart.bik@intel.com) if you want me to investigate this further. You may also want to read online vectorization guidelines at http://www.intel.com/cd/ids/developer/asmo-na/eng/65774.htm or a more detailed description in the Software Vectorization Handbook at http://www.intel.com/intelpress/sum_vmmx.htm.
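
The hints from comments (1) and (2) could be combined roughly as in the sketch below. The function name is hypothetical, the Intel-specific lines are guarded so that other compilers simply skip them, and the caller is assumed to guarantee both that the pointers really are 16-byte aligned and that the arrays do not overlap.

```cpp
#include <cstddef>
#include <math.h>

// Propagator with Intel-specific hints; plain loop on other compilers.
void apply_forward_hinted(std::size_t n, const float* a, float* z) {
#ifdef __INTEL_COMPILER
    // Promise 16-byte alignment so no runtime peeling is needed.
    __assume_aligned(a, 16);
    __assume_aligned(z, 16);
    // Promise no loop-carried dependence, so no runtime overlap test is needed.
    #pragma ivdep
#endif
    for (std::size_t i = 0; i < n; ++i)
        z[i] = tanhf(a[i]);
}
```

Both promises are unchecked: passing unaligned or overlapping arrays makes the hints false, and the results are then undefined.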

Hope this helps.

Aart Bik
http://www.aartbik.com/

perceptron · ‎01-26-2006

Hi,

Sorry about the formatting - the website screwed up my copy-pasted code.
I am including mathimf because the documentation states "To use the Intel math library, include the header file, mathimf.h, in your program".

I don't think that using restrict will help, since the vectorizer reports that the statement containing the tanhf() call is not vectorizable. Also, it would pollute the code with arcana.

Maybe this can shed some light: I am linking my executable with a 3rd party static library, which has been compiled with the Intel compiler, but I don't know if they used mathimf. Do you think this may confuse the compiler? It is a possibility if the vectorization happens at link time, as I have ensured that the headers that I include don't make any reference to standard headers.

Thanks

TimP · ‎01-26-2006

Automatic vectorization occurs only at compile time. You could examine a library to determine whether it contains those svml math function calls which come from auto-vectorization.