I have a problem when I compile with icc, whereas it works with gcc. I only use one function, dsyev. With icc the program returns a segmentation fault. I wonder if this comes from my compilation options:
icc -o Dsyev Dsyev.c -I${DIR_EVD} -L${MKLROOT} -I${INCLUDE} -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lrt -lm
gcc -o Dsyev Dsyev.c -I${DIR_EVD} -L${MKLROOT} -I${INCLUDE} -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lrt -lm
I had no such problem when using icc with non-MKL functions.
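For reference, here is a stripped-down version of the kind of call I'm making (illustrative matrix and sizes, not my actual code). It follows the standard workspace-query pattern and assumes the LP64 (32-bit MKL_INT) interface matching -lmkl_intel_lp64:

```c
/* Minimal dsyev example: workspace query, then the actual call. */
#include <stdio.h>
#include <mkl.h>

int main(void)
{
    MKL_INT n = 4, lda = 4, lwork = -1, info;
    /* Small symmetric test matrix, column-major (illustrative values). */
    double a[16] = { 4.0, 1.0, 1.0, 1.0,
                     1.0, 3.0, 1.0, 1.0,
                     1.0, 1.0, 2.0, 1.0,
                     1.0, 1.0, 1.0, 1.0 };
    double w[4], wkopt, *work;
    int i;

    /* Workspace query: lwork = -1 returns the optimal size in wkopt. */
    dsyev("V", "U", &n, a, &lda, w, &wkopt, &lwork, &info);

    lwork = (MKL_INT)wkopt;
    work = (double *)mkl_malloc((size_t)lwork * sizeof(double), 64);

    dsyev("V", "U", &n, a, &lda, w, work, &lwork, &info);
    if (info != 0)
        fprintf(stderr, "dsyev failed, info = %d\n", (int)info);

    for (i = 0; i < (int)n; ++i)
        printf("lambda[%d] = %f\n", i, w[i]);

    mkl_free(work);
    return 0;
}
```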
I noticed that the execution time of my program is no different between the Intel compiler and gcc when I use Intel MKL functions. Are these functions already using SSE? (I'm new to icc and Intel MKL.)
My program runs very fast (I use one heev function) using one core of one of my 2 CPUs. Since I work on small matrices, in my old program (without MKL) I placed one heev on each of the 8 cores with pragma sections and got a speedup of about 6.5.
Now I'd like to do the same with MKL, but something must be wrong in my compile options. When I compile with:
icc -openmp -o heev_omp main_8_complex_openmp.c -I${DIR_EVD} -L${MKLROOT} -I${INCLUDE} -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_lapack -lrt -lm
my program uses all the cores, but execution time is 20% slower. I tried all the linking examples given in the MKL user's guide, but none worked out. I tried setting MKL_NUM_THREADS to 1 with OMP_NUM_THREADS still at 8, so that each function call uses 1 thread, but again without result.
MKL is still running 2 times slower on 1 core than my old program on 8 cores.
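To be clear, this is the setup I'm trying to get (a stripped-down sketch, not my actual code): MKL itself single-threaded, and OpenMP spreading the independent calls over the 8 cores.

```c
/* Sketch of the intended threading setup. With -lmkl_sequential, MKL is
 * single-threaded anyway, so only the OpenMP setting should matter for
 * the "one heev per core" layout. */
#include <stdio.h>
#include <omp.h>
#include <mkl.h>

int main(void)
{
    mkl_set_num_threads(1);   /* same effect as MKL_NUM_THREADS=1 */
    omp_set_num_threads(8);   /* same effect as OMP_NUM_THREADS=8 */

    #pragma omp parallel
    {
        #pragma omp single
        printf("OpenMP threads: %d, MKL threads per call: %d\n",
               omp_get_num_threads(), mkl_get_max_threads());
    }
    return 0;
}
```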
I can't understand why. When I look at the execution times, with 1 core my 8 heevs (128x128) are done in 29 ms, so about 3.5 ms per heev. This time is quite stable when I put those 8 heevs in a loop. When using 8 cores, the time for 8 heevs, 1 on each core, is about 40 ms, but it's not stable at all: sometimes it's 7 ms (less than 1 ms per heev) and sometimes 60 ms. That's why I think there's a problem.
Before using MKL, I had my own heev code based on LAPACK functions. 8 heevs took 280 ms running on 1 core, but when placing 1 heev on each thread with OpenMP, the time was 44 ms (the 6.35x speedup).
Using the Intel compiler gave a 3.6x speedup on both the single-threaded and the OpenMP program. So right now I'm running FASTER with my old program + icc than with Intel MKL.
I tried KMP_AFFINITY=granularity=fine,compact in my environment variables, but nothing changed.
It's too bad if I can't use all the cores. We're going to receive a new computer with 2 Xeon X5670s, and I won't be able to use the 12 physical cores :(
nostradamus,
Performance depends on the placement of the input and output arrays in the cache. By default, compilers conserve memory and don't align arrays to the cache-line boundary (64 bytes). However, you can do it yourself easily. If you allocate the arrays at runtime, use mkl_malloc(N, 64) to request memory. If the arrays are local variables, put __declspec(align(64)) before the declarations.
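For example (array names and sizes are arbitrary, just to show the two options):

```c
#include <mkl.h>

#define N 128

/* Aligning the declaration itself (works for local or static arrays).
 * __declspec(align(64)) is the spelling the Intel compiler uses on
 * Windows; with icc/gcc on Linux the equivalent is
 * __attribute__((aligned(64))). */
__attribute__((aligned(64))) static double b[N * N];

void alignment_example(void)
{
    /* Heap allocation aligned to a 64-byte cache line. */
    double *a = (double *)mkl_malloc(N * N * sizeof(double), 64);

    /* ... fill a, call dsyev, ... */

    mkl_free(a);
}
```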
Performance also depends on the presence of the input array in the cache of each CPU core. To stabilize the state of the cache from run to run, you may try calling dsyev once on each CPU core before you start the actual timing, though this is expected to have less effect since you mentioned that you already call dsyev in a loop.
If you think the problem is specifically with LAPACK performance, rather than with performance stability in general, then please post a reproducer here...
That was my mistake!! I created 8 work arrays, but I used the same array in each function call, so every heev was trying to write into the same array, and that caused the bad timings.
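Roughly, the corrected layout looks like this (not my exact code, just the shape of the fix: one workspace per matrix/thread):

```c
/* Each of the 8 independent zheev calls gets its own work/rwork buffers
 * instead of all of them writing into one shared workspace. */
#include <omp.h>
#include <mkl.h>

#define NMAT  8                 /* one matrix per thread         */
#define N     128
#define LWORK (2 * N)           /* zheev needs lwork >= 2*N - 1  */

static MKL_Complex16 a[NMAT][N * N];      /* the 8 Hermitian matrices */
static double        w[NMAT][N];          /* eigenvalues              */
static MKL_Complex16 work[NMAT][LWORK];   /* one workspace per matrix */
static double        rwork[NMAT][3 * N];  /* zheev needs >= 3*N - 2   */

void solve_all(void)
{
    int i;

    #pragma omp parallel for
    for (i = 0; i < NMAT; ++i) {
        MKL_INT n = N, lda = N, lwork = LWORK, info;
        zheev("V", "U", &n, a[i], &lda, w[i], work[i], &lwork,
              rwork[i], &info);
    }
}
```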
Once corrected, I got 208 ms for 180 zheevs, which is even better than I expected.
What you wrote is very interesting; I'm going to try it to see the impact on performance.
Thank you :)))
