Intel® oneAPI Math Kernel Library

Problem when compiling with icc

nostradamus
Beginner
Hello,

I have a problem when I compile with icc, whereas it works with gcc. I only use one function, dsyev. With icc the program returns a segmentation fault. I wonder if this comes from my compilation options:

icc -o Dsyev Dsyev.c -I${DIR_EVD} -L${MKLROOT} -I${INCLUDE} -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lrt -lm

gcc -o Dsyev Dsyev.c -I${DIR_EVD} -L${MKLROOT} -I${INCLUDE} -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lrt -lm

I had no such problem when using icc with non-MKL functions.
8 Replies
TimP
Honored Contributor III
Many things could happen. icc defaults to -O, which resembles gcc -O3 -ffast-math -fno-cx-limited-range (making more demands on the robustness of your code), while gcc defaults to -O0. As a consequence, the icc build is likely to require more stack, so you may need to raise the stack limit in your shell.
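
For example, in bash the limit can be raised for the current shell before running the program (a minimal sketch; a fixed size in kilobytes can be used instead of "unlimited"):

ulimit -s unlimited   # raise the stack size limit for this shell session
./Dsyev               # run the program from the same shell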
nostradamus
Beginner
Thank you very much, it works now.

I noticed that the execution time of my program is no different between the Intel compiler and gcc when I use Intel MKL functions. Are these functions already using SSE? (I'm new to icc and Intel MKL.)
TimP
Honored Contributor III
Performance of the MKL functions should be the same when called from code compiled by either compiler: the library ships as pre-built binaries that dispatch CPU-specific (e.g. SSE) kernels at run time, so the compiler used for the calling code does not affect them.
nostradamus
Beginner
Thank you. I have one last question about compiling:

My program runs very fast (I use one heev function) using one core of one of my 2 CPUs. As I work on small matrices, in my old program (without MKL) I assigned one heev to each of the 8 cores with pragma sections, and got a speedup of about 6.5.

Now I'd like to do the same with MKL, but something must be wrong in my compile options. When I compile with:

icc -openmp -o heev_omp main_8_complex_openmp.c -I${DIR_EVD} -L${MKLROOT} -I${INCLUDE} -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_lapack -lrt -lm

my program uses all cores, but execution time is 20% slower. I tried all the link lines given in the MKL User's Guide, but none worked out. I tried setting MKL_NUM_THREADS to 1 with OMP_NUM_THREADS still at 8, so that each function call uses 1 thread, but again without result.

Still, MKL runs 2 times slower on 1 core than my old program on 8 cores.
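
For reference, a minimal sketch of how those variables would be set in bash:

export OMP_NUM_THREADS=8   # threads for the program's own OpenMP sections
export MKL_NUM_THREADS=1   # threads inside each MKL call

Since the link line above uses -lmkl_sequential, each MKL call is single-threaded in any case, so MKL_NUM_THREADS should have no effect in that build.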
Gennady_F_Intel
Moderator
nostradamus,
I'd guess that's because you are working with small matrix sizes.
--Gennady
nostradamus
Beginner
Hello Gennady,

I can't understand why. When I look at the execution times, with 1 core my 8 heev calls (128x128) are done in 29 ms, so about 3.5 ms per heev. This time is quite stable when I put those 8 heev calls in a loop. When using 8 cores, the time for 8 heev calls, 1 on each core, is about 40 ms, but it's not stable at all: sometimes it's 7 ms (less than 1 ms per heev) and sometimes 60 ms. That's why I think there's a problem.

Before using MKL, I had my own heev code based on LAPACK functions. 8 heev calls took 280 ms running on 1 core, but when assigning 1 heev to 1 thread with OpenMP, the time is 44 ms (the 6.35 speedup).

Using the Intel compiler gave a 3.6x speedup on both the single-threaded and the OpenMP program. So right now, I'm running FASTER with my old program + icc than with Intel MKL.

I tried KMP_AFFINITY=granularity=fine,compact in my environment variables, but no change.

It's too bad if I can't use all the cores. We're going to receive a new computer with 2 Xeon X5670 CPUs, and I won't be able to use the 12 physical cores :(

Evgueni_P_Intel
Employee

nostradamus,

Performance depends on the placement of the input and output arrays in the cache. By default, compilers conserve memory and don't align arrays to the cache line boundary (64 bytes). However, you can do it yourself easily. If you allocate the arrays at run time, use mkl_malloc(N, 64) to request memory. If the arrays are local variables, put __declspec(align(64)) before the declarations.
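
A minimal sketch of both options (the 128x128 double array is just an example size; with gcc, __attribute__((aligned(64))) is the equivalent of the icc __declspec):

#include <mkl.h>   /* mkl_malloc, mkl_free */

void alignment_examples(void)
{
    /* heap array: 64-byte-aligned allocation through MKL */
    double *a = mkl_malloc(128 * 128 * sizeof(double), 64);
    /* ... use a as input/output of dsyev ... */
    mkl_free(a);

    /* local array: compiler-specific alignment attribute (icc) */
    __declspec(align(64)) double b[128 * 128];
    (void)b;   /* placeholder so the sketch compiles without warnings */
}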

Performance also depends on the presence of the input array in the cache of each CPU core. To stabilize the state of the cache from run to run, you may try to call dsyev once on each CPU core before you start the actual timing. This is expected to have less effect, though, since you mentioned that you already call dsyev in a loop.

If you think there's a problem with LAPACK performance specifically, rather than with performance stability in general, then please post a reproducer here.

nostradamus
Beginner
Hello Evgueni,

That was my mistake!! I created 8 work arrays, but I used the same array in each function call, so every heev was trying to write into the same array, and that caused the bad timings.
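
For reference, a minimal sketch of the corrected pattern (the sizes and names are hypothetical, not my original code): each loop iteration gets its own work/rwork buffers, so the parallel zheev calls never share scratch space.

#include <mkl.h>   /* zheev, mkl_malloc, mkl_free, MKL_Complex16 */

#define NMAT 8     /* number of independent eigenproblems (example) */
#define N    128   /* matrix order, as in the timings above */

void eig_all(MKL_Complex16 *a[NMAT], double *w[NMAT])
{
    MKL_INT n = N, lda = N, lwork = 2 * N;

    #pragma omp parallel for
    for (int i = 0; i < NMAT; i++) {
        /* private scratch buffers: nothing is shared between iterations */
        MKL_Complex16 *work  = mkl_malloc(lwork * sizeof(MKL_Complex16), 64);
        double        *rwork = mkl_malloc((3 * N - 2) * sizeof(double), 64);
        MKL_INT info;
        zheev("V", "U", &n, a[i], &lda, w[i], work, &lwork, rwork, &info);
        /* info could be checked here; 0 means success */
        mkl_free(work);
        mkl_free(rwork);
    }
}

This compiles with the -openmp flag and the sequential MKL libraries, as in the link line above.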

Once corrected, I got 208 ms for 180 zheev calls, which is even better than I expected.

What you wrote is very interesting; I'm going to try it to see the impact on performance.

Thank you :)))