I have a problem when I compile with icc, whereas it works with gcc. I only use one function, dsyev. With icc the program returns a segmentation fault. I wonder if this comes from my compilation options:
icc -o Dsyev Dsyev.c -I${DIR_EVD} -L${MKLROOT} -I${INCLUDE} -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lrt -lm
gcc -o Dsyev Dsyev.c -I${DIR_EVD} -L${MKLROOT} -I${INCLUDE} -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lrt -lm
I had no such problem when using icc with non-MKL functions.
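For reference, here is a stripped-down version of the kind of call I'm making (illustrative matrix and sizes, not my actual code). It follows the standard workspace-query pattern and assumes the LP64 (32-bit MKL_INT) interface matching -lmkl_intel_lp64:

```c
/* Minimal dsyev example: workspace query, then the actual call. */
#include <stdio.h>
#include <mkl.h>

int main(void)
{
    MKL_INT n = 4, lda = 4, lwork = -1, info;
    /* Small symmetric test matrix, column-major (illustrative values). */
    double a[16] = { 4.0, 1.0, 1.0, 1.0,
                     1.0, 3.0, 1.0, 1.0,
                     1.0, 1.0, 2.0, 1.0,
                     1.0, 1.0, 1.0, 1.0 };
    double w[4], wkopt, *work;
    int i;

    /* Workspace query: lwork = -1 returns the optimal size in wkopt. */
    dsyev("V", "U", &n, a, &lda, w, &wkopt, &lwork, &info);

    lwork = (MKL_INT)wkopt;
    work = (double *)mkl_malloc((size_t)lwork * sizeof(double), 64);

    dsyev("V", "U", &n, a, &lda, w, work, &lwork, &info);
    if (info != 0)
        fprintf(stderr, "dsyev failed, info = %d\n", (int)info);

    for (i = 0; i < (int)n; ++i)
        printf("lambda[%d] = %f\n", i, w[i]);

    mkl_free(work);
    return 0;
}
```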
I noticed that the execution time of my program is no different between the Intel compiler and gcc when I use Intel MKL functions. Are these functions already using SSE? (I'm new to icc and Intel MKL.)
My program runs very fast (I use one heev function) using one core of one of my 2 CPUs. Since I work on small matrices, in my old program (without MKL) I placed one heev on each of the 8 cores with pragma sections and got a speedup of about 6.5.
Now I'd like to do the same with MKL, but something must be wrong in my compile options. When I compile with:
icc -openmp -o heev_omp main_8_complex_openmp.c -I${DIR_EVD} -L${MKLROOT} -I${INCLUDE} -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lmkl_lapack -lrt -lm
my program uses all the cores, but execution time is 20% slower. I tried all the linking examples given in the MKL user's guide, but none worked out. I tried setting MKL_NUM_THREADS to 1 with OMP_NUM_THREADS still at 8, so that each function call uses 1 thread, but again without result.
MKL is still running 2 times slower on 1 core than my old program on 8 cores.
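To be clear, this is the setup I'm trying to get (a stripped-down sketch, not my actual code): MKL itself single-threaded, and OpenMP spreading the independent calls over the 8 cores.

```c
/* Sketch of the intended threading setup. With -lmkl_sequential, MKL is
 * single-threaded anyway, so only the OpenMP setting should matter for
 * the "one heev per core" layout. */
#include <stdio.h>
#include <omp.h>
#include <mkl.h>

int main(void)
{
    mkl_set_num_threads(1);   /* same effect as MKL_NUM_THREADS=1 */
    omp_set_num_threads(8);   /* same effect as OMP_NUM_THREADS=8 */

    #pragma omp parallel
    {
        #pragma omp single
        printf("OpenMP threads: %d, MKL threads per call: %d\n",
               omp_get_num_threads(), mkl_get_max_threads());
    }
    return 0;
}
```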
I can't understand why. When I look at the execution times, with 1 core my 8 heevs (128x128) are done in 29 ms, so about 3.5 ms per heev. This time is quite stable when I put those 8 heevs in a loop. When using 8 cores, the time for 8 heevs, 1 on each core, is about 40 ms, but it's not stable at all: sometimes it's 7 ms (less than 1 ms per heev) and sometimes 60 ms. That's why I think there's a problem.
Before using MKL, I had my own heev code based on LAPACK functions. 8 heevs took 280 ms running on 1 core, but when placing 1 heev on each thread with OpenMP, the time was 44 ms (the 6.35x speedup).
Using the Intel compiler gave a 3.6x speedup on both the single-threaded and the OpenMP program. So right now I'm running FASTER with my old program + icc than with Intel MKL.
I tried KMP_AFFINITY=granularity=fine,compact in my environment variables, but nothing changed.
It's too bad if I can't use all the cores. We're going to receive a new computer with 2 Xeon X5670s, and I won't be able to use the 12 physical cores :(
nostradamus,
Performance depends on the placement of the input and output arrays in the cache. By default, compilers conserve memory and don't align arrays to the cache-line boundary (64 bytes). However, you can do it yourself easily. If you allocate the arrays at runtime, use mkl_malloc(N, 64) to request memory. If the arrays are local variables, put __declspec(align(64)) before the declarations.
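For example (array names and sizes are arbitrary, just to show the two options):

```c
#include <mkl.h>

#define N 128

/* Aligning the declaration itself (works for local or static arrays).
 * __declspec(align(64)) is the spelling the Intel compiler uses on
 * Windows; with icc/gcc on Linux the equivalent is
 * __attribute__((aligned(64))). */
__attribute__((aligned(64))) static double b[N * N];

void alignment_example(void)
{
    /* Heap allocation aligned to a 64-byte cache line. */
    double *a = (double *)mkl_malloc(N * N * sizeof(double), 64);

    /* ... fill a, call dsyev, ... */

    mkl_free(a);
}
```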
Performance also depends on the presence of the input array in the cache of each CPU core. To stabilize the state of the cache from run to run, you may try calling dsyev once on each CPU core before you start the actual timing, though this is expected to have less effect since you mentioned that you already call dsyev in a loop.
If you think the problem is specifically with LAPACK performance, rather than with performance stability in general, then please post a reproducer here...
That was my mistake!! I created 8 work arrays, but I used the same array in each function call, so every heev was trying to write into the same array, and that caused the bad timings.
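Roughly, the corrected layout looks like this (not my exact code, just the shape of the fix: one workspace per matrix/thread):

```c
/* Each of the 8 independent zheev calls gets its own work/rwork buffers
 * instead of all of them writing into one shared workspace. */
#include <omp.h>
#include <mkl.h>

#define NMAT  8                 /* one matrix per thread         */
#define N     128
#define LWORK (2 * N)           /* zheev needs lwork >= 2*N - 1  */

static MKL_Complex16 a[NMAT][N * N];      /* the 8 Hermitian matrices */
static double        w[NMAT][N];          /* eigenvalues              */
static MKL_Complex16 work[NMAT][LWORK];   /* one workspace per matrix */
static double        rwork[NMAT][3 * N];  /* zheev needs >= 3*N - 2   */

void solve_all(void)
{
    int i;

    #pragma omp parallel for
    for (i = 0; i < NMAT; ++i) {
        MKL_INT n = N, lda = N, lwork = LWORK, info;
        zheev("V", "U", &n, a[i], &lda, w[i], work[i], &lwork,
              rwork[i], &info);
    }
}
```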
Once corrected, I got 208 ms for 180 zheevs, which is even better than I expected.
What you wrote is very interesting; I'm going to try it to see the impact on performance.
Thank you :)))
