topic LU factorisation with OpenMP threaded functions(dgetrf) de MKL in Intel® oneAPI Math Kernel Library

LU factorisation with OpenMP threaded functions(dgetrf) de MKL

cfd_jinx — Fri, 05 Aug 2011 14:42:21 GMT

Hi,

I have a little problem but i cannot find any solutions after hours of searching on Internet. Maybe i misunderstood some concepts.

I want to increase the speed of LU factorisation of a matrix A of a system A*x=b. With the OpenMP threaded version of DGETRF, if i don't misunderstand, this is included in the MKL. I don't know how to put it in my openmp code, and how to use it. This is what i do at the moment :

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
dim_tot = (M+1)*(N+1)

!$OMP PARALLEL PRIVATE(rang)

!$OMP SINGLE
nthr = OMP_GET_NUM_THREADS()
print*, "******"
print*, "Nomber of threads being used = ", nthr
!$OMP END SINGLE

rang = OMP_GET_THREAD_NUM()
print *,"Thread No.",rang," is working !"

!! Decomposition LU
call DGETRF( dim_tot, dim_tot, G, dim_tot, IPIVG, INFOG )

!$OMP END PARALLEL
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

The Makefile looks like :

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

MKLPATH = /opt/intel/composerxe-2011.2.137/mkl/lib/intel64
MKLINCLUDE = /opt/intel/composerxe-2011.2.137/mkl/include

# Compilateur
IFORT = ifort

IOPT = -c -openmp

# Linkers
ILINK = -L$(MKLPATH) -I$(MKLINCLUDE) -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread

# edition de liens
code_omp : code_omp.o
$(IFORT) -o code_omp code_omp.o $(ILINK)

# compilation
code_omp.o : code_omp.f90
$(IFORT) $(IOPT) code_omp.f90
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

I think the LU factorisation ("dgetrf") asks more time than the solving precedure ("dgetrs"), no?
That's why i want to threading the "dgetrf", instead of "dgetrs".
Can someone give me some inspirations?

-- Xin

LU factorisation with OpenMP threaded functions(dgetrf) de MKL

TimP — Fri, 05 Aug 2011 15:57:04 GMT

According to my understanding of the way you set this up, you are asking each thread to execute dgetrf, which could be an excellent strategy provided that the argument arrays are distinct in each thread.. If you don't set omp nested, which defaults off, MKL will not start new threads inside dgetrf.
If you want dgetrf to work on a single data set, using additional threads internal to itself, you would call it outside a parallel region, as Todd said.
If you don't have enough parallel cases to use all your cores, you could set omp nested; you would want to give each problem its own contiguous group of cores, using both the OpenMP and MKL settings for thread numbers suggested by Todd. You would want to get it working first without omp nested so as to have a basis for comparison.

LU factorisation with OpenMP threaded functions(dgetrf) de MKL

Todd_R_Intel — Fri, 05 Aug 2011 16:00:14 GMT

Hello Xin,

The function DGETRF is in Intel MKL and has been threaded so that you can use it from your program and get parallelism without the use of any OpenMP*directives in your code. All you need to do is call DGETRF and since threading is turned on to use as many cores as are available you should see MKL parallelism at work!

Note: Threading may not be used if the matrix is too small to be efficiently divided among threads. You can use the omp_set_num_threads() or mkl_set_num_threads() functions to change the number of threads and see if you can note the change in performance or see how the load changes in a performance monitoring tool.

Todd

LU factorisation with OpenMP threaded functions(dgetrf) de MKL

cfd_jinx — Thu, 11 Aug 2011 09:50:18 GMT

Thanks a lot for both of you for the quick responses.

I've re-tried the library again today. Seems that nothing has to be changed in the code, but just the compilation's linker line.

After comparaisons, i found that the mkl improves A LOT the calculation performances. Gnial !

More details : with 8 processors, a matrix of 16200*16200 and one right-hand side, the total calculation time of (dgetrf+dgetrs) has been reduced 10 times. Normal?

Xin

LU factorisation with OpenMP threaded functions(dgetrf) de MKL

Dan4 — Thu, 11 Aug 2011 11:07:02 GMT

10 times faster than what? Are you comparing with the sequential MKL? Do you have 8 physical processors or you are using Hyper-Threading? Anyway, 10 times faster doesn't seem to be reasonable. In best case you will get linear speedup. What you have got is somehow super-linearity, but it happens very rarely and usually when you are processing large amount of data. Your matrix is not that large. You can also validate the results to make sure that you have called the routines correctly.

D.