LU factorisation with MKL's OpenMP-threaded functions (dgetrf)
I have a small problem, but I cannot find any solution after hours of searching on the Internet. Maybe I have misunderstood some concepts.
I want to speed up the LU factorisation of a matrix A in a system A*x = b. If I understand correctly, an OpenMP-threaded version of DGETRF is included in MKL. I don't know how to use it from my OpenMP code. This is what I do at the moment:
I think the LU factorisation ("dgetrf") takes more time than the solve step ("dgetrs"), doesn't it? That's why I want to thread "dgetrf" rather than "dgetrs". Can someone give me some pointers?
According to my understanding of the way you have set this up, you are asking each thread to execute dgetrf, which can be an excellent strategy provided that the argument arrays are distinct in each thread. If you don't enable nested OpenMP parallelism (it defaults to off), MKL will not start new threads inside dgetrf. If you want dgetrf to work on a single data set, using additional threads internally, you would call it outside a parallel region, as Todd said. If you don't have enough parallel cases to use all your cores, you could enable nesting; you would then want to give each problem its own contiguous group of cores, using both the OpenMP and MKL thread-number settings that Todd suggested. You would want to get it working first without nesting, so you have a basis for comparison.
The function DGETRF in Intel MKL has been threaded, so you can use it from your program and get parallelism without any OpenMP* directives in your code. All you need to do is call DGETRF; since threading is turned on to use as many cores as are available, you should see MKL's parallelism at work!
Note: threading may not be used if the matrix is too small to be divided efficiently among threads. You can use omp_set_num_threads() or mkl_set_num_threads() to change the number of threads, and then check whether performance changes or watch how the load changes in a performance-monitoring tool.
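As an alternative to the API calls, the thread counts can also be set through environment variables before launching the program. A minimal sketch, where `./my_lu` is a placeholder name for your MKL-linked executable:

```shell
export OMP_NUM_THREADS=8   # threads for your own OpenMP regions
export MKL_NUM_THREADS=8   # threads MKL may use internally in dgetrf
./my_lu                    # placeholder for your executable
```

This lets you rerun the same binary with different thread counts when looking for the performance change described above.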
10 times faster than what? Are you comparing with sequential MKL? Do you have 8 physical processors, or are you using Hyper-Threading? In any case, a 10x speedup doesn't seem reasonable: in the best case you get linear speedup. What you have is super-linear speedup, which happens very rarely, and usually only when processing a large amount of data; your matrix is not that large. You should also validate the results to make sure you have called the routines correctly.