I recompiled my installation of numpy and scipy against Intel MKL. I am trying to speed up a script that fits tensors for DT MRI. The script bottlenecks during the SVD operation; in particular, the call to numpy.linalg.lapack_lite.dgesdd inside the svd function is very slow. Slow in the sense that I previously ran this calculation using the default ATLAS, and now with MKL the speedup is negligible. The thing I noticed is that, like ATLAS, Intel MKL is only using one core for the bulk of the SVD calculations.
I found this topic
and he says that because SVD is BLAS 2 it is usually single threaded, but will be multithreaded on newer processors. I have an Intel i7 2630QM (Sandy Bridge architecture) processor and am running the latest Intel MKL build (11). Should I be seeing multiple threads in use, and if so, how can I enable that? I'm not sure what other information would be helpful, but I can provide whatever you need to help me. Thanks in advance!
What exact singular value decomposition routines are you using? Some of them are threaded and some are not. It would also be interesting to know the typical problem size you are dealing with.
Do you mean the svd routine inside MKL? I mentioned the function being called in numpy is numpy.linalg.lapack_lite.dgesdd. I am not quite sure which function this links to in Intel MKL, but I think dgesdd is a standard LAPACK routine, or is that incorrect?
As for the problem size, one example matrix was 1977359 by 56.
Which version of NumPy are you using? Why didn't you call numpy.linalg.svd? I don't know where numpy.linalg.lapack_lite.dgesdd came from. It's not found in the latest NumPy documentation. Please try numpy.linalg.svd.
In our own NumPy benchmarking, we see at least 1.9x speedup for SVD when replacing ATLAS with MKL in NumPy 1.6.2, on a quad-core Ivy Bridge CPU. Your mileage may vary but you should see noticeable performance improvement. You can grab our test code and have a try: http://software.intel.com/file/41177
I'm using numpy 1.7
The call to dgesdd occurs during my call to the DT MRI library (dipy). It calls numpy.linalg.pinv to calculate a pseudoinverse, and from there it does the SVD computations. I looked at the source code and I am calling numpy.linalg.svd, but it calls dgesdd (which looks like a low-level C subroutine).
At line 1315, it shows the branch where the assignment and subsequent call is made.
I'm going to try the benchmarks and downgrading to numpy 1.6 to see if that causes any changes.
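To make the call chain described above concrete, here is a minimal sketch of how a pseudoinverse can be built from an SVD. It is an illustration of the general technique, not a copy of NumPy's own source; the helper name pinv_via_svd and the 56 by 7 test matrix are assumptions chosen for the example, and it assumes a real-valued matrix and a reasonably recent NumPy.

```python
import numpy as np

def pinv_via_svd(a, rcond=1e-15):
    """Moore-Penrose pseudoinverse of a real 2-D matrix via a reduced SVD.

    Hypothetical helper for illustration; numpy.linalg.pinv is the real API.
    """
    # Reduced SVD: a = u @ diag(s) @ vt. In an MKL or ATLAS build this is
    # where the LAPACK routine (dgesdd for float64) does the actual work.
    u, s, vt = np.linalg.svd(a, full_matrices=False)

    # Invert only the singular values above a relative cutoff.
    cutoff = rcond * s.max()
    s_inv = np.zeros_like(s)
    mask = s > cutoff
    s_inv[mask] = 1.0 / s[mask]

    # pinv(a) = V @ diag(1/s) @ U^T
    return (vt.T * s_inv) @ u.T

rng = np.random.default_rng(0)
a = rng.standard_normal((56, 7))
p = pinv_via_svd(a)
print(p.shape)  # (7, 56)
```

So a call to pinv is, in effect, one SVD plus two small matrix products, which is why the profile ends up pointing at dgesdd.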
I'm using numpy 1.7
I looked at the code, and dgesdd is a low-level subroutine called inside numpy.linalg.svd.
If you go to line 1315 you can see the branch where dgesdd is used to calculate the SVD inside the numpy.linalg.svd function.
I'm making a call to a DT MRI library (dipy), which calls numpy.linalg.pinv (pseudoinverse), which calculates the SVD. So I am using the function you mentioned. Does your benchmark make use of multiple cores? If not, then I guess it's just that the mileage I get out of it is negligible.
But I'm going to try numpy 1.6 and see what happens. I've asked a friend to try on a more powerful machine too.
So we are calling into the same SVD function.
Unless you linked with the sequential MKL when you built your NumPy, or you specified single-thread execution at run time, by default it does use multiple cores. See the attached picture for our SVD performance chart. BTW, how did you link with MKL? Did you do something to the same effect as setting OMP_NUM_THREADS=1 at run time?
NumPy 1.7 and 1.6.2 should not be much different in terms of SVD performance. You'd better first run the benchmark I gave you on your NumPy 1.7 and see what happens. Note that our test used square matrices. Your matrix is very tall and skinny. This may make a difference. I'll be very interested in knowing.
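Since the shape of the matrix may matter, a quick way to compare is to time the same SVD call on a square matrix and on a tall, skinny one. The sketch below is not the Intel benchmark linked earlier; the function name time_svd and the sizes are assumptions picked to run in a few seconds, and it assumes a recent NumPy. It uses full_matrices=False so the tall case does not allocate a huge square U.

```python
import time
import numpy as np

def time_svd(shape, repeats=3, seed=0):
    """Return the best wall-clock time in seconds of a reduced SVD
    on a random float64 matrix of the given shape."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(shape)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        np.linalg.svd(a, full_matrices=False)
        best = min(best, time.perf_counter() - start)
    return best

if __name__ == "__main__":
    # Square, tall-and-skinny, and tiny cases (sizes are illustrative only).
    for shape in [(500, 500), (20000, 56), (56, 7)]:
        print(shape, "%.4f s" % time_svd(shape))
```

Watching a CPU monitor while each case runs shows whether the linked BLAS/LAPACK is actually using more than one core for that shape.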
I linked to mkl using the instructions from this page: http://software.intel.com/en-us/articles/numpyscipy-with-intel-mkl
I'm certain it linked, because when I run numpy.show_config() it shows mkl_rt as the library, along with pthread. I've checked OMP_NUM_THREADS and the other environment variable; both are blank, which I believe means it will default to the number of physical cores, according to what I've googled. I'll run your benchmarks on the MKL and ATLAS configurations and tell you the difference. I didn't realize the benchmarks given were for square matrices, so that may very well be the issue.
One more thing I should mention, since I don't think I was clear about this: I'm profiling the code, and it runs across multiple cores until it gets to the SVD calculation, and then it just loads one core to 100% and sits there until the end of the program (the rest finishes very quickly).
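For anyone who wants to repeat the environment check described above, here is a small sketch that reports the thread-related variables from inside Python. The helper name threading_env is hypothetical; the variable names are the standard OpenMP and MKL ones, and an unset variable leaves the library on its default threading behaviour.

```python
import os

# Environment variables that can limit or control MKL/OpenMP threading.
MKL_THREAD_VARS = (
    "OMP_NUM_THREADS",
    "MKL_NUM_THREADS",
    "MKL_DYNAMIC",
    "MKL_DOMAIN_NUM_THREADS",
)

def threading_env(environ=os.environ):
    """Return a dict mapping each variable name to its value, or None if unset."""
    return {name: environ.get(name) for name in MKL_THREAD_VARS}

if __name__ == "__main__":
    for name, value in threading_env().items():
        print(name, "=", value if value is not None else "(unset)")
    # Logical CPUs, which may be twice the physical core count with Hyper-Threading.
    print("logical CPUs seen by Python:", os.cpu_count())
```

Running this in the same shell that launches the script rules out a stray setting such as OMP_NUM_THREADS=1.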
I figured out the problem. It is somewhat specific to my problem and stupidity, but I'll post it anyway in case somebody else has the same problem. It turns out that while my matrix size is very large, the library I'm using, dipy, actually breaks up the matrix in Python code and runs calculations on it using a for loop. Since this is done in Python code it is very slow, and what I was actually running Intel MKL on was over 2 million separate 56 by 7 matrices, so the negligible speedup and lack of proper core usage make perfect sense (I think the marginal improvement I did see probably says something in favor of Intel MKL). I'm going to try to go a few levels up, throw the entire matrix at it, and see what the results are. I'll post back when I do.
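As a rough illustration of the pattern described above (this is not dipy's actual code), the sketch below computes pseudoinverses of many small 56 by 7 matrices twice: once with a Python for loop, and once with a single call on the stacked array. The stacked form assumes a newer NumPy than the 1.7 discussed in this thread, since numpy.linalg.pinv only accepts stacks of matrices in later releases (1.14 onward, as far as I know). The stack size of 1000 and the random data are assumptions to keep the example fast.

```python
import numpy as np

rng = np.random.default_rng(0)
# 1000 small matrices standing in for the ~2 million per-voxel problems.
stack = rng.standard_normal((1000, 56, 7))

# Pattern from the thread: one tiny SVD per Python loop iteration.
pinv_loop = np.array([np.linalg.pinv(m) for m in stack])

# Stacked call: the loop over matrices happens inside NumPy instead of Python.
pinv_batch = np.linalg.pinv(stack)

print(pinv_batch.shape)  # (1000, 7, 56)
print(np.allclose(pinv_loop, pinv_batch))
```

The stacked call removes the Python loop overhead, but each 56 by 7 SVD is still far too small to give MKL's threading anything to work with, so by itself it should not be expected to spread the work across cores.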