Intel® oneAPI Math Kernel Library

LAPACKE_sgesdd stops using threads for 10k x 10k matrix

Igor_C_Intel
Employee

Hello,

Calling LAPACKE_sgesdd with input matrices of different sizes, I've noticed that beyond a certain dimension the computation runs in a single thread.

Attached is code that calls the function on a matrix filled with random numbers drawn uniformly from [0, 1] and measures the execution time.
The project archive is available on my Google Drive.

For a matrix with 10000 columns, there is a sharp performance drop when the number of rows reaches 9000. This effect does not appear if the same code is compiled with the Intel compiler. Is there any way to make the code work with the MS compiler too?

>SVDProblem.exe 8000 10000
Time taken: 57.906423 s.

>SVDProblem.exe 8500 10000
Time taken: 63.765770 s.

>SVDProblem.exe 9000 10000
Time taken: 257.664138 s.

Hardware:
Intel Core i7-6950X, 64 GB RAM

Software:
MKL 2017 Update 1 (statically linked mkl_core.lib, mkl_intel_lp64.lib, mkl_intel_thread.lib)
Visual Studio 2015 Update 3, Intel Compiler 17.0 (libiomp5md.lib is statically linked, libiomp5md.dll is copied to the binary folder)
Windows 7 Enterprise Service Pack 1

Thank you! 
Igor

Gennady_F_Intel
Moderator

Thanks, Igor. We will have a look at the problem ASAP.

Gennady_F_Intel
Moderator

Igor, I checked the behavior on the two systems available right now, one with 2 threads and one with 24. I only added calls to mkl_get_version and mkl_get_max_threads to report some needed details.

Below is what I see on my side:

_cl.exe 8500 10000

Major version:           2017
Minor version:           0
Update version:          1
Product status:          Product
Build:                   20161005
Platform:                Intel(R) 64 architecture
Processor optimization:  Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors
================================================================

n_rows = 8500
n_columns = 10000
 MKL #threads == 24
Time taken: 77.126904 s.

_cl.exe 9000 10000

MKL #threads == 24

Time taken: 82.861420 s.

cl version
Microsoft (R) C/C++ Optimizing Compiler Version 18.00.21005.1 for x64

Gennady_F_Intel
Moderator

And similar results with 2 threads:

_cl.exe 8500 10000
Major version:           2017
Minor version:           0
Update version:          1
Product status:          Product
Build:                   20161005
Platform:                Intel(R) 64 architecture
Processor optimization:  Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors
================================================================

n_rows = 8500
n_columns = 10000
 MKL #threads == 2
Time taken: 323.315681 s.

_cl.exe 9000 10000

n_rows = 9000
n_columns = 10000
 MKL #threads == 2
Time taken: 375.469913 s.

Igor_C_Intel
Employee

Gennady, thanks a lot for the prompt answer.
I inserted a call to the MKL_Get_Max_Threads routine into my code and the problem disappeared.

After some experiments:
If MKL_Get_Max_Threads is called at the start, it returns 10 and the SVD uses 10 threads.
If MKL_Get_Max_Threads is called just before the LAPACKE_sgesdd call, it returns 1 and the calculation runs in a single thread.

The debugger shows that no threads are created until the LAPACKE_sgesdd call in both cases, so a race condition is excluded. Could this be attributed to the unspecified order of static variable initialization in the MKL libraries?

Also, the problem seems to be very uncommon: a laptop, another desktop, and even a virtual machine installed on the problematic desktop all work flawlessly. I'm going to try it on peers' computers and share an update. Anyway, I have a working solution now (call MKL_Get_Max_Threads in advance), so the problem is not urgent anymore.

P.S. 
MKL version: Intel(R) Math Kernel Library Version 2017.0.1 Product Build 20161005 for Intel(R) 64 architecture applications
Compiler: Microsoft (R) C/C++ Optimizing Compiler Version 19.00.24215.1 for x64
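
For anyone hitting the same symptom, here is a minimal sketch of the workaround described above, not the code from the attached project. It assumes the thread count is read with mkl_get_max_threads() at the very start of main() and read again just before the SVD call. The helper name threads_to_request is made up for illustration, and restoring the count with mkl_set_num_threads() is an extra assumption: in the experiments above, the early call alone was enough, and the restore step has not been verified on the affected machine. The helper itself has no MKL dependency; the MKL calls appear only in the usage comment.

```c
#include <assert.h>

/* Hypothetical helper (name invented for this sketch).
 * at_startup: thread count reported at the very start of the program.
 * now:        thread count reported right before the LAPACK call.
 * Returns the thread count to request again if the count has dropped,
 * or 0 if nothing needs to be done. */
int threads_to_request(int at_startup, int now)
{
    assert(at_startup >= 1 && now >= 1);
    return (now < at_startup) ? at_startup : 0;
}

/* Intended use with MKL (requires <mkl.h>; shown as a comment so the
 * helper above can be compiled without MKL installed):
 *
 *   int at_startup = mkl_get_max_threads();   // first statement in main()
 *
 *   // ... parse arguments, allocate and fill the matrix ...
 *
 *   int want = threads_to_request(at_startup, mkl_get_max_threads());
 *   if (want != 0)
 *       mkl_set_num_threads(want);            // assumption: untested here
 *
 *   // ... call LAPACKE_sgesdd and time it ...
 */
```

Printing both values of mkl_get_max_threads() next to the measured time also makes it easy to see in the log whether a slow run was a single-threaded one.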

 

 

Igor_C_Intel
Employee

I've just found a similar symptom description at https://svn.artisynth.org/svn/artisynth_core/trunk/src/artisynth/core/driver/Main.java :

/**
    * On Windows, we have sometimes seen that Pardiso getNumThreads() needs to
    * be called early, or otherwise the maximum number of threads returned by
    * mkl_get_max_threads() becomes fixed at 1. In particular, we seem to have
    * to do this before models are loaded.
*/

Gennady_F_Intel
Moderator

Igor, I still couldn't reproduce the issue on my side on the different systems available. However, I use cl version 18.00.21005.1 for x64, and that is the only difference I see. I will ask the owner of this code to help. We will keep you updated. Thanks for the case.
