
Igor_C_Intel · ‎02-13-2017

Hello,

While calling LAPACKE_sgesdd for input matrices of different sizes, I noticed that beyond a certain dimension the computation runs in a single thread.

The attached code calls the function on a matrix filled with random numbers drawn uniformly from [0, 1] and measures the execution time.
The project archive is available on my Google Drive.

For a matrix with 10000 columns, performance drops sharply when the number of rows reaches 9000. The effect does not appear if the same code is compiled with the Intel compiler. Is there any way to make the code work with the MS compiler too?

>SVDProblem.exe 8000 10000
Time taken: 57.906423 s.

>SVDProblem.exe 8500 10000
Time taken: 63.765770 s.

>SVDProblem.exe 9000 10000
Time taken: 257.664138 s.

Hardware:
Intel Core i7-6950X, 64 GB RAM

Software:
MKL 2017 Update 1 (statically linked mkl_core.lib, mkl_intel_lp64.lib, mkl_intel_thread.lib)
Visual Studio 2015 Update 3, Intel Compiler 17.0 (libiomp5md.lib is statically linked, libiomp5md.dll is copied to the binary folder)
Windows 7 Enterprise Service Pack 1

Thank you!
Igor

Gennady_F_Intel · ‎02-13-2017

Thanks, Igor. We will take a look at the problem ASAP.

Gennady_F_Intel · ‎02-13-2017

Igor, I checked the behavior on the two systems available right now, one with 2 threads and one with 24. I only added the mkl_version and mkl_get_max_threads routines to report some needed details.

Below is what I see on my side:

_cl.exe 8500 10000

Major version: 2017
Minor version: 0
Update version: 1
Product status: Product
Build: 20161005
Platform: Intel(R) 64 architecture
Processor optimization: Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors
================================================================

n_rows = 8500
n_columns = 10000
MKL #threads == 24
Time taken: 77.126904 s.

_cl.exe 9000 10000

MKL #threads == 24

Time taken: 82.861420 s.

cl version
Microsoft (R) C/C++ Optimizing Compiler Version 18.00.21005.1 for x64
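
For anyone who wants to produce the same kind of report, here is a minimal sketch. It assumes that "mkl_version" above refers to the MKL service routine mkl_get_version; report_mkl_config and the USE_MKL switch are hypothetical names of mine, not part of Gennady's test. Built without USE_MKL, the stand-ins only let the sketch compile and print placeholder values.

```c
#include <stdio.h>

#ifdef USE_MKL
#include <mkl.h>  /* MKLVersion, mkl_get_version, mkl_get_max_threads */
#else
/* Hypothetical stand-ins (not MKL) so the sketch builds without the library. */
typedef struct {
    int MajorVersion, MinorVersion, UpdateVersion;
    const char *ProductStatus, *Build, *Processor, *Platform;
} MKLVersion;
static void mkl_get_version(MKLVersion *v)
{
    v->MajorVersion = v->MinorVersion = v->UpdateVersion = 0;
    v->ProductStatus = v->Build = v->Processor = v->Platform = "stub";
}
static int mkl_get_max_threads(void) { return 1; }
#endif

/* Print the MKL build details and the thread count MKL reports;
   return that thread count so the caller can log or check it. */
int report_mkl_config(FILE *out)
{
    MKLVersion v;
    int threads;

    mkl_get_version(&v);
    fprintf(out, "Major version: %d\n", v.MajorVersion);
    fprintf(out, "Minor version: %d\n", v.MinorVersion);
    fprintf(out, "Update version: %d\n", v.UpdateVersion);
    fprintf(out, "Product status: %s\n", v.ProductStatus);
    fprintf(out, "Build: %s\n", v.Build);
    fprintf(out, "Platform: %s\n", v.Platform);
    fprintf(out, "Processor optimization: %s\n", v.Processor);

    threads = mkl_get_max_threads();
    fprintf(out, "MKL #threads == %d\n", threads);
    return threads;
}
```

To use the real library, define USE_MKL at compile time (for example /DUSE_MKL with cl), link MKL as usual, and call report_mkl_config(stdout) before the SVD.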

Gennady_F_Intel · ‎02-13-2017

And similar results with 2 threads:

_cl.exe 8500 10000
Major version: 2017
Minor version: 0
Update version: 1
Product status: Product
Build: 20161005
Platform: Intel(R) 64 architecture
Processor optimization: Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors
================================================================

n_rows = 8500
n_columns = 10000
MKL #threads == 2
Time taken: 323.315681 s.

_cl.exe 9000 10000

n_rows = 9000
n_columns = 10000
MKL #threads == 2
Time taken: 375.469913 s.

Igor_C_Intel · ‎02-13-2017

Gennady, thanks a lot for the prompt answer.
I inserted a call to the MKL_Get_Max_Threads routine into my code and the problem disappeared.

After some experiments:
If MKL_Get_Max_Threads is called at the start of the program, it returns 10 and the SVD uses 10 threads.
If MKL_Get_Max_Threads is called just before the LAPACKE_sgesdd call, it returns 1 and the computation runs in a single thread.

The debugger shows that no threads are created before the LAPACKE_sgesdd call in either case,
so a race condition is ruled out. Could this be attributed to the unspecified initialization order of static variables in the MKL libraries?

Also, the problem seems to be very uncommon: a laptop, another desktop, and even a virtual machine installed on
the problematic desktop all work flawlessly. I'm going to try it on colleagues' computers and share an update. Anyway, I have a working
workaround now (calling MKL_Get_Max_Threads in advance), so the problem is no longer urgent.

P.S.
MKL version: Intel(R) Math Kernel Library Version 2017.0.1 Product Build 20161005 for Intel(R) 64 architecture applications
Compiler: Microsoft (R) C/C++ Optimizing Compiler Version 19.00.24215.1 for x64
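
The workaround described above could look roughly like this. It is only a sketch: query_threads_early, threads_unchanged and the USE_MKL switch are hypothetical names, and the stand-in returning 10 exists only so the code compiles without MKL. The one real MKL call is mkl_get_max_threads.

```c
#include <stdio.h>

#ifdef USE_MKL
#include <mkl.h>  /* real mkl_get_max_threads() */
#else
/* Hypothetical stand-in (not MKL) so the sketch builds without the library. */
static int mkl_get_max_threads(void) { return 10; }
#endif

/* Call this as the first statement of main(), before any other setup.
   In the case reported here, the early query alone was enough for the
   later SVD call to run multithreaded. */
int query_threads_early(void)
{
    return mkl_get_max_threads();
}

/* Call this right before LAPACKE_sgesdd. Returns 1 if MKL still reports
   the thread count seen at startup, 0 (with a warning) otherwise. */
int threads_unchanged(int early)
{
    int now = mkl_get_max_threads();
    if (now != early)
        fprintf(stderr, "warning: MKL max threads changed from %d to %d\n",
                early, now);
    return now == early;
}
```

In the real program, store the result of query_threads_early() at startup and pass it to threads_unchanged() just before the SVD; a warning at that point would indicate that the early query did not help on that machine.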

Igor_C_Intel · ‎02-13-2017

I've just found a similar symptom description at
https://svn.artisynth.org/svn/artisynth_core/trunk/src/artisynth/core/driver/Main.java :

/**
    * On Windows, we have sometimes seen that Pardiso getNumThreads() needs to
    * be called early, or otherwise the maximum number of threads returned by
    * mkl_get_max_threads() becomes fixed at 1. In particular, we seem to have
    * to do this before models are loaded.
*/

Gennady_F_Intel · ‎02-14-2017

Igor, I still couldn't reproduce the issue on my side on the different systems available. However, I use cl version 18.00.21005.1 for x64, and that is the only difference I see. I will ask the owner of this code to help. We will keep you updated. Thanks for reporting the case.

LAPACKE_sgesdd stops using threads for 10k x 10k matrix