I am facing performance issues with the function dgesvd when running in 64-bit with AVX2 (MKL_CBWR=AVX2).
For some matrix sizes, the SVD takes 25 times longer in 64-bit than in 32-bit!
You can reproduce this with the attached test. On my side I get these durations for one SVD on an m x n matrix:
There is no problem with MKL_CBWR=AVX.
Could you please have a look ?
I forgot to specify: I am using the sequential mode.
Here are the outputs with MKL_VERBOSE=1 for one SVD:
MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Win 2.70GHz cdecl sequential
MKL_VERBOSE DGESVD(A,A,103,103,0000000000FE5E40,103,0000000000FFAA40,0000000000FFAE00,103,000000000100FA00,103,0000000000CFF520,-1,0) 147.89us CNR:AVX2 Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DGESVD(A,A,103,103,0000000001055300,103,0000000000FFAA40,0000000001069F80,103,000000000107EB80,103,000000000104E200,3605,0) 112.26ms CNR:AVX2 Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE Intel(R) MKL 2019.0 Update 4 Product build 20190411 for 32-bit Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Win 2.70GHz sequential
MKL_VERBOSE DGESVD(A,A,103,103,01045D40,103,0105A900,0105ACC0,103,0106F8C0,103,004FF604,-1,0) 116.49us CNR:AVX2 Dyn:1 FastMM:1 TID:0 NThr:1
MKL_VERBOSE DGESVD(A,A,103,103,010B5180,103,0105A900,010C9D80,103,010DE980,103,010AE000,3605,0) 4.64ms CNR:AVX2 Dyn:1 FastMM:1 TID:0 NThr:1
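For anyone comparing such traces across many matrix sizes, here is a minimal sketch in Python that pulls the call timings out of the MKL_VERBOSE output and computes the 64-bit / 32-bit ratio. The regular expression is only an assumption based on the lines shown in this thread, not a documented format, and the helper names (parse_calls, compute_time) are hypothetical. Workspace queries (lwork = -1, the second-to-last argument) are skipped so that only the actual factorization is timed.

```python
import re

# Assumed layout of a call line, inferred from the traces in this thread:
#   MKL_VERBOSE ROUTINE(arg,arg,...) <time><unit> CNR:... TID:... NThr:...
CALL = re.compile(r"MKL_VERBOSE\s+(\w+)\(([^()]*)\)\s+([\d.]+)(us|ms|s)\b")
TO_SECONDS = {"us": 1e-6, "ms": 1e-3, "s": 1.0}

def parse_calls(log):
    """Return a list of (routine, args, seconds) for every call found in log."""
    calls = []
    for name, args, value, unit in CALL.findall(log):
        calls.append((name, args.split(","), float(value) * TO_SECONDS[unit]))
    return calls

def compute_time(log, routine="DGESVD"):
    """Sum the time of the given routine, ignoring workspace queries (lwork == -1)."""
    return sum(t for name, args, t in parse_calls(log)
               if name == routine and args[-2] != "-1")

# The call lines copied from the two traces above.
LOG_64 = (
    "MKL_VERBOSE DGESVD(A,A,103,103,0000000000FE5E40,103,0000000000FFAA40,"
    "0000000000FFAE00,103,000000000100FA00,103,0000000000CFF520,-1,0) "
    "147.89us CNR:AVX2 Dyn:1 FastMM:1 TID:0 NThr:1\n"
    "MKL_VERBOSE DGESVD(A,A,103,103,0000000001055300,103,0000000000FFAA40,"
    "0000000001069F80,103,000000000107EB80,103,000000000104E200,3605,0) "
    "112.26ms CNR:AVX2 Dyn:1 FastMM:1 TID:0 NThr:1\n"
)
LOG_32 = (
    "MKL_VERBOSE DGESVD(A,A,103,103,01045D40,103,0105A900,0105ACC0,103,"
    "0106F8C0,103,004FF604,-1,0) 116.49us CNR:AVX2 Dyn:1 FastMM:1 TID:0 NThr:1\n"
    "MKL_VERBOSE DGESVD(A,A,103,103,010B5180,103,0105A900,010C9D80,103,"
    "010DE980,103,010AE000,3605,0) 4.64ms CNR:AVX2 Dyn:1 FastMM:1 TID:0 NThr:1\n"
)

ratio = compute_time(LOG_64) / compute_time(LOG_32)
print(f"64-bit / 32-bit DGESVD time ratio: {ratio:.1f}")
```

On the 103x103 case above, this compares 112.26 ms against 4.64 ms, while the two workspace queries take about the same time in both builds.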
Yes, I see roughly the same performance problem when linking with the mkl_sequential library. The gap is about 15 times for these specific problem sizes.
32 bit : [ PERF --> ] 0.004 clock for 1 iteration
64 bit : [ PERF --> ] 0.062 clock for 1 iteration
The ratio is about 15 times,
but there is no problem when linking with the threaded version of MKL (2019.4).
If optimization for these specific problem sizes and the ia32 version of MKL is important to you, could you please submit a request to the Intel Online Service Center so that it can be followed up internally?
Ok, thanks. Here is the ticket : 04232883
Please note that this behaviour can be observed for many other matrix sizes: 160x160, 200x200, 302x302, ...
I have attached an Excel file comparing the 32-bit and 64-bit SVD durations for matrices extracted from real use cases in my production environment.
My point of view: there is a performance issue with the SVD in 64-bit with AVX2. We do not need any optimization for specific problem sizes. We just need performance in 64-bit to be as good as in 32-bit, whatever the size of the matrix ;)