Potential Issue with MKL 11 update 4 with SVD functions on 64-bit Windows

Murray_Shattuck · 06-18-2013

Recently, upon updating our MKL from version 10 to the latest MKL 11 update 4, we noted performance slowdowns in our application when compiled as a 64-bit application using Visual Studio 2012. Upon profiling the application, the slowdown seems to stem from the ZBDSQR call in the function below. This slowdown does NOT occur when using a 32-bit Release build using Visual Studio 2012.

Is this slowdown an existing defect in the MKL 11.0.4 release? Are there plans to address this issue?

Thanks in advance for your response.

------------------------------------------------------------
Sample Code Snippet where the problem was detected
------------------------------------------------------------
Note: in the test case being profiled, MA = NA = 60 and complex == MKL_Complex16.

void LaSVD(complex *A, complex *U, double *S, complex *VT, int MA, int NA)
{
    char UPLO=(MA>=NA ? 'U':'L'), VECTU='Q', VECTV='P';
    int NCC=0, LDA=MA, NRU=MA, LDU=MA, NCVT=NA, LDVT=NA, LDC=1, LWORK=16*(MA+NA), INFO;
    int NB=(MA>=NA ? NA:MA), SIZEU=MA*NA, SIZEVT=MA*NA, INCX=1, INCY=1, KU=NA, KV=MA;
    double *RWORK, *D, *E;
    complex *WORK, *TAUQ, *TAUP, *C=0;
    int charlen;

    RWORK=new double[4*NA];
    D=new double[MA+NA];
    E=new double[MA+NA];
    WORK=new complex[LWORK];
    TAUQ=new complex[MA+NA];
    TAUP=new complex[MA+NA];

    //zgebrd_(&MA, &NA, A, &LDA, D, E, TAUQ, TAUP, WORK, &LWORK, &INFO);
    //Reduces a general matrix to bidiagonal form.
    GetLAPack64()->ZGEBRD(&MA, &NA, A, &LDA, D, E, TAUQ, TAUP, WORK, &LWORK, &INFO);
    //if(INFO!=0) error_handler(NUMERICAL_ERROR, "ZGEBRD");

    //f2c_zcopy(&SIZEU, A, &INCX, U, &INCY);
    GetLAPackBlas()->ZCOPY(&SIZEU, A, &INCX, U, &INCY);
    charlen=1;

    //zungbr_(&VECTU, &MA, &NA, &KU, U, &LDA, TAUQ, WORK, &LWORK, &INFO);
    //Generates the complex unitary matrix Q or P^H determined by ?gebrd.
    GetLAPack64()->ZUNGBR(&VECTU, &MA, &NA, &KU, U, &LDA, TAUQ, WORK, &LWORK, &INFO);
    //if(INFO!=0) error_handler(NUMERICAL_ERROR, "ZUNGBR");

    //f2c_zcopy(&SIZEVT, A, &INCX, VT, &INCY);
    GetLAPackBlas()->ZCOPY(&SIZEVT, A, &INCX, VT, &INCY);

    //zungbr_(&VECTV, &NA, &NA, &KV, VT, &LDA, TAUP, WORK, &LWORK, &INFO);
    //Generates the complex unitary matrix Q or P^H determined by ?gebrd.
    GetLAPack64()->ZUNGBR(&VECTV, &NA, &NA, &KV, VT, &LDA, TAUP, WORK, &LWORK, &INFO);
    //if(INFO!=0) error_handler(NUMERICAL_ERROR, "ZUNGBR");

    //zbdsqr_(&UPLO, &NB, &NCVT, &NRU, &NCC, D, E, VT, &LDVT, U, &LDU, C, &LDC, RWORK, &INFO);
    //Computes the singular value decomposition of a general matrix that has been reduced to bidiagonal form.
    GetLAPack64()->ZBDSQR(&UPLO, &NB, &NCVT, &NRU, &NCC, D, E, VT, &LDVT, U, &LDU, C, &LDC, RWORK, &INFO);
    //if(INFO!=0) error_handler(NUMERICAL_ERROR, "ZBDSQR");

    charlen=NB;
    //f2c_dcopy(&charlen, D, &INCX, S, &INCY);
    GetLAPackBlas()->DCOPY(&charlen, D, &INCX, S, &INCY);

    delete[] TAUP;
    delete[] TAUQ;
    delete[] WORK;
    delete[] E;
    delete[] D;
    delete[] RWORK;
}

A similar slowdown was detected when using the ZGESVD function, with the same usage as described above.
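For readers who want to confirm which stage dominates without a profiler, a small per-stage timer can wrap each call in LaSVD. The following is only a sketch: `StageTimer` is a hypothetical helper that is not part of MKL or of the code in this thread, and it measures wall-clock time with `std::chrono::steady_clock`.

```cpp
#include <chrono>
#include <map>
#include <string>

// Hypothetical helper: accumulates wall-clock seconds per labelled stage.
class StageTimer {
public:
    // Runs `stage` once and adds its elapsed time to the total for `label`.
    // Example (inside LaSVD, assumed usage):
    //   timer.run("ZBDSQR", [&] { GetLAPack64()->ZBDSQR(/* same arguments */); });
    template <typename F>
    void run(const std::string& label, F&& stage) {
        const auto start = std::chrono::steady_clock::now();
        stage();
        const auto stop = std::chrono::steady_clock::now();
        totals_[label] += std::chrono::duration<double>(stop - start).count();
        ++calls_[label];
    }

    // Total seconds recorded for `label` (0 if the label was never used).
    double seconds(const std::string& label) const {
        const auto it = totals_.find(label);
        return it == totals_.end() ? 0.0 : it->second;
    }

    // Number of times `label` was run (0 if never).
    int calls(const std::string& label) const {
        const auto it = calls_.find(label);
        return it == calls_.end() ? 0 : it->second;
    }

private:
    std::map<std::string, double> totals_;
    std::map<std::string, int> calls_;
};
```

Comparing `seconds("ZGEBRD")`, `seconds("ZUNGBR")` and `seconds("ZBDSQR")` between the 32-bit and 64-bit builds would show whether the regression is confined to one routine.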

Yuan_L_ · 07-25-2013

Hi Ying,

Thanks for the update. We are glad that it is solved in the new update.

Now, about another issue: sequential MKL vs. threaded MKL running on one thread. I searched online but did not find many benchmarks comparing the two. Instead, I found this on your website:

http://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications

where it says: "This case (MKL_NUM_THREADS = 1) is equivalent to linking with sequential MKL, that is, disable threading in MKL or linking with the threaded version on MKL but call mkl_set_num_thread( 1 )". So, purely in terms of performance, are the two similar (equivalent)?
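As a rough illustration of the second option in the quoted article, the sketch below forces threaded MKL onto a single thread at run time. `mkl_set_num_threads` and `mkl_get_max_threads` are MKL service functions declared in `mkl.h`; the `USE_MKL` macro and the function name `force_single_threaded_mkl` are assumptions made here so that the snippet also builds without MKL installed.

```cpp
#ifdef USE_MKL   // hypothetical build flag: define it when compiling and linking against MKL
#include <mkl.h>
#endif

// Hypothetical helper: asks threaded MKL to use one thread and reports
// the thread count MKL will use afterwards.
// Without USE_MKL there is nothing to configure, so it simply reports 1.
int force_single_threaded_mkl() {
#ifdef USE_MKL
    mkl_set_num_threads(1);        // same intent as setting MKL_NUM_THREADS=1
    return mkl_get_max_threads();  // expected to report 1 after the call above
#else
    return 1;
#endif
}
```

With the sequential library linked instead, no such call is needed, which is the equivalence the article describes.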

On the other hand, people do mention some difference in timing, but not how big the difference is or whether it justifies maintaining another library:

http://stackoverflow.com/questions/17563552/how-to-use-simultaneous-of-parallel-and-serial-version-of-mkl

Do you have an idea of how big the performance improvement would be, or are any benchmark results available?

Thanks.

Ying_H_Intel · 08-07-2013

Hi Yuan,

Theoretically speaking, yes: MKL_NUM_THREADS=1 has the same effect as sequential MKL. But as discussed on Stack Overflow, once you use the parallel MKL, whatever the value of MKL_NUM_THREADS, the OpenMP threading runtime library is needed, and its thread manager consumes CPU resources (you can observe the threads in Task Manager: one more thread is created when a new OpenMP thread starts). So it has some effect. But generally the effect is very small; for example, you might see 3.84 s vs. 3.8 s over 1000 loops. So we usually ignore the difference.
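To put the example figures in perspective, the arithmetic can be sketched as below. The helper names `relative_gap` and `per_call_gap` are hypothetical; only the numbers 3.84, 3.8 and 1000 come from the post. With those inputs the gap works out to roughly 1% of the total, or about 40 microseconds per loop iteration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Gap between two total run times, as a fraction of the faster one.
double relative_gap(double a_seconds, double b_seconds) {
    assert(a_seconds > 0.0 && b_seconds > 0.0);
    return std::fabs(a_seconds - b_seconds) / std::min(a_seconds, b_seconds);
}

// Average gap per loop iteration, in seconds.
double per_call_gap(double a_seconds, double b_seconds, int loops) {
    assert(loops > 0);
    return std::fabs(a_seconds - b_seconds) / loops;
}
```

For example, `relative_gap(3.84, 3.8)` is about 0.0105 and `per_call_gap(3.84, 3.8, 1000)` is about 0.00004 seconds.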

On the other hand, in a mixed multi-threaded environment (for example, high-level Windows threads or pthreads, with OpenMP threads inside each sub-thread), there are many kinds of issues (you can search the forum). So if you have high-level threads and are sure MKL will be used on one thread within each of them, the sequential library should be more suitable.
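The pattern described here, application-level threads that each call only single-threaded numerical code, might look like the following sketch. `sequential_kernel` is a hypothetical stand-in for a call into sequential MKL (it just sums squares), and `run_jobs` is likewise an illustrative name; neither comes from the thread.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a sequential library call: a single-threaded
// sum of squares. In a real application this would be an MKL routine
// from the sequential library.
double sequential_kernel(const std::vector<double>& x) {
    double sum = 0.0;
    for (double v : x) sum += v * v;
    return sum;
}

// Starts one application-level thread per job. Each thread calls only
// single-threaded code, so no nested (OpenMP) threads are created.
std::vector<double> run_jobs(const std::vector<std::vector<double>>& jobs) {
    std::vector<double> results(jobs.size(), 0.0);
    std::vector<std::thread> workers;
    workers.reserve(jobs.size());
    for (std::size_t i = 0; i < jobs.size(); ++i) {
        // Each worker writes to its own slot, so no locking is needed.
        workers.emplace_back([&jobs, &results, i] {
            results[i] = sequential_kernel(jobs[i]);
        });
    }
    for (auto& w : workers) w.join();
    return results;
}
```

Because the threading is owned entirely by the application, the total thread count stays equal to the number of jobs, which avoids the oversubscription that mixing high-level threads with a threaded library can cause.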

Best Regards,
Ying