Murray_Shattuck · ‎06-18-2013

Recently, after updating our MKL from version 10 to the latest MKL 11 update 4, we noticed performance slowdowns in our application when it is built as a 64-bit application using Visual Studio 2012. Profiling the application shows that the slowdown seems to stem from the ZBDSQR call in the function below. The slowdown does NOT occur in a 32-bit Release build using Visual Studio 2012. Is this slowdown an existing defect in the MKL 11.0.4 release? Are there plans to address this issue? Thanks in advance for your response.

-------------------------------------------------------------------
Sample code snippet where the problem was detected
-------------------------------------------------------------------

Note: in the test case being profiled, MA = NA = 60 and complex == MKL_Complex16.

void LaSVD(complex *A, complex *U, double *S, complex *VT, int MA, int NA)
{
    char UPLO=(MA>=NA ? 'U':'L'), VECTU='Q', VECTV='P';
    int NCC=0, LDA=MA, NRU=MA, LDU=MA, NCVT=NA, LDVT=NA, LDC=1, LWORK=16*(MA+NA), INFO;
    int NB=(MA>=NA ? NA:MA), SIZEU=MA*NA, SIZEVT=MA*NA, INCX=1, INCY=1, KU=NA, KV=MA;
    double *RWORK, *D, *E;
    complex *WORK, *TAUQ, *TAUP, *C=0;
    int charlen;

    RWORK=new double[4*NA];
    D=new double[MA+NA];
    E=new double[MA+NA];
    WORK=new complex[LWORK];
    TAUQ=new complex[MA+NA];
    TAUP=new complex[MA+NA];

    //zgebrd_(&MA, &NA, A, &LDA, D, E, TAUQ, TAUP, WORK, &LWORK, &INFO);
    //Reduces a general matrix to bidiagonal form.
    GetLAPack64()->ZGEBRD(&MA, &NA, A, &LDA, D, E, TAUQ, TAUP, WORK, &LWORK, &INFO);
    //if(INFO!=0) error_handler(NUMERICAL_ERROR, "ZGEBRD");

    //f2c_zcopy(&SIZEU, A, &INCX, U, &INCY);
    GetLAPackBlas()->ZCOPY(&SIZEU, A, &INCX, U, &INCY);
    charlen=1;
    //zungbr_(&VECTU, &MA, &NA, &KU, U, &LDA, TAUQ, WORK, &LWORK, &INFO);
    //Generates the complex unitary matrix Q or P^H determined by ?gebrd.
    GetLAPack64()->ZUNGBR(&VECTU, &MA, &NA, &KU, U, &LDA, TAUQ, WORK, &LWORK, &INFO);
    //if(INFO!=0) error_handler(NUMERICAL_ERROR, "ZUNGBR");

    //f2c_zcopy(&SIZEVT, A, &INCX, VT, &INCY);
    GetLAPackBlas()->ZCOPY(&SIZEVT, A, &INCX, VT, &INCY);
    //zungbr_(&VECTV, &NA, &NA, &KV, VT, &LDA, TAUP, WORK, &LWORK, &INFO);
    //Generates the complex unitary matrix Q or P^H determined by ?gebrd.
    GetLAPack64()->ZUNGBR(&VECTV, &NA, &NA, &KV, VT, &LDA, TAUP, WORK, &LWORK, &INFO);
    //if(INFO!=0) error_handler(NUMERICAL_ERROR, "ZUNGBR");

    //zbdsqr_(&UPLO, &NB, &NCVT, &NRU, &NCC, D, E, VT, &LDVT, U, &LDU, C, &LDC, RWORK, &INFO);
    //Computes the singular value decomposition of a general matrix that has been reduced to bidiagonal form.
    GetLAPack64()->ZBDSQR(&UPLO, &NB, &NCVT, &NRU, &NCC, D, E, VT, &LDVT, U, &LDU, C, &LDC, RWORK, &INFO);
    //if(INFO!=0) error_handler(NUMERICAL_ERROR, "ZBDSQR");

    charlen=NB;
    //f2c_dcopy(&charlen, D, &INCX, S, &INCY);
    GetLAPackBlas()->DCOPY(&charlen, D, &INCX, S, &INCY);

    delete[] TAUP;
    delete[] TAUQ;
    delete[] WORK;
    delete[] E;
    delete[] D;
    delete[] RWORK;
}

A similar slowdown was detected when using the ZGESVD function, with the same usage as described above.

Ying_H_Intel · ‎06-19-2013

Hi Murray,

I checked the MKL release notes (C:\Program Files (x86)\Intel\Composer XE 2013\Documentation\en_US\mkl\Release_Notes.htm) and the MKL bug-fix list (http://software.intel.com/en-us/articles/intel-mkl-110-bug-fixes/). Neither seems to list a directly matching case.

Could you please provide some more detailed information, such as:

1. Exact performance numbers for the slowdown, MKL 10 vs. 11.0 update 4.

2. How do you link MKL: dynamic or static, threaded or sequential?

3. What is the processor?

A workable test case would also be helpful.

Best Regards,

Ying

SergeyKostrov · ‎06-19-2013

Murray, did you see how the source code of your test case was reformatted? It is absolutely not readable.

SergeyKostrov · ‎06-20-2013

Hi Murray, please attach a cpp-file with the source code. Thanks in advance. Note: please don't forget to press the Start upload button before submitting your new post.

Murray_Shattuck · ‎06-20-2013

Ying H (Intel) wrote:
1. Exact performance numbers for the slowdown, MKL 10 vs. 11.0 update 4.

Example for ZGESVD (63% slower; I will update on the other functions at a later date):

MKL 11.0.4 ZGESVD: Total Time 15.47 sec, # Calls 8950, Ave Call Time 1.73 msec, Min Call Time 0.89 msec, Max Call Time 31.73 msec

MKL 10 ZGESVD: Total Time 9.47 sec, # Calls 8950, Ave Call Time 1.06 msec, Min Call Time 0.77 msec, Max Call Time 12.06 msec
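For readers who want to check the figures above, the following is a minimal sketch of the arithmetic behind the "63% slower" claim and the average call times. The function names (average_call_ms, slowdown_percent) are hypothetical helpers written for this illustration; they are not part of MKL or of the application discussed here.

```cpp
#include <cassert>

// Average time per call in milliseconds, given a total in seconds.
// Hypothetical helper for illustration only.
double average_call_ms(double total_seconds, long calls) {
    assert(calls > 0);
    return total_seconds * 1000.0 / static_cast<double>(calls);
}

// Relative slowdown of the new total versus the old total, in percent.
// A positive result means the new version is slower.
double slowdown_percent(double old_total_seconds, double new_total_seconds) {
    assert(old_total_seconds > 0.0);
    return (new_total_seconds / old_total_seconds - 1.0) * 100.0;
}
```

Calling slowdown_percent(9.47, 15.47) gives roughly 63%, and average_call_ms(15.47, 8950) and average_call_ms(9.47, 8950) give about 1.73 and 1.06 msec, which agrees with the numbers reported above.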

Ying H (Intel) wrote:
2. How do you link MKL: dynamic or static, threaded or sequential?

a) We link dynamically to MKL.

b) Our final application is threaded at a higher level than the MKL function calls in this instance, so each of these MKL calls should be using only one thread. The timing data above was collected with single threading throughout: the calling application ran on one thread and the number of MKL threads was set to 1.

Ying H (Intel) wrote:
3. What is the processor?

Intel Core i7-3770K CPU @ 3.5GHz, 16GB RAM, Windows 7 Professional, Service Pack 1 64-bit

Ying H (Intel) wrote:
A workable test case would also be helpful.

I would like to create a dynamically linked console test project for you that reads the input data from a binary file. Unfortunately, our application keeps the MKL libraries in unusual directory locations and is coupled to many other libraries. If you have an example dynamically linked console project set up in Visual Studio 2012 to test an MKL function, this would expedite the process significantly. I will collect the data for the binary file directly from our composite application. Please advise in this regard.

Thanks,

Murray

P.S. I will be on vacation for the next ~7 days. I will try to have a colleague at my company respond in my absence.

SergeyKostrov · ‎06-20-2013

>>...If you have an example dynamically linked console project set up in Visual Studio 2012 to test an MKL function, this would expedite the process significantly...

Consider a simple, universal console-based test project (don't assume that everybody has VS 2012), because it could then be compiled with any version of the Intel or Microsoft C++ compilers. For example, my primary versions are VS 2005 and 2008, and it takes just a few tens of seconds to compile a small test project from the command line without the overhead of VS projects.

SergeyKostrov · ‎06-20-2013

>>...Example for ZGESVD ( 63% slower )...

I could verify it with MKL versions 10 and 11 on at least three platforms if you upload a test case. Also, a 63% performance decrease doesn't look good, and I am concerned that the wrong CPU dispatching DLLs of MKL were used. For example, on a Platform A with a non-AVX CPU, an SSE4 CPU dispatching DLL was used, while on a Platform B with an AVX-capable CPU, an SSE CPU dispatching DLL was used instead of the AVX CPU dispatching DLL.

Ying_H_Intel · ‎06-21-2013

Hi Murray, Sergey,

I created a zgesvd sample (A is 60x60) in MSVS 2012 and uploaded it here. Could you please test it and let us know the result?

Best Regards,

Ying

SergeyKostrov · ‎06-21-2013

Hi Ying,

>>...I created a zgesvd sample (A is 60x60) in MSVS 2012 and uploaded it here. Could you please test it and let us know the result?

Yes, and I'll let you know the test results for two MKL versions.

Ying_H_Intel · ‎07-09-2013

Hi Murray,

A new update was released recently. If possible, could you please try it?

There is a known problem regarding SVD that is fixed in that version; see http://software.intel.com/en-us/forums/topic/401167.

Best Regards

Ying

Murray_Shattuck · ‎07-10-2013

Hi Ying,

Thanks for the update and your support. We did notice the latest release with the mention of the SVD in the notes and we are in the process of testing this to see if this resolves the problem.

As for an update on our end, we created some temporary code in our custom application (Final.exe) to capture a sequence of data that exhibits the problem. We then created a standalone test class (K_SVD_Test) that loads this file, performs the SVD operations and logs the time. Next we created a standalone console program (SVD_Console.exe) that links to the MKL functions through an abstract interface implemented in a separate DLL (VLAL_MKL.dll). This way of linking to MKL lets the majority of our developers avoid building any of the MKL/VLAL dependencies; they just connect to it via this custom DLL.

Upon running this console program (SVD_Console.exe), we found that we could NOT reproduce the slowdown we see in our custom application. Thus something else going on in our custom application (Final.exe) was affecting the performance. After about a week of work, we were able to isolate another usage of MKL within our custom application (Final.exe) that appeared to trigger the undesired behavior. This second series of calls to MKL was also using the SVD functions (although a bit differently). A second test class, K_SVD_Test2, was created along with its input data saved to a file. This test class was then used in the console program (SVD_Console.exe), and we were finally able to reproduce the problem.

Essentially the console program (SVD_Console.exe) runs SVD_Test, then SVD_Test2, and finally SVD_Test again. The timing of SVD_Test before and after running SVD_Test2 shows the previously described slowdown when using MKL 11.0.4.1. MKL 10.3.10.1 does not have this problem.
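The run order described above (baseline test, interfering test, baseline test again) can be sketched as a small timing harness. This is only an illustration under stated assumptions: CallStats, time_calls, run_before_after and print_stats are hypothetical names, and the workloads are passed in as plain callables rather than the actual K_SVD_Test classes, whose code was not posted.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdio>
#include <functional>
#include <limits>

// Per-call timing statistics in milliseconds (hypothetical helper type).
struct CallStats {
    double total_ms = 0.0;
    double min_ms = std::numeric_limits<double>::max();
    double max_ms = 0.0;
    long count = 0;
};

// Invoke `work` `calls` times and accumulate total, min and max call time.
CallStats time_calls(const std::function<void()>& work, long calls) {
    assert(calls > 0);
    using clock = std::chrono::steady_clock;
    CallStats s;
    for (long i = 0; i < calls; ++i) {
        const auto t0 = clock::now();
        work();
        const auto t1 = clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(t1 - t0).count();
        s.total_ms += ms;
        s.min_ms = std::min(s.min_ms, ms);
        s.max_ms = std::max(s.max_ms, ms);
        ++s.count;
    }
    return s;
}

struct BeforeAfter {
    CallStats before;
    CallStats after;
};

// Time the baseline workload, run the interfering workload, then time the
// baseline workload again so the two baseline runs can be compared.
BeforeAfter run_before_after(const std::function<void()>& baseline,
                             long baseline_calls,
                             const std::function<void()>& interfering,
                             long interfering_calls) {
    BeforeAfter r;
    r.before = time_calls(baseline, baseline_calls);
    time_calls(interfering, interfering_calls);
    r.after = time_calls(baseline, baseline_calls);
    return r;
}

// Print one line of statistics for a labeled run.
void print_stats(const char* label, const CallStats& s) {
    std::printf("%s time (ms) =%.0f max (ms) =%.0f min (ms) =%.0f count =%ld\n",
                label, s.total_ms, s.max_ms, s.min_ms, s.count);
}
```

In a real test, `baseline` would wrap one SVD call on the captured 60x60 input and `interfering` would wrap the second series of SVD calls; a large gap between `before.total_ms` and `after.total_ms` would then correspond to the behavior described above.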

As I was unable to obtain permission to post our custom linking to MKL ( VLAL_MKL.dll) on a public forum, I am unable to present (SVD_Console.exe and its associated source code) to you for verification on your end.

Presently, we are attempting to create a second console application (SVD_Console_Standalone.exe) which links directly with the MKL libraries and thus eliminates the custom method of connecting to MKL (VLAL_MKL.dll) mentioned above. We hope to have this new console program working in the next couple of days, and then we can send you an example (including all source code and input data files) that you could run on your machines. This last step would also eliminate any possible errors caused by our custom access to the MKL functions via VLAL_MKL.dll.

In parallel with this effort, we are also attempting to rebuild our custom application using the latest update of MKL (v11.0.5). Unfortunately, this effort is not under my direct control, so I cannot guarantee a quick response on this evaluation.

Best regards,

Murray

Ying_H_Intel · ‎07-10-2013

Hi Murray,

Thank you very much for the details. You can upload the sample and DLL via premier.intel.com, which is another official support channel; the code and communication there are IP protected.

Best Regards,

Ying

Yuan_L_ · ‎07-16-2013

Hi, Ying

We created this console test engine by removing all the dependencies on our environment. It contains two sets of test matrices (binary) under the \mtx directory.

After compiling MKLconsole.sln, you will need to copy the MKL DLLs to the \x64\release or \x64\debug directory. Mklconsole.exe will run LaSVD, GESVD and GESDD on the newLaSVD_Inputbak dataset, then run GESVD on the RSVD_Inputbak dataset, and then re-run LaSVD, GESVD and GESDD on the newLaSVD_Inputbak dataset. You will see the ~50% performance slowdown with MKL 11.0.4.

Please let me know if you have any questions/comments.

Rgds

Yuan Liu

P.S. The DLLs needed are:

libimalloc.dll
libiomp5md.dll
mkl_avx.dll
mkl_avx2.dll
mkl_cdft_core.dll
mkl_core.dll
mkl_intel_thread.dll
mkl_mc.dll
mkl_mc3.dll
mkl_p4n.dll
mkl_vml_avx.dll
mkl_vml_avx2.dll
mkl_vml_cmpt.dll
mkl_vml_mc.dll
mkl_vml_mc2.dll
mkl_vml_mc3.dll
mkl_vml_p4n.dll

Ying_H_Intel · ‎07-16-2013

Hi Yuan,

It is a nice test case, and I can run it now. But I am not sure what your exact MKL 10 version is. Could you please add the MKL version information and let me know the result?

Another question: you test twice, before_SVR and after_SVR. What is the purpose? Also, as you have high-level threading, the threaded MKL is not needed. How about linking mkl_sequential_dll.lib directly?

Best Regards,

Ying

// This will be called by the main function when testing SVD.
void Run_SVD_Tests()
{
MKLVersion Version;
mkl_get_version(&Version);
printf("Major version: %d\n",Version.MajorVersion);
printf("Minor version: %d\n",Version.MinorVersion);
printf("Update version: %d\n",Version.UpdateVersion);
printf("Product status: %s\n",Version.ProductStatus);
printf("Build: %s\n",Version.Build);
printf("Platform: %s\n",Version.Platform);
printf("Processor optimization: %s\n",Version.Processor);
printf("================================================================\n");
printf("\n");

SergeyKostrov · ‎07-17-2013

>>...we are in the process of testing this to see if this resolves the problem...

Here are test results for your review:

[ Test 1 - Intel Pentium 4 ( 1.60 GHz ) ]

Command line to build the test case: icl.exe /O3 /Qmkl /MD lapacke_zgesvd_col.cpp
...
LAPACKE_zgesvd (column-major, high-level) Example Program Results
Major version: 10
Minor version: 3
Product status: Product
Build: 20120831
Processor optimization: Intel(R) Pentium(R) 4 processor
================================================================
SVD takes : 21.252930 seconds
...

[ Test 2 - Intel Core i7-3840QM ( 2.80 GHz ) ]

Command line to build the test case: The same executable compiled for Test 1 was tested
...
LAPACKE_zgesvd (column-major, high-level) Example Program Results
Major version: 11
Minor version: 0
Product status: Product
Build: 20130123
Processor optimization: Intel(R) Advanced Vector Extensions (Intel(R) AVX) Enabled Processor
================================================================
SVD takes : 3.842102 seconds
...

[ Test 3 - Intel Core i7-3840QM ( 2.80 GHz ) ]

Command line to build the test case: icl.exe /O3 /Qmkl /MD lapacke_zgesvd_col.cpp
...
LAPACKE_zgesvd (column-major, high-level) Example Program Results
Major version: 11
Minor version: 0
Product status: Product
Build: 20130123
Processor optimization: Intel(R) Advanced Vector Extensions (Intel(R) AVX) Enabled Processor
================================================================
SVD takes : 3.519592 seconds
...

Note: I don't have MKL version 10 on my Ivy Bridge system.

SergeyKostrov · ‎07-17-2013

>>...Note: I don't have MKL version 10 on my Ivy Bridge system.

I just realized that I will be able to test MKL version 10 on the Ivy Bridge system, and I'll post results as soon as the test is completed.

Murray_Shattuck · ‎07-17-2013

Hi Ying,

The previous version of MKL that we were using was v10.3.10.1. We then upgraded to v11.0.4.1 and noticed the degradation.

We were able to test the latest incremental release, v11.0.5.1, and the problem appears to be resolved there. Note that while the problem reported in http://software.intel.com/en-us/forums/topic/401167 involved the same SVD functions, its description was quite different: in our case the slowdown was noticed when using only one thread with MKL, while the case reported there involves multiple threads.

Just our luck that the problem was resolved right when we were finally able to produce a repeatable test case for you.

I would like to spend a bit more time working with the console program to ensure that you are seeing the same problems that we are seeing on our end. Additional suggestions for the console test app would be welcome.

Thanks for your support!

Murray

Yuan_L_ · ‎07-17-2013

Hi, Ying

We intentionally set the thread count to 1 just to remove one source of uncertainty. In reality, we run SVD on thousands or hundreds of thousands of small matrices, and I guess we do need the multi-threaded version of MKL. Please correct me if I am wrong.

Anyway, as you can see from the code, we run SVD on the same test data twice (before and after); the test data are 8950 60-by-60 square complex matrices. Between the two runs, we run SVD on 15 rectangular matrices of different dimensions (R_SVD).

Here is the result I got. You can see that the performance of lasvd and gesvd slows down ~50% in MKL 11.0.4, whereas that of the gesdd function is fairly consistent.

>>>>>>>>>>>>>>>>>>>>>>+++++++++++++++++++++++++++++++++++++++++++++++
Major version: 11
Minor version: 0
Update version: 4
Product status: Product
Build: 20130517
Platform: Intel(R) 64 architecture
Processor optimization: Intel(R) Core(TM) i7 Processor
================================================================
before R_SVD
lasvd time (ms) =17540 max (ms) =3 min (ms) =1 count =8950
gesvd time (ms) =18176 max (ms) =5 min (ms) =1 count =8950
gesdd time (ms) =14749 max (ms) =3 min (ms) =1 count =8950
R_SVD messing up around ~!~!~!
gesvd time (ms) =0 max (ms) =0 min (ms) =0 count =15
After R_SVD
lasvd time (ms) =23096 max (ms) =3 min (ms) =2 count =8950
gesvd time (ms) =26358 max (ms) =7 min (ms) =2 count =8950
gesdd time (ms) =15165 max (ms) =4 min (ms) =1 count =8950
+++++++++++++++++++++++++++++++++++<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

FYI, the result for 11.0.5 is much more consistent.

>>>>>>>>>>>>>>>>>>>>>>+++++++++++++++++++++++++++++++++++++++++++++++
Major version: 11
Minor version: 0
Update version: 5
Product status: Product
Build: 20130612
Platform: Intel(R) 64 architecture
Processor optimization: Intel(R) Core(TM) i7 Processor
================================================================
before R_SVD
lasvd time (ms) =17682 max (ms) =4 min (ms) =1 count =8950
gesvd time (ms) =17866 max (ms) =3 min (ms) =1 count =8950
gesdd time (ms) =15304 max (ms) =639 min (ms) =1 count =8950
R_SVD messing up around ~!~!~!
gesvd time (ms) =1 max (ms) =1 min (ms) =0 count =15
After R_SVD
lasvd time (ms) =17434 max (ms) =4 min (ms) =1 count =8950
gesvd time (ms) =18072 max (ms) =4 min (ms) =1 count =8950
gesdd time (ms) =14829 max (ms) =4 min (ms) =1 count =8950
+++++++++++++++++++++++++++++++++++<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
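To compare the "before" and "after" totals in logs like the ones above without reading them by eye, the timing lines can be parsed and the relative change computed. The sketch below assumes the line layout shown in these logs; LogEntry, parse_log_line and percent_change are hypothetical names introduced for this illustration and are not part of the posted console program.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// One parsed timing line, e.g.
// "gesvd time (ms) =18176 max (ms) =5 min (ms) =1 count =8950".
struct LogEntry {
    std::string name;
    long total_ms = 0;
    long max_ms = 0;
    long min_ms = 0;
    long count = 0;
};

// Parse a timing line in the layout shown in the logs. Returns false (and
// leaves `out` untouched) for any line that does not match that layout.
bool parse_log_line(const char* line, LogEntry& out) {
    assert(line != nullptr);
    char name[16] = {0};
    long total = 0, max_v = 0, min_v = 0, count = 0;
    const int fields = std::sscanf(
        line, "%15s time (ms) =%ld max (ms) =%ld min (ms) =%ld count =%ld",
        name, &total, &max_v, &min_v, &count);
    if (fields != 5) {
        return false;
    }
    out.name = name;
    out.total_ms = total;
    out.max_ms = max_v;
    out.min_ms = min_v;
    out.count = count;
    return true;
}

// Relative change from `before_ms` to `after_ms`, in percent.
double percent_change(long before_ms, long after_ms) {
    assert(before_ms > 0);
    return 100.0 * static_cast<double>(after_ms - before_ms) /
           static_cast<double>(before_ms);
}
```

Applied to the 11.0.4 totals above, this gives an increase of about 32% for lasvd (17540 to 23096 ms), about 45% for gesvd (18176 to 26358 ms) and about 3% for gesdd (14749 to 15165 ms), while the 11.0.5 totals change by only a few percent in either direction.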

Murray_Shattuck · ‎07-17-2013

Ying,

Quote from Ying H (Intel) Tue, 07/16/2013 - 20:24

"Another question: you test twice, before_SVR and after_SVR. What is the purpose?"

This was our major difficulty in reproducing the problem in a console application. In our first attempt we only performed the before_SVR tests. When testing in this manner, we did not see the slowdown between the MKL versions. Yuan spent quite a bit of time isolating yet another usage of the SVD functions within our final application, which caused the subsequent after_SVR tests to become significantly slower when using MKL v11.0.4. This slowdown between the times before and after SVR did not show up when using MKL v10.3.10. It seems quite strange to us that performing the same series of tests within one application run results in such drastic changes in the time required to perform these tests.

We speculate that running the SVR tests somehow puts the new version of MKL (v11.0.4) into a bad state, which subsequently slows down the second set of tests compared to the first run of these tests.

Hope this helps,

Murray

Ying_H_Intel · ‎07-18-2013

Hi Yuan, Murray,

Thank you very much for the clarification. The two issues seem related. The one in DPD20033524 is caused by an error in the code that splits the workload across OpenMP threads. As for the one here, as you speculate, the first run appears to change the OpenMP thread state for the second run. I will check with our engineers and get back to you if there is any news.

@Yuan, I am not sure of your exact usage model. If the thousands of matrices are small and the MKL thread count is set to 1, then sequential MKL should be more suitable, because at least it saves the time spent managing OpenMP threads. Alternatively, you may need the threaded version to gain performance by setting the MKL thread count to 2, or for other functions.

Best Regards,

Ying

Ying_H_Intel · ‎07-23-2013

Hi Yuan, Murray,

Here are our engineer's comments regarding the performance drop on the second run.

This seems to correlate with the issue reported on the forum (http://software.intel.com/en-us/forums/topic/373673), which was fixed in Update 5. Even though the symptoms are different, there was a problem with convergence in one of the internal sub-algorithms for SVD, exposed in the MKL 11.0.x line and fixed in MKL 11.0.5. That problem could lead to extra internal iterations (a performance drop) or to an error report (convergence not reached), depending on the input matrix.

Best Regards,

Ying

Potential Issue with MKL 11 update 4 with SVD functions on 64-bit windows