I tried 11.3 Update 2 and its performance is almost the same as 11.0.3. My project already does its own threading at the top level, so when cblas_zgemm is called I want it limited to 1 thread. I put mkl_set_num_threads(1) before the cblas_zgemm call, but it does not improve the speed.
From the release notes, I noticed that cblas_zgemm is optimized for different flavors such as native execution, automatic offload, etc. I want to know how to call cblas_zgemm in a particular flavor. Which one matches the behavior in MKL 10.3.6? I would like to use that same one in MKL 11.0.3 for a speed comparison.
Could you give more details about the performance difference between 11.3 and 10.3.6?
What is the problem size, and what CPU are you working on?
Calling mkl_set_num_threads(1) before the cblas_zgemm call should make zgemm use only a single thread. You can check this with MKL verbose mode (either the mkl_verbose() function or the MKL_VERBOSE environment variable). This option is available in MKL 11.3 but does not exist in 10.3.
One more question: have you compared the performance with 1 thread?
The matrix size in my project is not big, about 400 x 400, but there are many matrix multiplications and inversions inside several levels of loops. I use our own profiler to record the time before and after each matrix multiplication. It consistently shows that 11.3 is around 10% slower than 10.3. Our station uses Intel X5550 CPUs.
We usually use VS2010 with Intel XE 2011 Update 3. We recently updated to VS2013 with Intel XE 2013 SP1 Update 3, and that is when I ran into this slowdown. Is it possible that the optimization of cblas_zgemm etc. in MKL 11.3 causes this slowdown on old CPUs like the Intel X5550?
That's strange, because the same code branch (SSE4.2) executes in both of these versions, so I think this 10% looks like a problem with measuring the execution time. Could you please add a call to the mkl_get_version(MKLVersion* pVersion) function to your code?
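To rule out measurement noise, one option is to time the same call several times after a warm-up run and keep the minimum. The sketch below is plain C and does not use MKL. The names kernel_fn, gemm_ctx, naive_zgemm and min_time_seconds are all made up for illustration, and naive_zgemm is only a stand-in for the real cblas_zgemm call.

```c
#include <complex.h>
#include <time.h>

/* Any routine to be timed, taking an opaque context pointer. */
typedef void (*kernel_fn)(void *ctx);

/* Context for the stand-in kernel: n x n row-major complex matrices. */
typedef struct {
    int n;
    const double complex *a;
    const double complex *b;
    double complex *c;
} gemm_ctx;

/* Stand-in for cblas_zgemm: naive C = A * B. Replace with the real call. */
void naive_zgemm(void *p)
{
    gemm_ctx *g = (gemm_ctx *)p;
    int i, j, k;
    for (i = 0; i < g->n; ++i) {
        for (j = 0; j < g->n; ++j) {
            double complex sum = 0.0;
            for (k = 0; k < g->n; ++k)
                sum += g->a[i * g->n + k] * g->b[k * g->n + j];
            g->c[i * g->n + j] = sum;
        }
    }
}

/* Runs the kernel `warmup` times untimed, then `repeats` times timed,
   and returns the smallest time in seconds (-1.0 if repeats <= 0).
   Note: clock() reports CPU time on POSIX systems and wall-clock time
   in the Microsoft C runtime. */
double min_time_seconds(kernel_fn kernel, void *ctx, int warmup, int repeats)
{
    double best = -1.0;
    int i;
    for (i = 0; i < warmup; ++i)
        kernel(ctx);
    for (i = 0; i < repeats; ++i) {
        clock_t t0 = clock();
        kernel(ctx);
        double dt = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (best < 0.0 || dt < best)
            best = dt;
    }
    return best;
}
```

In the real test the kernel would wrap the cblas_zgemm call on 400 x 400 matrices, built once against each MKL version, with the fields filled in by mkl_get_version printed next to the timing so it is clear which library each number came from.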