<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Hi Marko, in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Small-matrix-speed-optimization/m-p/1126554#M25285</link>
    <description>&lt;P&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;Hi Marko, &lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;It is nice topic for discuss here.&amp;nbsp;&amp;nbsp; If possible, could you attach your test code, especially zgetri, zgetrf code, so we can test at same basic line? &lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;Some comments about your questions&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;1. the matrix size 10x10 to 50x50. &lt;/SPAN&gt;&lt;SPAN lang="ZH-CN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Microsoft YaHei&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN; mso-bidi-font-family: &amp;quot;Microsoft YaHei&amp;quot;;"&gt;（&lt;/SPAN&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;max 200x200).&amp;nbsp; right, the matrix size seem&amp;nbsp;too small to optimize by any ways.&amp;nbsp; &lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;But as you tried,&amp;nbsp; Here&amp;nbsp; is&amp;nbsp;some tips to improve the performance or reduce overhead of MKL&amp;nbsp;way , &lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;1.1&amp;nbsp;may you please &amp;nbsp;try -DMKL_DIRECT_CALL_SEQ&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;&lt;A href="https://software.intel.com/en-us/articles/improve-intel-mkl-performance-for-small-problems-the-use-of-mkl-direct-call"&gt;https://software.intel.com/en-us/articles/improve-intel-mkl-performance-for-small-problems-the-use-of-mkl-direct-call&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;&lt;A href="https://software.intel.com/en-us/mkl-linux-developer-guide-limitations-of-the-direct-call"&gt;https://software.intel.com/en-us/mkl-linux-developer-guide-limitations-of-the-direct-call&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;1. 2. what kind of CPU are you using? &amp;nbsp;try sequential MKL or&amp;nbsp;set OMP_NUM_THREADS=1,2, 4 etc&amp;nbsp; and find the best performance. (as for small workload, it may not best solution to set max threads. &lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;1.3 other tips, like array alignments etc.&amp;nbsp; &lt;A href="https://software.intel.com/en-us/mkl-linux-developer-guide-other-tips-and-techniques-to-improve-performance"&gt;https://software.intel.com/en-us/mkl-linux-developer-guide-other-tips-and-techniques-to-improve-performance&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;2. Regarding the exe size.&amp;nbsp;How do you link mkl in your code?&amp;nbsp;&amp;nbsp;Maybe&amp;nbsp;the later version add more CPU-specific optimized code thus bigger size.&amp;nbsp; You&amp;nbsp;may read the article &amp;nbsp;&amp;nbsp;&lt;A href="https://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-linkage-and-distribution-quick-reference-guide"&gt;https://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-linkage-and-distribution-quick-reference-guide&lt;/A&gt;&amp;nbsp; and find some suitable link model. &lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;3. the symmetric feature may or may not helps. The problem may be&amp;nbsp;caused by small matrix or less optimize in&amp;nbsp;MKL functions. Let's test if you&amp;nbsp;provide&amp;nbsp;the test code. &lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;Best Regards&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;Ying&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Mon, 12 Jun 2017 06:10:57 GMT</pubDate>
    <dc:creator>Ying_H_Intel</dc:creator>
    <dc:date>2017-06-12T06:10:57Z</dc:date>
    <item>
      <title>Small matrix speed optimization</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Small-matrix-speed-optimization/m-p/1126553#M25284</link>
      <description>&lt;P&gt;Hello all,&lt;/P&gt;

&lt;P&gt;since I can now run the mkl library 2017, I have a couple of follow-up questions that hopefully deserve a thread of their own. I am doing some mode matching, and consequently I need matrix inversions on matrices of the order 10x10 up to 50x50 most of the time (the maximum size would be somewhere around 200x200, but very rarely; they will almost exclusively be in the 10-50 range). I have optimized the non-MKL parts of the code so they are under 10% of the total simulation time, so any speedup on the MKL functions would be greatly beneficial, if possible. I have set mkl_num_threads to max, release mode, ia32, O2 optimization, optimized for speed and so forth to make it as fast as I currently know how. I only have a couple of matrices to invert per frequency point (5 to 6), and the code must execute one frequency point at a time. My questions are as follows:&lt;/P&gt;

&lt;P&gt;1) Is there a way to improve the performance of the MKL functions in any way (by setting some flags in the program itself, or in Visual Studio, or am I missing some functions that are better in this situation, or something else completely), either in the mkl 2017 or the old mkl 10.0.012 that I have? I am using cblas_zgemm, cblas_zdscal, zgetri, zgetrf, vzSqrt, cblas_zaxpy, but most of the time is spent in matrix inversion, so zgetri and zgetrf take most of the time. Are there better functions than these, or can I set some additional flags to make them faster?&lt;/P&gt;

&lt;P&gt;2) Since the optimization is for speed and not size, I of course expected the output (exe) to be bigger, but can someone explain and/or help me optimize the exe size with mkl 2017? It is 3x bigger than with the old mkl 10.0.012. This is unfortunately a small problem, and I would be very happy if it can be mitigated in any way (without optimizing for size).&lt;/P&gt;

&lt;P&gt;3) Some of the matrices are symmetric, and I was hoping that with the symmetric versions of zgetrf and zgetri, that is zsytrf and zsytri, I would theoretically get a 2x speedup, but for some reason the speed is the same. Is this expected? Are my matrices too small for any noticeable effect? Am I missing something? In both cases I feed the functions the full matrix to invert, and while debugging I can see that only half of the matrix elements are calculated by the symmetric functions (I fill in the symmetric elements myself), but there is no speedup.&lt;/P&gt;

&lt;P&gt;Any information, even if not good, is welcome. Thank you all in advance.&lt;/P&gt;
      <pubDate>Sun, 11 Jun 2017 19:29:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Small-matrix-speed-optimization/m-p/1126553#M25284</guid>
      <dc:creator>marko_m_</dc:creator>
      <dc:date>2017-06-11T19:29:20Z</dc:date>
    </item>
    <item>
      <title>Hi Marko,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Small-matrix-speed-optimization/m-p/1126554#M25285</link>
      <description>&lt;P&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;Hi Marko, &lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;It is nice topic for discuss here.&amp;nbsp;&amp;nbsp; If possible, could you attach your test code, especially zgetri, zgetrf code, so we can test at same basic line? &lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;Some comments about your questions&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;1. the matrix size 10x10 to 50x50. &lt;/SPAN&gt;&lt;SPAN lang="ZH-CN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Microsoft YaHei&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN; mso-bidi-font-family: &amp;quot;Microsoft YaHei&amp;quot;;"&gt;（&lt;/SPAN&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;max 200x200).&amp;nbsp; right, the matrix size seem&amp;nbsp;too small to optimize by any ways.&amp;nbsp; &lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;But as you tried,&amp;nbsp; Here&amp;nbsp; is&amp;nbsp;some tips to improve the performance or reduce overhead of MKL&amp;nbsp;way , &lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;1.1&amp;nbsp;may you please &amp;nbsp;try -DMKL_DIRECT_CALL_SEQ&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;&lt;A href="https://software.intel.com/en-us/articles/improve-intel-mkl-performance-for-small-problems-the-use-of-mkl-direct-call"&gt;https://software.intel.com/en-us/articles/improve-intel-mkl-performance-for-small-problems-the-use-of-mkl-direct-call&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;&lt;A href="https://software.intel.com/en-us/mkl-linux-developer-guide-limitations-of-the-direct-call"&gt;https://software.intel.com/en-us/mkl-linux-developer-guide-limitations-of-the-direct-call&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;1. 2. what kind of CPU are you using? &amp;nbsp;try sequential MKL or&amp;nbsp;set OMP_NUM_THREADS=1,2, 4 etc&amp;nbsp; and find the best performance. (as for small workload, it may not best solution to set max threads. &lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;1.3 other tips, like array alignments etc.&amp;nbsp; &lt;A href="https://software.intel.com/en-us/mkl-linux-developer-guide-other-tips-and-techniques-to-improve-performance"&gt;https://software.intel.com/en-us/mkl-linux-developer-guide-other-tips-and-techniques-to-improve-performance&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;2. Regarding the exe size.&amp;nbsp;How do you link mkl in your code?&amp;nbsp;&amp;nbsp;Maybe&amp;nbsp;the later version add more CPU-specific optimized code thus bigger size.&amp;nbsp; You&amp;nbsp;may read the article &amp;nbsp;&amp;nbsp;&lt;A href="https://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-linkage-and-distribution-quick-reference-guide"&gt;https://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-linkage-and-distribution-quick-reference-guide&lt;/A&gt;&amp;nbsp; and find some suitable link model. &lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;3. the symmetric feature may or may not helps. The problem may be&amp;nbsp;caused by small matrix or less optimize in&amp;nbsp;MKL functions. Let's test if you&amp;nbsp;provide&amp;nbsp;the test code. &lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;Best Regards&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN lang="EN" style="color: rgb(83, 87, 94); font-family: &amp;quot;Arial&amp;quot;,sans-serif; font-size: 9.5pt; mso-ansi-language: EN;"&gt;Ying&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 12 Jun 2017 06:10:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Small-matrix-speed-optimization/m-p/1126554#M25285</guid>
      <dc:creator>Ying_H_Intel</dc:creator>
      <dc:date>2017-06-12T06:10:57Z</dc:date>
    </item>
    <item>
      <title>Hi Ying,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Small-matrix-speed-optimization/m-p/1126555#M25286</link>
      <description>&lt;P&gt;Hi Ying,&lt;/P&gt;

&lt;P&gt;Thanks for all the suggestions. This is more out of curiosity, since there is a big difference between the MKL versions I am using. Essentially there is no noticeable difference in speed for me between mkl 10.0.012 and mkl 2017, but I was hoping that the new version might have some options that I don't know about or have overlooked. As for the suggestions, I have tried:&lt;/P&gt;

&lt;P&gt;1.1 When I add MKL_DIRECT_CALL_SEQ or MKL_DIRECT_CALL to the preprocessor, I get errors of the form "cannot convert argument k from complex&amp;lt;double&amp;gt; * to const MKL_Complex16 *". I pass some complex&amp;lt;double&amp;gt; variables as arguments to MKL functions; before using this preprocessor directive I did not get these errors, and the produced results were correct.&lt;/P&gt;
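
&lt;P&gt;(Update: one possible workaround, assuming these errors come from the direct-call macros' stricter typing, is to override MKL's complex type before including the headers; MKL allows predefining MKL_Complex16. A minimal sketch:)&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#include &amp;lt;complex&amp;gt;
// must appear before any MKL header so the prototypes use std::complex directly
#define MKL_Complex16 std::complex&amp;lt;double&amp;gt;
#define MKL_DIRECT_CALL_SEQ
#include "mkl.h"&lt;/PRE&gt;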

&lt;P&gt;1.2 My CPU is an Intel Core i7-2670QM @ 2.20 GHz. I have tried testing with different numbers of threads; the max number always gave the smallest time, but only by a little: maybe a 15% difference between 1 thread and 4 threads, and 2 and 4 are almost the same.&lt;/P&gt;

&lt;P&gt;1.3 I have not tried this, but will try to see if it helps.&lt;/P&gt;

&lt;P&gt;2. It is static linking, but I only used the automatic inclusion of MKL in Visual Studio. I will try individual libraries, like in the old MKL version, to see if I can reduce the size of the exe.&lt;/P&gt;

&lt;P&gt;My example code is below. The first function checks if the matrix is symmetric, comparing the real and imaginary parts separately to a given tolerance. The second function uses the first one to determine the input matrix symmetry and, if symmetric, uses the symmetric MKL functions. The third function is for matrix multiplication: two matrices in, one out, nothing special. There could be errors in my code, or a total misunderstanding of what the functions should do, so if anybody has any complaints, feel free to express them. I ran the Visual Studio profiler and diagnostics on my release version: 77.2% of the time is in libiomp5md.dll (I don't know if this can be optimized), and of the other 22.8%, about 16% is in zgemm, zgetrf and zgetri (for a 40x40 matrix test). This was done for 1000 frequency points, to have some reasonable time measurement, so 77.2% and the other values are after 1000 calls to my top-level function. In the matrix multiply function, there are const complex&amp;lt;double&amp;gt; alpha (1.0,0.0), beta(0.0,0.0).&lt;/P&gt;

&lt;P&gt;If I missed anything please ask, and thank you all in advance for the advice.&lt;/P&gt;

&lt;P&gt;Matrices are allocated with new MKL_Complex16[N*N], so maybe the alignment would do something.&lt;/P&gt;
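
&lt;P&gt;(For the alignment experiment, a minimal sketch using MKL's aligned allocator instead of new would be:)&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;// mkl_malloc returns a buffer aligned to the requested boundary (64 bytes here,
// matching a cache line); pair it with mkl_free instead of delete[]
MKL_Complex16 *matrix = (MKL_Complex16 *) mkl_malloc(N * N * sizeof(MKL_Complex16), 64);
/* ... use the matrix ... */
mkl_free(matrix);&lt;/PRE&gt;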

&lt;PRE class="brush:cpp;"&gt;bool m_check_if_matrix_symmetric(MKL_Complex16 * matrix, int N){
	bool result = true, test_real_part, test_imag_part;
	int i, j;
	double num_real_1, num_real_2, eps = 1e-10;

	for (i = 0; i &amp;lt; N; i++){
		for (j = i+1; j &amp;lt; N; j++){
			num_real_1 = matrix[i*N + j].real;
			num_real_2 = matrix[j*N + i].real;

			test_real_part = fabs(num_real_1 - num_real_2) &amp;lt;= ((fabs(num_real_1) &amp;lt; fabs(num_real_2) ? fabs(num_real_2) : fabs(num_real_1)) * eps);

			num_real_1 = matrix[i*N + j].imag;
			num_real_2 = matrix[j*N + i].imag;

			test_imag_part = fabs(num_real_1 - num_real_2) &amp;lt;= ((fabs(num_real_1) &amp;lt; fabs(num_real_2) ? fabs(num_real_2) : fabs(num_real_1)) * eps);

			if (!test_real_part || !test_imag_part){
				result = false;
				break;
			}
		}
		if (!result) break; // also exit the outer loop once an asymmetry is found
	}

	return result;
}

void m_matrix_inversion(MKL_Complex16 *&amp;amp; matrix, int N){
	int i, j;
	bool test_if_symm;

	int *NLi = new int(N), *MLi = new int(N), *ldaLi = new int(N), *ipivLi = new int[N], *infoLi = new int(0); // ipivLi must be an array of N pivot indices
	int lworkLi = -1;
	MKL_Complex16 *dummywork = new MKL_Complex16[1];

	test_if_symm = m_check_if_matrix_symmetric(matrix, N);

	if (test_if_symm){
		zsytrf("U", NLi, matrix, ldaLi, ipivLi, dummywork, &amp;amp;lworkLi, infoLi);
	}else{
		zgetri(NLi, matrix, ldaLi, ipivLi, dummywork, &amp;amp;lworkLi, infoLi);
	}

	lworkLi = (int)(dummywork[0].real);
	MKL_Complex16 *work = new MKL_Complex16[lworkLi];

	if (test_if_symm){
		zsytrf("U", NLi, matrix, ldaLi, ipivLi, work, &amp;amp;lworkLi, infoLi);
		zsytri("U", NLi, matrix, ldaLi, ipivLi, work, infoLi);

		for (i = 0; i &amp;lt; N; i++){
			for (j = i + 1; j &amp;lt; N; j++){
				matrix[i*N + j] = matrix[j*N + i];
			}
		}
	}else{
		zgetrf(MLi, NLi, matrix, ldaLi, ipivLi, infoLi);
		zgetri(NLi, matrix, ldaLi, ipivLi, work, &amp;amp;lworkLi, infoLi);
	}

	delete[] dummywork;
	delete[] work;
	delete NLi;
	delete MLi;
	delete ldaLi;
	delete[] ipivLi;
	delete infoLi;

	return;
}&lt;/PRE&gt;

&lt;PRE class="brush:cpp;"&gt;void m_matrix_multiply(MKL_Complex16 *&amp;amp; matrix1, MKL_Complex16 *&amp;amp; matrix2, MKL_Complex16 *&amp;amp; matrix_res, int M, int N, int P, char * order, char * trans1, char * trans2){
	CBLAS_TRANSPOSE tr1 = CblasNoTrans, tr2 = CblasNoTrans;
	CBLAS_ORDER ord;
	int prv, sec;
	if (strcmp(trans1, "N") == 0){ // strcmp (from cstring) compares contents; == on char* compares pointers
		tr1 = CblasNoTrans;
		prv = P;
	}
	else if (strcmp(trans1, "T") == 0){
		tr1 = CblasTrans;
		prv = M;
	}
	else if (strcmp(trans1, "C") == 0){
		tr1 = CblasConjTrans;
		prv = M;
	}

	if (strcmp(trans2, "N") == 0){
		tr2 = CblasNoTrans;
		sec = N;
	}
	else if (strcmp(trans2, "T") == 0){
		tr2 = CblasTrans;
		sec = P;
	}
	else if (strcmp(trans2, "C") == 0){
		tr2 = CblasConjTrans;
		sec = P;
	}

	if (strcmp(order, "R") == 0){
		ord = CblasRowMajor;
	}
	else{
		ord = CblasColMajor;
	}

	cblas_zgemm(ord, tr1, tr2, M, N, P, &amp;amp;alpha, matrix1, prv, matrix2, sec, &amp;amp;beta, matrix_res, sec);
	return;
}&lt;/PRE&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 12 Jun 2017 18:40:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Small-matrix-speed-optimization/m-p/1126555#M25286</guid>
      <dc:creator>marko_m_</dc:creator>
      <dc:date>2017-06-12T18:40:00Z</dc:date>
    </item>
    <item>
      <title>Hi Marko,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Small-matrix-speed-optimization/m-p/1126556#M25287</link>
      <description>&lt;P&gt;Hi Marko,&lt;/P&gt;

&lt;P&gt;thank you for sharing. We will check the code and the performance.&lt;/P&gt;

&lt;P&gt;Generally speaking, you are right: it is hard to achieve any obvious improvement for small-matrix operations with these tips, as small matrices often involve too little computation to leave room for optimization.&lt;/P&gt;

&lt;P&gt;Regarding zgemm, there are other functions that may theoretically save some computation time:&lt;/P&gt;

&lt;P&gt;&lt;SPAN class="fontstyle0"&gt;&lt;STRONG&gt;&lt;FONT color="#0860a8"&gt;cblas_?gemm3m&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;
	&lt;SPAN class="fontstyle1"&gt;&lt;EM&gt;&lt;FONT size="2"&gt;Computes a scalar-matrix-matrix product using matrix multiplications and adds the result to a scalar-matrix&lt;BR /&gt;
	product.&lt;/FONT&gt;&lt;/EM&gt;&lt;/SPAN&gt;&lt;BR /&gt;
	&lt;SPAN class="fontstyle0" style="font-size: 11pt;"&gt;&lt;STRONG&gt;&lt;FONT color="#0860a8"&gt;Syntax&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;
	&lt;SPAN class="fontstyle3"&gt;&lt;FONT size="2"&gt;void cblas_cgemm3m &lt;/FONT&gt;&lt;/SPAN&gt;&lt;SPAN class="fontstyle3" style="font-size: 9pt;"&gt;(&lt;/SPAN&gt;&lt;FONT size="2"&gt;&lt;SPAN class="fontstyle3"&gt;const CBLAS_LAYOUT &lt;/SPAN&gt;&lt;SPAN class="fontstyle4"&gt;&lt;EM&gt;Layout&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN class="fontstyle3"&gt;, const CBLAS_TRANSPOSE &lt;/SPAN&gt;&lt;SPAN class="fontstyle4"&gt;&lt;EM&gt;transa&lt;/EM&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;&lt;SPAN class="fontstyle3"&gt;, const&lt;BR /&gt;
	CBLAS_TRANSPOSE &lt;/SPAN&gt;&lt;SPAN class="fontstyle4"&gt;&lt;EM&gt;transb&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN class="fontstyle3"&gt;, const MKL_INT &lt;/SPAN&gt;&lt;SPAN class="fontstyle4"&gt;&lt;EM&gt;m&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN class="fontstyle3"&gt;, const MKL_INT &lt;/SPAN&gt;&lt;SPAN class="fontstyle4"&gt;&lt;EM&gt;n&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN class="fontstyle3"&gt;, const MKL_INT &lt;/SPAN&gt;&lt;SPAN class="fontstyle4"&gt;&lt;EM&gt;k&lt;/EM&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;&lt;SPAN class="fontstyle3"&gt;, const void&lt;/SPAN&gt;&lt;BR /&gt;
	&lt;SPAN class="fontstyle4"&gt;&lt;EM&gt;*alpha&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN class="fontstyle3"&gt;, const void &lt;/SPAN&gt;&lt;SPAN class="fontstyle4"&gt;&lt;EM&gt;*a&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN class="fontstyle3"&gt;, const MKL_INT &lt;/SPAN&gt;&lt;SPAN class="fontstyle4"&gt;&lt;EM&gt;lda&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN class="fontstyle3"&gt;, const void &lt;/SPAN&gt;&lt;SPAN class="fontstyle4"&gt;&lt;EM&gt;*b&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN class="fontstyle3"&gt;, const MKL_INT &lt;/SPAN&gt;&lt;SPAN class="fontstyle4"&gt;&lt;EM&gt;ldb&lt;/EM&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;&lt;SPAN class="fontstyle3"&gt;, const void&lt;/SPAN&gt;&lt;BR /&gt;
	&lt;SPAN class="fontstyle4"&gt;&lt;EM&gt;*beta&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN class="fontstyle3"&gt;, void &lt;/SPAN&gt;&lt;SPAN class="fontstyle4"&gt;&lt;EM&gt;*c&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN class="fontstyle3"&gt;, const MKL_INT &lt;/SPAN&gt;&lt;SPAN class="fontstyle4"&gt;&lt;EM&gt;ldc&lt;/EM&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;SPAN class="fontstyle3" style="font-size: 9pt;"&gt;);&lt;/SPAN&gt;&lt;BR /&gt;
	&lt;SPAN class="fontstyle3"&gt;&lt;FONT size="2"&gt;void cblas_zgemm3m &lt;/FONT&gt;&lt;/SPAN&gt;&lt;SPAN class="fontstyle3" style="font-size: 9pt;"&gt;(&lt;/SPAN&gt;&lt;FONT size="2"&gt;&lt;SPAN class="fontstyle3"&gt;const CBLAS_LAYOUT &lt;/SPAN&gt;&lt;SPAN class="fontstyle4"&gt;&lt;EM&gt;Layout&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN class="fontstyle3"&gt;, const CBLAS_TRANSPOSE &lt;/SPAN&gt;&lt;SPAN class="fontstyle4"&gt;&lt;EM&gt;transa&lt;/EM&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;&lt;SPAN class="fontstyle3"&gt;, const&lt;BR /&gt;
	CBLAS_TRANSPOSE &lt;/SPAN&gt;&lt;SPAN class="fontstyle4"&gt;&lt;EM&gt;transb&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN class="fontstyle3"&gt;, const MKL_INT &lt;/SPAN&gt;&lt;SPAN class="fontstyle4"&gt;&lt;EM&gt;m&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN class="fontstyle3"&gt;, const MKL_INT &lt;/SPAN&gt;&lt;SPAN class="fontstyle4"&gt;&lt;EM&gt;n&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN class="fontstyle3"&gt;, const MKL_INT &lt;/SPAN&gt;&lt;SPAN class="fontstyle4"&gt;&lt;EM&gt;k&lt;/EM&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;&lt;SPAN class="fontstyle3"&gt;, const void&lt;/SPAN&gt;&lt;BR /&gt;
	&lt;SPAN class="fontstyle4"&gt;&lt;EM&gt;*alpha&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN class="fontstyle3"&gt;, const void &lt;/SPAN&gt;&lt;SPAN class="fontstyle4"&gt;&lt;EM&gt;*a&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN class="fontstyle3"&gt;, const MKL_INT &lt;/SPAN&gt;&lt;SPAN class="fontstyle4"&gt;&lt;EM&gt;lda&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN class="fontstyle3"&gt;, const void &lt;/SPAN&gt;&lt;SPAN class="fontstyle4"&gt;&lt;EM&gt;*b&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN class="fontstyle3"&gt;, const MKL_INT &lt;/SPAN&gt;&lt;SPAN class="fontstyle4"&gt;&lt;EM&gt;ldb&lt;/EM&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;FONT size="2"&gt;&lt;SPAN class="fontstyle3"&gt;, const void&lt;/SPAN&gt;&lt;BR /&gt;
	&lt;SPAN class="fontstyle4"&gt;&lt;EM&gt;*beta&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN class="fontstyle3"&gt;, void &lt;/SPAN&gt;&lt;SPAN class="fontstyle4"&gt;&lt;EM&gt;*c&lt;/EM&gt;&lt;/SPAN&gt;&lt;SPAN class="fontstyle3"&gt;, const MKL_INT &lt;/SPAN&gt;&lt;SPAN class="fontstyle4"&gt;&lt;EM&gt;ldc&lt;/EM&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;SPAN class="fontstyle3" style="font-size: 9pt;"&gt;);&lt;/SPAN&gt;&lt;BR /&gt;
	&lt;SPAN class="fontstyle0" style="font-size: 11pt;"&gt;&lt;STRONG&gt;&lt;FONT color="#0860a8"&gt;Include Files&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;
	&lt;SPAN class="fontstyle5"&gt;&lt;FONT face="Verdana" size="2"&gt;• &lt;/FONT&gt;&lt;/SPAN&gt;&lt;SPAN class="fontstyle3"&gt;&lt;FONT size="2"&gt;mkl.h&lt;/FONT&gt;&lt;/SPAN&gt;&lt;BR /&gt;
	&lt;SPAN class="fontstyle0" style="font-size: 11pt;"&gt;&lt;STRONG&gt;&lt;FONT color="#0860a8"&gt;Description&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;
	&lt;SPAN class="fontstyle5" style="font-size: 9pt;"&gt;&lt;FONT face="Verdana"&gt;The &lt;/FONT&gt;&lt;/SPAN&gt;&lt;SPAN class="fontstyle3"&gt;&lt;FONT size="2"&gt;?gemm3m &lt;/FONT&gt;&lt;/SPAN&gt;&lt;SPAN class="fontstyle5" style="font-size: 9pt;"&gt;&lt;FONT face="Verdana"&gt;routines perform a matrix-matrix operation with general complex matrices. These routines are&lt;BR /&gt;
	similar to the &lt;/FONT&gt;&lt;/SPAN&gt;&lt;SPAN class="fontstyle3" style="color: rgb(8, 96, 168);"&gt;&lt;FONT size="2"&gt;?gemm &lt;/FONT&gt;&lt;/SPAN&gt;&lt;SPAN class="fontstyle5" style="font-size: 9pt;"&gt;&lt;FONT face="Verdana"&gt;routines, but they use &lt;STRONG&gt;fewer matrix multiplication operations &lt;/STRONG&gt;(see &lt;/FONT&gt;&lt;/SPAN&gt;&lt;SPAN class="fontstyle1"&gt;&lt;EM&gt;&lt;FONT size="2"&gt;Application Notes&lt;/FONT&gt;&lt;/EM&gt;&lt;/SPAN&gt;&lt;BR /&gt;
	&lt;SPAN class="fontstyle5" style="font-size: 9pt;"&gt;&lt;FONT face="Verdana"&gt;below).&lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;FONT face="Verdana"&gt;or &lt;/FONT&gt;&lt;SPAN class="fontstyle0"&gt;&lt;STRONG&gt;&lt;FONT color="#0860a8"&gt;cblas_?gemmt&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;
	&lt;SPAN class="fontstyle1"&gt;&lt;EM&gt;&lt;FONT size="2"&gt;Computes a matrix-matrix product with general&lt;BR /&gt;
	matrices but updates only the upper or lower&lt;BR /&gt;
	triangular part of the result matrix&lt;/FONT&gt;&lt;/EM&gt;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN class="fontstyle5" style="font-size: 9pt;"&gt;&lt;FONT face="Verdana"&gt;maybe you can try them. &lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN class="fontstyle5" style="font-size: 9pt;"&gt;&lt;FONT face="Verdana"&gt;secondly, do you have many of such kind of small matrix to do the computation? &lt;/FONT&gt;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;FONT face="Verdana"&gt;If yes, you may try the batch gemm function&lt;/FONT&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN class="fontstyle0"&gt;&lt;STRONG&gt;&lt;FONT color="#0860a8"&gt;cblas_?gemm3m_batch&lt;/FONT&gt;&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;BR /&gt;
	&lt;SPAN class="fontstyle1"&gt;&lt;EM&gt;&lt;FONT size="2"&gt;Computes scalar-matrix-matrix products and adds the&lt;BR /&gt;
	results to scalar matrix products for groups of general&lt;BR /&gt;
	matrices.&lt;/FONT&gt;&lt;/EM&gt;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;A href="https://community.intel.com/legacyfs/online/drupal_files/managed/7b/36/Batch-DGEMM-vs-GEMM-xeon-processor.png"&gt;https://software.intel.com/sites/default/files/managed/7b/36/Batch-DGEMM-vs-GEMM-xeon-processor.png&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;Best Regards,&lt;/P&gt;

&lt;P&gt;Ying&lt;/P&gt;</description>
      <pubDate>Thu, 15 Jun 2017 05:48:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Small-matrix-speed-optimization/m-p/1126556#M25287</guid>
      <dc:creator>Ying_H_Intel</dc:creator>
      <dc:date>2017-06-15T05:48:54Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...matrix inversions on</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Small-matrix-speed-optimization/m-p/1126557#M25288</link>
      <description>&amp;gt;&amp;gt;...matrix inversions on matrices of the order 10x10 up to 50x50 most of the time (the maximum size would be somewhere
&amp;gt;&amp;gt;around 200x200 but very rarely, they will almost exclusively be in the 10-50 range)...

&lt;STRONG&gt;MKL&lt;/STRONG&gt; function overheads are &lt;STRONG&gt;significant&lt;/STRONG&gt; when small matrices are multiplied, and a classic matrix multiplication algorithm outperforms &lt;STRONG&gt;MKL&lt;/STRONG&gt;'s &lt;STRONG&gt;sgemm&lt;/STRONG&gt; for sizes up to &lt;STRONG&gt;2,048&lt;/STRONG&gt;x&lt;STRONG&gt;2,048&lt;/STRONG&gt;. I've spent some time on evaluations, and the performance numbers can be reviewed at: &lt;A href="https://software.intel.com/en-us/articles/performance-of-classic-matrix-multiplication-algorithm-on-intel-xeon-phi-processor-system" target="_blank"&gt;https://software.intel.com/en-us/articles/performance-of-classic-matrix-multiplication-algorithm-on-intel-xeon-phi-processor-system&lt;/A&gt;
      <pubDate>Mon, 19 Jun 2017 22:25:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Small-matrix-speed-optimization/m-p/1126557#M25288</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2017-06-19T22:25:13Z</dc:date>
    </item>
    <item>
      <title>Hi everybody,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Small-matrix-speed-optimization/m-p/1126558#M25289</link>
      <description>&lt;P&gt;Hi everybody,&lt;/P&gt;

&lt;P&gt;First of all, thank you for your effort and advice. I made a sample test program that compares MKL's zgemm with my implementations of matrix multiplication. Unfortunately, it seems my program is at least 2 to 2.5 times slower than MKL (which is expected, but I was full of hope after Sergey's last message). Anyway, I will post the code and the options I used in Visual Studio; if anyone can spot any errors or give me advice on how to speed up my code, I would be very grateful. Even if the answer is negative, that is also good to know. So here goes the code:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#include &amp;lt;complex&amp;gt;
#include &amp;lt;cmath&amp;gt;
#include &amp;lt;vector&amp;gt;
#include &amp;lt;omp.h&amp;gt;
#include &amp;lt;Windows.h&amp;gt;
#include &amp;lt;algorithm&amp;gt;
#include &amp;lt;iomanip&amp;gt;
#include &amp;lt;iostream&amp;gt;


extern "C" {//
#include"mkl_cblas.h"
#include"mkl_lapack.h"
#include "mkl_service.h"
#include "mkl.h"
}//
using namespace std;

const complex&amp;lt;double&amp;gt; alpha(1, 0), beta(0, 0);

#define CACHE_LINE  64  
#define CACHE_ALIGN __declspec(align(CACHE_LINE)) 

/************************************************************************//**
*  @brief         Checks whether two matrices match, element by element
*
*  @param[in]     matrix1       First matrix to compare
*  @param[in]     matrix2       Second matrix, compared to the first element by element
*  @param[in]     N             Square matrix size
*
*  @return                      true if they match, false otherwise
*
*******************************************************************************/
bool check_if_matr_match(MKL_Complex16 *&amp;amp; matrix1, MKL_Complex16 *&amp;amp; matrix2, int N){
	int i, j;
	double matr_1_real, matr_1_imag, matr_2_real, matr_2_imag;

	bool result = true;

	for (i = 0; i &amp;lt; N; i++){
		for (j = 0; j &amp;lt; N; j++){
			matr_1_real = matrix1[i*N + j].real;
			matr_1_imag = matrix1[i*N + j].imag;

			matr_2_real = matrix2[i*N + j].real;
			matr_2_imag = matrix2[i*N + j].imag;

			// '||' rather than the original logical AND: the two bounds can never
			// hold at once, so the original test never reported any mismatch
			if ((fabs(matr_1_real) &amp;gt; 1.0000001*fabs(matr_2_real)) || ((fabs(matr_1_real) &amp;lt; 0.9999999*fabs(matr_2_real)))){
				return false;
			}

			if ((fabs(matr_1_imag) &amp;gt; 1.0000001*fabs(matr_2_imag)) || ((fabs(matr_1_imag) &amp;lt; 0.9999999*fabs(matr_2_imag)))){
				return false;
			}
		}
	}
	return result;
}

/************************************************************************//**
*  @brief         Simple wrapper around mkl zgemm function
*******************************************************************************/

void m_matrix_multiply_mkl(MKL_Complex16 *&amp;amp; matrix1, MKL_Complex16 *&amp;amp; matrix2, MKL_Complex16 *&amp;amp; matrix_res, int M, int N, int P, char * order, char * trans1, char * trans2){
	CBLAS_TRANSPOSE tr1 = CblasNoTrans, tr2 = CblasNoTrans;
	CBLAS_ORDER ord;
	int prv = 0, sec = 0;
	if (string(trans1) == "N"){
		tr1 = CblasNoTrans;
		prv = P;
	}
	else if (string(trans1) == "T"){
		tr1 = CblasTrans;
		prv = M;
	}
	else if (string(trans1) == "C"){
		tr1 = CblasConjTrans;
		prv = M;
	}

	if (string(trans2) == "N"){
		tr2 = CblasNoTrans;
		sec = N;
	}
	else if (string(trans2) == "T"){
		tr2 = CblasTrans;
		sec = P;
	}
	else if (string(trans2) == "C"){
		tr2 = CblasConjTrans;
		sec = P;
	}

	if (string(order) == "R"){
		ord = CblasRowMajor;
	}
	else{
		ord = CblasColMajor;
	}

	cblas_zgemm(ord, tr1, tr2, M, N, P, &amp;amp;alpha, matrix1, prv, matrix2, sec, &amp;amp;beta, matrix_res, sec);
	return;
}

/************************************************************************//**
*  @brief         Simple wrapper around my matrix multiplication function
*
*  Below we have:
*    #define simple_mat_mul                                     the simplest triple-loop matrix multiplication
*    #define simple_mat_mul_better_vars                         in my original program, changing the variables this way gave a 40% speed increase, but in this test program that doesn't appear to be the case
*    #define simple_mat_mul_transpose                           version that first transposes the second matrix and then does the multiplication
*    #define simple_mat_mul_openmp                              like the transpose version, but with OpenMP using 4 threads
*******************************************************************************/

//#define simple_mat_mul
//#define simple_mat_mul_better_vars
//#define simple_mat_mul_transpose
#define simple_mat_mul_openmp
void m_matrix_multiply_mine(MKL_Complex16 *&amp;amp; matrix1, MKL_Complex16 *&amp;amp; matrix2, MKL_Complex16 *&amp;amp; matrix_res, int M, int N, int P, char * order, char * trans1, char * trans2){
	int i, j, k;
#ifndef simple_mat_mul
	double temp_var_1, temp_var_2;
	double c1r, c1i, c2r, c2i;
	MKL_Complex16 *temp_mat = new MKL_Complex16[P*N];
#endif
	for (int i = 0; i &amp;lt; M; i += 1){
		for (int j = 0; j &amp;lt; N; j += 1){
			matrix_res[i*N + j].real = 0.0;
			matrix_res[i*N + j].imag = 0.0;
		}
	}

#ifdef simple_mat_mul
	for (i = 0; i &amp;lt; M; i += 1){
		for (j = 0; j &amp;lt; N; j += 1){
			for (k = 0; k &amp;lt; P; k += 1){
				matrix_res[i*N + j].real += matrix1[i*P + k].real * matrix2[k*N + j].real - matrix1[i*P + k].imag*matrix2[k*N + j].imag;
				matrix_res[i*N + j].imag += matrix1[i*P + k].real * matrix2[k*N + j].imag + matrix1[i*P + k].imag*matrix2[k*N + j].real;
			}
		}
	}
#endif

#ifndef simple_mat_mul
#ifndef simple_mat_mul_better_vars
	for (j = 0; j &amp;lt; N; j += 1){
		for (k = 0; k &amp;lt; P; k += 1){
			temp_mat[j*P + k] = matrix2[k*N + j];
		}
	}
#endif

#ifdef simple_mat_mul_openmp
#pragma omp parallel for private ( i, j, k, temp_var_1, temp_var_2, c1r, c1i, c2r, c2i) num_threads ( 4 )
#endif
	for (i = 0; i &amp;lt; M; i += 1){
		for (j = 0; j &amp;lt; N; j += 1){
			temp_var_1 = 0.0;
			temp_var_2 = 0.0;
			for (k = 0; k &amp;lt; P; k += 1){
				c1r = matrix1[i*P + k].real;
				c1i = matrix1[i*P + k].imag;
#ifdef simple_mat_mul_better_vars
				c2r = matrix2[k*N + j].real;
				c2i = matrix2[k*N + j].imag;
#endif

#ifndef simple_mat_mul_better_vars
				c2r = temp_mat[j*P + k].real;
				c2i = temp_mat[j*P + k].imag;
#endif

				temp_var_1 += c1r*c2r - c1i*c2i;
				temp_var_2 += c1r*c2i + c1i*c2r;

			}
			matrix_res[i*N + j].real = temp_var_1;
			matrix_res[i*N + j].imag = temp_var_2;
		}
	}
	delete[] temp_mat;
#endif

	return;
}

int main(){
	LARGE_INTEGER dsd_starting_time, dsd_ending_time, dsd_elapsed_microseconds_time_mkl, dsd_elapsed_microseconds_time_mine;
	LARGE_INTEGER Frequency;
	CACHE_ALIGN MKL_Complex16 *matr_1, *matr_2, *matr_3, *matr_4;

	unsigned long long int mkl_microsecond, mine_microseconds;

	int num_repetition = 20, num_start = 2, num_stop = 200;
	int iml_1, iml_2, i, j;

	MKL_Set_Num_Threads(MKL_Get_Max_Threads());

	QueryPerformanceFrequency(&amp;amp;Frequency);

	///Loop to traverse matrix sizes from num_start to num_stop
	for (iml_1 = num_start; iml_1 &amp;lt;= num_stop; iml_1++){
		mkl_microsecond = 0;
		mine_microseconds = 0;

		///Repeat num_repetition times, accumulate the total time, then average
		for (iml_2 = 0; iml_2 &amp;lt; num_repetition; iml_2++){

			matr_1 = new MKL_Complex16[iml_1*iml_1];
			matr_2 = new MKL_Complex16[iml_1*iml_1];
			matr_3 = new MKL_Complex16[iml_1*iml_1];
			matr_4 = new MKL_Complex16[iml_1*iml_1];

			for (i = 0; i &amp;lt; iml_1; i++){
				for (j = 0; j &amp;lt; iml_1; j++){
					matr_1[i*iml_1 + j].real = 0.01 + (double)rand() / RAND_MAX*120.0;
					matr_1[i*iml_1 + j].imag = 0.01 + (double)rand() / RAND_MAX*120.0;

					matr_2[i*iml_1 + j].real = 0.01 + (double)rand() / RAND_MAX*120.0;
					matr_2[i*iml_1 + j].imag = 0.01 + (double)rand() / RAND_MAX*120.0;

					matr_3[i*iml_1 + j].real = 0.0;
					matr_3[i*iml_1 + j].imag = 0.0;

					matr_4[i*iml_1 + j].real = 0.0;
					matr_4[i*iml_1 + j].imag = 0.0;
				}
			}

			///MKL zgemm part
			QueryPerformanceCounter(&amp;amp;dsd_starting_time);
			m_matrix_multiply_mkl(matr_1, matr_2, matr_3, iml_1, iml_1, iml_1, "R", "N", "N");
			QueryPerformanceCounter(&amp;amp;dsd_ending_time);
			dsd_elapsed_microseconds_time_mkl.QuadPart = dsd_ending_time.QuadPart - dsd_starting_time.QuadPart;
			dsd_elapsed_microseconds_time_mkl.QuadPart *= 1000000;

			mkl_microsecond += dsd_elapsed_microseconds_time_mkl.QuadPart;

			///my implementation part
			QueryPerformanceCounter(&amp;amp;dsd_starting_time);
			m_matrix_multiply_mine(matr_1, matr_2, matr_4, iml_1, iml_1, iml_1, "R", "N", "N");
			QueryPerformanceCounter(&amp;amp;dsd_ending_time);
			dsd_elapsed_microseconds_time_mine.QuadPart = dsd_ending_time.QuadPart - dsd_starting_time.QuadPart;
			dsd_elapsed_microseconds_time_mine.QuadPart *= 1000000;


			mine_microseconds += dsd_elapsed_microseconds_time_mine.QuadPart;

			///Check if the calculated matrices by both methods give the same (approximately the same) result
			if (!check_if_matr_match(matr_3, matr_4, iml_1)){
				cout &amp;lt;&amp;lt; "ERROR, matrices are not same :  " &amp;lt;&amp;lt; iml_1 &amp;lt;&amp;lt; "\n";
			}


			delete[] matr_1;
			delete[] matr_2;
			delete[] matr_3;
			delete[] matr_4;

		}

		///Output average times for mkl and my implementation
		cout &amp;lt;&amp;lt; "Average times are (for size = " &amp;lt;&amp;lt; iml_1 &amp;lt;&amp;lt; " ):  mkl :   " &amp;lt;&amp;lt; mkl_microsecond / num_repetition / Frequency.QuadPart &amp;lt;&amp;lt; "          mine :   " &amp;lt;&amp;lt; mine_microseconds / num_repetition / Frequency.QuadPart &amp;lt;&amp;lt; "\n";
	}

	return 1;
}&lt;/PRE&gt;
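&lt;P&gt;One measurement caveat with the loop above: the very first zgemm call typically pays one-time library setup (thread creation, internal buffer allocation), and freshly allocated matrices fault in their pages on first touch, both of which inflate the small-size averages. A warm-up-aware timer sketch in portable C++ (a hypothetical helper, using std::chrono in place of QueryPerformanceCounter):&lt;/P&gt;

```cpp
#include <chrono>

// Warm-up-aware timing helper: run the kernel once untimed first, so the
// measurement excludes one-time costs (library initialization, thread
// pool spin-up, page faults on freshly allocated buffers), then average
// many repetitions with the portable std::chrono steady clock.
template <typename F>
double average_microseconds(F&& kernel, int reps) {
    kernel();  // warm-up run, deliberately not timed
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) kernel();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count() / reps;
}
```

&lt;P&gt;Timing all repetitions inside one clock pair also avoids accumulating the counter-read overhead once per iteration, which matters when a single small multiply takes only microseconds.&lt;/P&gt;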

&lt;P&gt;I work with MKL_Complex16 matrices, so double-precision complex is in play. I used MKL version 10.0.012, but I can switch to MKL 2017 if someone thinks it would help. I work in Visual Studio 2013, the target architecture is IA-32, I am in Release mode, and among the additional settings in the project properties I have:&lt;/P&gt;

&lt;P&gt;- Debugging Environment&amp;nbsp;_NO_DEBUG_HEAP=1&lt;/P&gt;

&lt;P&gt;- C/C++ Optimizations /O2 and /Ot&lt;/P&gt;

&lt;P&gt;- C/C++ code generation /MT and /fp:precise&lt;/P&gt;

&lt;P&gt;- C/C++ Language /openmp&lt;/P&gt;

&lt;P&gt;- Linker Optimizations /OPT:REF and /OPT:ICF&lt;/P&gt;

&lt;P&gt;- Linker Advanced Image Has safe exceptions NO&lt;/P&gt;

&lt;P&gt;- It is on Windows (7 in my case)&lt;/P&gt;

&lt;P&gt;- My CPU is an Intel Core i7-2670QM @ 2.20 GHz&lt;/P&gt;

&lt;P&gt;If I missed something please let me know.&lt;/P&gt;

&lt;P&gt;Additional questions:&lt;/P&gt;

&lt;P&gt;1) Does OpenMP clash with the MKL library, or can they coexist in the same project? Is the above code correct in this regard?&lt;/P&gt;

&lt;P&gt;2) Is there any way to improve my code's execution times?&lt;/P&gt;

&lt;P&gt;3) If the answer to question 2) is yes, and I can get it to be faster than the MKL code for small matrices, is the same possible for matrix inversion?&lt;/P&gt;
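&lt;P&gt;On question 2), one restructuring that commonly helps the compiler vectorize (assuming MSVC with /arch:SSE2 or AVX) is storing the real and imaginary parts in separate contiguous arrays, so the inner loop becomes unit-stride double arithmetic. A hypothetical plain-C++ sketch, not an MKL routine:&lt;/P&gt;

```cpp
// Split-layout ("structure of arrays") complex matrix product: the real
// and imaginary parts live in separate contiguous double arrays, so the
// inner loop is plain unit-stride double arithmetic, which compilers can
// auto-vectorize with SSE2/AVX far more readily than interleaved
// real/imaginary pairs. All matrices are n x n, row-major.
void zgemm_split(int n,
                 const double* ar, const double* ai,
                 const double* br, const double* bi,
                 double* cr, double* ci) {
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double sr = 0.0, si = 0.0;
            for (int k = 0; k < n; ++k) {
                // (ar + i*ai)(br + i*bi) = (ar*br - ai*bi) + i(ar*bi + ai*br)
                sr += ar[i*n + k] * br[k*n + j] - ai[i*n + k] * bi[k*n + j];
                si += ar[i*n + k] * bi[k*n + j] + ai[i*n + k] * br[k*n + j];
            }
            cr[i*n + j] = sr;
            ci[i*n + j] = si;
        }
}
```

&lt;P&gt;The cost is converting MKL_Complex16 buffers to and from the split layout at the boundaries, so this only pays off if the multiply loop dominates.&lt;/P&gt;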

&lt;P&gt;Thank you all in advance :D&lt;/P&gt;</description>
      <pubDate>Sat, 24 Jun 2017 23:05:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Small-matrix-speed-optimization/m-p/1126558#M25289</guid>
      <dc:creator>marko_m_</dc:creator>
      <dc:date>2017-06-24T23:05:00Z</dc:date>
    </item>
    <item>
      <title>1. The intel libiomp5</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Small-matrix-speed-optimization/m-p/1126559#M25290</link>
      <description>1. The intel libiomp5 required by mkl will displace the vs omp library and implement all msvc omp calls, includingimportant features like core affinity which Microsoft lacks.
2. Show Qvec-report results.</description>
      <pubDate>Sun, 25 Jun 2017 22:50:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Small-matrix-speed-optimization/m-p/1126559#M25290</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2017-06-25T22:50:50Z</dc:date>
    </item>
    <item>
      <title>I'm not very familiar with</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Small-matrix-speed-optimization/m-p/1126560#M25291</link>
      <description>&lt;P&gt;I'm not very familiar with the degree to which msvc supports AVX in 32-bit mode (why 32-bit mode?).&amp;nbsp; Evidently, x87 mode won't be competitive (don't even bother with vec-report).&amp;nbsp; As msvc doesn't support SSE3 code generation, you will have difficulty getting satisfactory performance with that compiler with /arch:SSE2.&amp;nbsp; You may need to split your data into separate real and imag arrays in order to take advantage of SSE2 or AVX vectorization.&lt;/P&gt;</description>
      <pubDate>Mon, 26 Jun 2017 00:47:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Small-matrix-speed-optimization/m-p/1126560#M25291</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2017-06-26T00:47:35Z</dc:date>
    </item>
    <item>
      <title>Hi Marko,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Small-matrix-speed-optimization/m-p/1126561#M25292</link>
      <description>&lt;P&gt;Hi Marko,&lt;/P&gt;

&lt;P&gt;I ran a test on one machine (i5-6300, Visual Studio 2015, Release, 32-bit) with MKL 2017 Update 2. Basically, when the size is &amp;lt; 32, your own code had better performance than MKL. The INV operation is more involved, so it may be just as good or better to use MKL rather than optimizing your own code.&lt;/P&gt;

&lt;P&gt;Best Regards,&lt;/P&gt;

&lt;P&gt;Ying&lt;/P&gt;

&lt;P&gt;MKL_VERBOSE Intel(R) MKL 2017.0 Update 2 Product build 20170126 for 32-bit Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Win 2.40GHz intel_thread&lt;BR /&gt;
	MKL_VERBOSE ZGEMM(N,N,2,2,2,003F6290,00C9F4E0,2,00C97E90,2,003F6280,00C9F528,2) 68.62us CNR:OFF Dyn:1 FastMM:1 TID:0&amp;nbsp; NThr:2&lt;BR /&gt;
	Average times are (for size = 2 ):&amp;nbsp; mkl :&amp;nbsp;&amp;nbsp; 124654&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; mine :&amp;nbsp;&amp;nbsp; 9908&lt;BR /&gt;
	MKL_VERBOSE ZGEMM(N,N,12,12,12,003F6290,00CB5B10,12,00CB5208,12,003F6280,00CB6418,12) 40.15us CNR:OFF Dyn:1 FastMM:1 TID:0&amp;nbsp; NThr:2&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;C:\Users\yhu5\Desktop\eigen\Fiona\Release&amp;gt;EU.exe&lt;BR /&gt;
	Average times are (for size = 2 ):&amp;nbsp; mkl :&amp;nbsp;&amp;nbsp; 144&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; mine :&amp;nbsp;&amp;nbsp; 244&lt;BR /&gt;
	Average times are (for size = 12 ):&amp;nbsp; mkl :&amp;nbsp;&amp;nbsp; 82&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; mine :&amp;nbsp;&amp;nbsp; 23&lt;BR /&gt;
	Average times are (for size = 22 ):&amp;nbsp; mkl :&amp;nbsp;&amp;nbsp; 55&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; mine :&amp;nbsp;&amp;nbsp; 30&lt;BR /&gt;
	Average times are (for size = 32 ):&amp;nbsp; mkl :&amp;nbsp;&amp;nbsp; 54&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; mine :&amp;nbsp;&amp;nbsp; 97&lt;BR /&gt;
	Average times are (for size = 42 ):&amp;nbsp; mkl :&amp;nbsp;&amp;nbsp; 59&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; mine :&amp;nbsp;&amp;nbsp; 119&lt;BR /&gt;
	Average times are (for size = 52 ):&amp;nbsp; mkl :&amp;nbsp;&amp;nbsp; 111&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; mine :&amp;nbsp;&amp;nbsp; 1196&lt;BR /&gt;
	Average times are (for size = 62 ):&amp;nbsp; mkl :&amp;nbsp;&amp;nbsp; 146&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; mine :&amp;nbsp;&amp;nbsp; 342&lt;BR /&gt;
	Average times are (for size = 72 ):&amp;nbsp; mkl :&amp;nbsp;&amp;nbsp; 171&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; mine :&amp;nbsp;&amp;nbsp; 546&lt;BR /&gt;
	Average times are (for size = 82 ):&amp;nbsp; mkl :&amp;nbsp;&amp;nbsp; 271&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; mine :&amp;nbsp;&amp;nbsp; 781&lt;BR /&gt;
	Average times are (for size = 92 ):&amp;nbsp; mkl :&amp;nbsp;&amp;nbsp; 516&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; mine :&amp;nbsp;&amp;nbsp; 783&lt;BR /&gt;
	Average times are (for size = 102 ):&amp;nbsp; mkl :&amp;nbsp;&amp;nbsp; 477&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; mine :&amp;nbsp;&amp;nbsp; 1007&lt;BR /&gt;
	Average times are (for size = 112 ):&amp;nbsp; mkl :&amp;nbsp;&amp;nbsp; 695&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; mine :&amp;nbsp;&amp;nbsp; 1440&lt;BR /&gt;
	Average times are (for size = 122 ):&amp;nbsp; mkl :&amp;nbsp;&amp;nbsp; 768&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; mine :&amp;nbsp;&amp;nbsp; 1469&lt;BR /&gt;
	Average times are (for size = 132 ):&amp;nbsp; mkl :&amp;nbsp;&amp;nbsp; 940&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; mine :&amp;nbsp;&amp;nbsp; 3289&lt;BR /&gt;
	Average times are (for size = 142 ):&amp;nbsp; mkl :&amp;nbsp;&amp;nbsp; 2289&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; mine :&amp;nbsp;&amp;nbsp; 6033&lt;BR /&gt;
	Average times are (for size = 152 ):&amp;nbsp; mkl :&amp;nbsp;&amp;nbsp; 1449&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; mine :&amp;nbsp;&amp;nbsp; 4930&lt;BR /&gt;
	Average times are (for size = 162 ):&amp;nbsp; mkl :&amp;nbsp;&amp;nbsp; 2230&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; mine :&amp;nbsp;&amp;nbsp; 6156&lt;BR /&gt;
	Average times are (for size = 172 ):&amp;nbsp; mkl :&amp;nbsp;&amp;nbsp; 2371&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; mine :&amp;nbsp;&amp;nbsp; 5887&lt;BR /&gt;
	Average times are (for size = 182 ):&amp;nbsp; mkl :&amp;nbsp;&amp;nbsp; 2783&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; mine :&amp;nbsp;&amp;nbsp; 5961&lt;BR /&gt;
	Average times are (for size = 192 ):&amp;nbsp; mkl :&amp;nbsp;&amp;nbsp; 3705&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; mine :&amp;nbsp;&amp;nbsp; 6690&lt;/P&gt;</description>
      <pubDate>Wed, 05 Jul 2017 08:53:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Small-matrix-speed-optimization/m-p/1126561#M25292</guid>
      <dc:creator>Ying_H_Intel</dc:creator>
      <dc:date>2017-07-05T08:53:03Z</dc:date>
    </item>
    <item>
      <title>By the way, few weeks ago I</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Small-matrix-speed-optimization/m-p/1126562#M25293</link>
      <description>&lt;P&gt;By the way, few weeks ago I encountered an article of Intel about Intel MKL for improvement in MKL when one wants to multiply many matrices by the same matrix.&lt;/P&gt;

&lt;P&gt;Anyone knows the link to this article?&lt;/P&gt;

&lt;P&gt;Thank You.&lt;/P&gt;</description>
      <pubDate>Fri, 22 Jun 2018 21:58:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Small-matrix-speed-optimization/m-p/1126562#M25293</guid>
      <dc:creator>Royi</dc:creator>
      <dc:date>2018-06-22T21:58:04Z</dc:date>
    </item>
    <item>
      <title>Hi Royi,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Small-matrix-speed-optimization/m-p/1126563#M25294</link>
      <description>&lt;P&gt;Hi Royi,&lt;BR /&gt;
	&lt;BR /&gt;
	do you mean the other small-matrix optimization, the &lt;I&gt;compact format&lt;/I&gt; in MKL?&lt;BR /&gt;
	&lt;A href="https://software.intel.com/en-us/articles/intelr-math-kernel-library-introducing-vectorized-compact-routines" target="_blank"&gt;https://software.intel.com/en-us/articles/intelr-math-kernel-library-introducing-vectorized-compact-routines&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;One for larger matrix:&lt;BR /&gt;
	Packed APIs for GEMM&lt;BR /&gt;
	&lt;A href="https://software.intel.com/en-us/articles/introducing-the-new-packed-apis-for-gemm" target="_blank"&gt;https://software.intel.com/en-us/articles/introducing-the-new-packed-apis-for-gemm&lt;/A&gt;&lt;BR /&gt;
	&lt;BR /&gt;
	Best Regards,&lt;BR /&gt;
	Ying&lt;/P&gt;

</description>
      <pubDate>Mon, 25 Jun 2018 03:38:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Small-matrix-speed-optimization/m-p/1126563#M25294</guid>
      <dc:creator>Ying_H_Intel</dc:creator>
      <dc:date>2018-06-25T03:38:07Z</dc:date>
    </item>
  </channel>
</rss>

