Intel® oneAPI Math Kernel Library

4x4 matrix 1x4 vector slower using MKL

sun_c_
Beginner

Using VS2013; CPU: Intel(R) Core(TM) i7-4790.

MKL 2017, sequential.

Running the following test:

#include <stdio.h>
#include <string.h>
#include "mkl.h"
#include <time.h>
#include <pmmintrin.h>

 

void matrixMultiplicationNormal(float _r[4], float m[4][4], float v[4])
{
 float r[4];

 r[0] = m[0][0] * v[0] + m[0][1] * v[1] + m[0][2] * v[2] + m[0][3] * v[3];
 r[1] = m[1][0] * v[0] + m[1][1] * v[1] + m[1][2] * v[2] + m[1][3] * v[3];
 r[2] = m[2][0] * v[0] + m[2][1] * v[1] + m[2][2] * v[2] + m[2][3] * v[3];
 r[3] = m[3][0] * v[0] + m[3][1] * v[1] + m[3][2] * v[2] + m[3][3] * v[3];

 memcpy(_r, r, sizeof(r)); /* copy the 4 result floats (16 bytes) to the output */

}

void matrixMultiplicationsse3(float _r[4], float m[4][4], float v[4])
{
 __m128 *matrix = (__m128 *)m, *vector = (__m128 *)v; /* inputs must be 16-byte aligned for these casts to be safe */
 
 __m128 x = _mm_mul_ps(matrix[0], *vector);
 __m128 y = _mm_mul_ps(matrix[1], *vector);
 __m128 z = _mm_mul_ps(matrix[2], *vector);
 __m128 w = _mm_mul_ps(matrix[3], *vector);
 __m128 tmp1 = _mm_hadd_ps(x, y); // = [y2+y3, y0+y1, x2+x3, x0+x1]
 __m128 tmp2 = _mm_hadd_ps(z, w); // = [w2+w3, w0+w1, z2+z3, z0+z1]

 _mm_storeu_ps(_r, _mm_hadd_ps(tmp1, tmp2)); // = [w0+w1+w2+w3, z0+z1+z2+z3, y0+y1+y2+y3, x0+x1+x2+x3]
}

void matrixMultiplicationMKL(float _r[4], float m[4][4], float v[4])
{
 /* _r = 1.0 * m * v + 0.0 * _r */
 cblas_sgemv(CblasRowMajor, CblasNoTrans, 4, 4, 1.0f, (float*)m, 4, v, 1, 0.0f, _r, 1);
}

 

int main()
{

 

 __declspec(align(32)) float outNormal[4] = { 0 };
 __declspec(align(32)) float outMKL[4] = { 0 };
 __declspec(align(32)) float outsse3[4] = { 0 };

 __declspec(align(32)) float in[4] = { 1.0, 2.3, 5.5, 4.4 };

 __declspec(align(32)) float matrix[4][4] = { 1, 0, 0, 2,
                                              0, 1, 0, 2,
                                              0, 0, 1, 3,
                                              0, 0, 0, 1 };
 int s1, s2, s3;
 int t = clock();
 for (int i = 0; i < 5000000; i++)
  matrixMultiplicationNormal(outNormal, matrix, in);
 s1 =  clock()-t;
 t = clock();
 for (int i = 0; i < 5000000; i++)
  matrixMultiplicationsse3(outsse3, matrix, in);
 s2 = clock() - t;

 t = clock();
 for (int i = 0; i < 5000000; i++)
  matrixMultiplicationMKL(outMKL, matrix, in);
 s3 = clock() - t;
 printf("use normal time=%dms\nuse sse3 time=%dms\nuse intel mkl time=%dms", s1, s2, s3);

 return 0;
}

 

use normal time = 218 ms
use sse3 time = 156 ms
use intel mkl time = 374 ms

Is Intel MKL slower for small matrix-vector multiplication?

 

 

3 Replies
Zhen_Z_Intel
Employee

Dear customer,

For small matrices, the initialization overhead can be substantial when timing matrix-matrix or matrix-vector multiplication. A first call with a small matrix also won't initialize the threading, since Intel MKL executes multi-threaded code only for sufficiently large matrices. Your post is valuable and I will give feedback to the development team. If you plan to work with very small matrices (smaller than about 20x20), I recommend using plain C code. Another point: in your C code a significant part of the time is spent in the memory-to-memory copy (memcpy), not in the matrix calculation.
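
For illustration only (a sketch, not benchmarked here; the function name is just for the example), the plain C routine can write each result element straight into the output array instead of going through a local buffer and memcpy, provided the output never aliases the inputs:

void matrixMultiplicationNormalDirect(float _r[4], float m[4][4], float v[4])
{
 /* Store directly into _r; valid only because _r does not overlap m or v
    in the benchmark above. */
 _r[0] = m[0][0] * v[0] + m[0][1] * v[1] + m[0][2] * v[2] + m[0][3] * v[3];
 _r[1] = m[1][0] * v[0] + m[1][1] * v[1] + m[1][2] * v[2] + m[1][3] * v[3];
 _r[2] = m[2][0] * v[0] + m[2][1] * v[1] + m[2][2] * v[2] + m[2][3] * v[3];
 _r[3] = m[3][0] * v[0] + m[3][1] * v[1] + m[3][2] * v[2] + m[3][3] * v[3];
}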

Best regards,
Fiona

sun_c_
Beginner

Thanks.

Sarah_K_Intel
Employee

Hi Sun,

As you can expect, there are overheads for calling a library; these are more apparent when the amount of computation done is small.  In Intel MKL 11.2, a new feature (direct call) was introduced, designed to help address this problem (see https://software.intel.com/en-us/articles/improve-intel-mkl-performance-for-small-problems-the-use-of-mkl-direct-call for more details).  While sgemv is not currently supported in direct call, you can replace the gemv call with one to gemm, which is supported:
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 4, 1, 4, 1, (float*)m, 4, v, 1, 0, _r, 1);

You’ll also need to define MKL_DIRECT_CALL_SEQ when compiling (for the sequential library) to enable direct call.
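
A sketch of how the direct-call variant could look in the test above (the function name is illustrative; the macro can also be passed on the compiler command line, e.g. /DMKL_DIRECT_CALL_SEQ with the Microsoft compiler):

#define MKL_DIRECT_CALL_SEQ   /* must be defined before mkl.h is included */
#include "mkl.h"

void matrixMultiplicationMKLDirect(float _r[4], float m[4][4], float v[4])
{
 /* Treat the vector as a 4x1 matrix: _r(4x1) = 1.0 * m(4x4) * v(4x1) + 0.0 * _r */
 cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
             4, 1, 4, 1.0f, (float*)m, 4, v, 1, 0.0f, _r, 1);
}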

Thank you,

Sarah
