Solved: 4x4 matrix 1x4 vector slower using MKL

sun_c_ · ‎11-17-2016

Using VS2013, CPU: Intel(R) Core(TM) i7-4790

MKL 2017, sequential

Test code:

#include<stdio.h>
#include<string.h>
#include"mkl.h"
#include <time.h>
#include <pmmintrin.h>

void matrixMultiplicationNormal(float _r[4], float m[4][4], float v[4])
{
    float r[4]; /* temporary, so the result stays correct if _r aliases v */

    r[0] = m[0][0] * v[0] + m[0][1] * v[1] + m[0][2] * v[2] + m[0][3] * v[3];
    r[1] = m[1][0] * v[0] + m[1][1] * v[1] + m[1][2] * v[2] + m[1][3] * v[3];
    r[2] = m[2][0] * v[0] + m[2][1] * v[1] + m[2][2] * v[2] + m[2][3] * v[3];
    r[3] = m[3][0] * v[0] + m[3][1] * v[1] + m[3][2] * v[2] + m[3][3] * v[3];

    memcpy(_r, r, sizeof r); /* 4 floats = 16 bytes */
}

void matrixMultiplicationsse3(float _r[4], float m[4][4], float v[4])
{
    /* Dereferencing these pointers requires m and v to be 16-byte aligned. */
    __m128 *matrix = (__m128 *)m, *vector = (__m128 *)v;

    /* Element-wise products of each matrix row with the vector. */
    __m128 x = _mm_mul_ps(matrix[0], *vector);
    __m128 y = _mm_mul_ps(matrix[1], *vector);
    __m128 z = _mm_mul_ps(matrix[2], *vector);
    __m128 w = _mm_mul_ps(matrix[3], *vector);
    /* Lane comments below list the highest lane first. */
    __m128 tmp1 = _mm_hadd_ps(x, y); // = [y2+y3, y0+y1, x2+x3, x0+x1]
    __m128 tmp2 = _mm_hadd_ps(z, w); // = [w2+w3, w0+w1, z2+z3, z0+z1]

    _mm_storeu_ps(_r, _mm_hadd_ps(tmp1, tmp2)); // = [w0+w1+w2+w3, z0+z1+z2+z3, y0+y1+y2+y3, x0+x1+x2+x3]
}

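void matrixMultiplicationMKL(float _r[4], float m[4][4], float v[4])
{
    /* _r = 1 * m * v + 0 * _r, with a 4x4 row-major matrix (lda = 4) and unit strides */
    cblas_sgemv(CblasRowMajor, CblasNoTrans, 4, 4, 1.0f, (float*)m, 4, v, 1, 0.0f, _r, 1);
}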

int main()
{
    __declspec(align(32)) float outNormal[4] = { 0 };
    __declspec(align(32)) float outMKL[4] = { 0 };
    __declspec(align(32)) float outsse3[4] = { 0 };

    __declspec(align(32)) float in[4] = { 1.0f, 2.3f, 5.5f, 4.4f };

    __declspec(align(32)) float matrix[4][4] = { { 1, 0, 0, 2 },
                                                 { 0, 1, 0, 2 },
                                                 { 0, 0, 1, 3 },
                                                 { 0, 0, 0, 1 } };
    int s1, s2, s3;
    int t = clock();
    for (int i = 0; i < 5000000; i++)
        matrixMultiplicationNormal(outNormal, matrix, in);
    s1 = clock() - t;

    t = clock();
    for (int i = 0; i < 5000000; i++)
        matrixMultiplicationsse3(outsse3, matrix, in);
    s2 = clock() - t;

    t = clock();
    for (int i = 0; i < 5000000; i++)
        matrixMultiplicationMKL(outMKL, matrix, in);
    s3 = clock() - t;

    /* clock() ticks are printed as ms, which assumes CLOCKS_PER_SEC == 1000 (as with MSVC) */
    printf("use normal time=%dms\nuse sse3 time=%dms\nuse intel mkl time=%dms", s1, s2, s3);

    return 0;
}

use normal time=218ms

use sse3 time=156ms

use intel mkl time=374ms

Is Intel MKL slower for small matrix-vector multiplication?

Zhen_Z_Intel · ‎11-17-2016

Dear customer,

For small matrices, initialization overhead makes up a substantial part of the measured time for matrix-matrix or matrix-vector multiplication. A first call with a small matrix will not initialize the threads, because Intel MKL executes multi-threaded code only for sufficiently large matrices. Your post is valuable, and I will pass this feedback on to the development team. If you plan to use very small matrices (smaller than 20x20), I recommend using plain C code... Another point: in your C version, much of the time is spent in the memory copy (memcpy), not in the matrix calculation.

Best regards,
Fiona
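
To illustrate the two points above (plain C for tiny matrices, and the cost of the extra copy), here is a minimal sketch rather than a definitive implementation. The names matvec4 and time_matvec4_ms are hypothetical helpers, not part of the original test or of MKL. The sketch assumes a C99 compiler (for `restrict`; MSVC spells it `__restrict`) and assumes the output never aliases the inputs, which is what allows it to skip the temporary array and memcpy. The timing helper makes one untimed warm-up call so that any one-time setup is excluded from the measurement.

```c
#include <time.h>

/* Hypothetical helper: 4x4 row-major matrix times a 4-vector, written
   straight into r. `restrict` is a promise that r does not overlap v,
   so no temporary array and no memcpy are needed. */
static void matvec4(float *restrict r, float m[4][4], const float *restrict v)
{
    for (int i = 0; i < 4; i++) {
        r[i] = m[i][0] * v[0] + m[i][1] * v[1] + m[i][2] * v[2] + m[i][3] * v[3];
    }
}

/* Hypothetical helper: runs one untimed warm-up call, then times `iters`
   calls and returns the elapsed CPU time in milliseconds. */
static double time_matvec4_ms(float *r, float m[4][4], const float *v, long iters)
{
    /* Reading the result into a volatile makes it observable, so the loop
       cannot be removed outright. An optimizer may still hoist the
       loop-invariant product, so treat the numbers as approximate. */
    volatile float sink = 0.0f;

    matvec4(r, m, v); /* warm-up, excluded from timing */

    clock_t start = clock();
    for (long i = 0; i < iters; i++) {
        matvec4(r, m, v);
        sink += r[0];
    }
    clock_t stop = clock();

    /* Convert ticks to milliseconds without assuming CLOCKS_PER_SEC == 1000. */
    return 1000.0 * (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Calling time_matvec4_ms(out, matrix, in, 5000000) with the arrays from the original test would give a figure to compare against the "normal" timing, subject to the optimizer caveat in the comment.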

sun_c_ · ‎11-17-2016

Thanks.

Sarah_K_Intel · ‎11-18-2016

Hi Sun,

As you can expect, there are overheads for calling a library; these are more apparent when the amount of computation done is small. In Intel MKL 11.2, a new feature (direct call) was introduced, designed to help address this problem (see https://software.intel.com/en-us/articles/improve-intel-mkl-performance-for-small-problems-the-use-of-mkl-direct-call for more details). While sgemv is not currently supported in direct call, you can replace the gemv call with one to gemm, which is supported:
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,4,1,4,1,(float*)m,4,v,1,0,_r,1);

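You’ll also need to define the preprocessor macro MKL_DIRECT_CALL_SEQ when compiling (for example, as a compiler-level definition, so that it is set before mkl.h is included) to enable direct call.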

Thank you,

Sarah
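
To show why the gemm call above computes the same result as the original gemv call, here is a small sketch that does not depend on MKL. ref_sgemm_rowmajor and matvec4_as_gemm are hypothetical names: the first is a plain reference implementation of row-major, no-transpose GEMM whose arguments follow the order of cblas_sgemm without the layout and transpose flags, and it is not MKL's code. The second passes the same dimensions as the suggested call: M = 4, N = 1, K = 4, lda = 4, ldb = 1, ldc = 1, so the vector is treated as a 4x1 matrix.

```c
/* Hypothetical reference routine (not part of MKL): row-major,
   no-transpose C = alpha * A * B + beta * C, where A is MxK, B is KxN
   and C is MxN. lda, ldb and ldc are the row strides of A, B and C. */
static void ref_sgemm_rowmajor(int M, int N, int K, float alpha,
                               const float *A, int lda,
                               const float *B, int ldb,
                               float beta, float *C, int ldc)
{
    for (int i = 0; i < M; i++) {
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;
            for (int k = 0; k < K; k++) {
                acc += A[i * lda + k] * B[k * ldb + j];
            }
            /* BLAS convention: with beta == 0, C is overwritten without
               being read. */
            if (beta == 0.0f) {
                C[i * ldc + j] = alpha * acc;
            } else {
                C[i * ldc + j] = alpha * acc + beta * C[i * ldc + j];
            }
        }
    }
}

/* Matrix-vector product expressed as a GEMM with a single column (N = 1),
   using the same dimensions and strides as the suggested cblas_sgemm call. */
static void matvec4_as_gemm(float r[4], float m[4][4], const float v[4])
{
    ref_sgemm_rowmajor(4, 1, 4, 1.0f, &m[0][0], 4, v, 1, 0.0f, r, 1);
}
```

In an MKL build, the reference routine would be replaced by the cblas_sgemm call shown above, compiled with MKL_DIRECT_CALL_SEQ defined. Whether that is faster than the hand-written versions for a 4x4 case is something to measure rather than assume.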