Using VS2013; CPU: Intel(R) Core(TM) i7-4790; MKL 2017, sequential.
Here is a test:
#include <stdio.h>
#include <string.h>
#include "mkl.h"
#include <time.h>
#include <pmmintrin.h>   /* SSE3 intrinsics (_mm_hadd_ps) */
void matrixMultiplicationNormal(float _r[4], float m[4][4], float v[4])
{
    float r[4];
    r[0] = m[0][0] * v[0] + m[0][1] * v[1] + m[0][2] * v[2] + m[0][3] * v[3];
    r[1] = m[1][0] * v[0] + m[1][1] * v[1] + m[1][2] * v[2] + m[1][3] * v[3];
    r[2] = m[2][0] * v[0] + m[2][1] * v[1] + m[2][2] * v[2] + m[2][3] * v[3];
    r[3] = m[3][0] * v[0] + m[3][1] * v[1] + m[3][2] * v[2] + m[3][3] * v[3];
    memcpy(_r, r, 16);   // copy the 4 result floats (16 bytes) to the output
}
void matrixMultiplicationsse3(float _r[4], float m[4][4], float v[4])
{
    // reinterpret each 16-byte-aligned row of m, and v, as a __m128
    __m128 *matrix = (__m128 *)m, *vector = (__m128 *)v;
    __m128 x = _mm_mul_ps(matrix[0], *vector);
    __m128 y = _mm_mul_ps(matrix[1], *vector);
    __m128 z = _mm_mul_ps(matrix[2], *vector);
    __m128 w = _mm_mul_ps(matrix[3], *vector);
    __m128 tmp1 = _mm_hadd_ps(x, y);            // = [y2+y3, y0+y1, x2+x3, x0+x1]
    __m128 tmp2 = _mm_hadd_ps(z, w);            // = [w2+w3, w0+w1, z2+z3, z0+z1]
    _mm_storeu_ps(_r, _mm_hadd_ps(tmp1, tmp2)); // = [w0+w1+w2+w3, z0+z1+z2+z3, y0+y1+y2+y3, x0+x1+x2+x3]
}
void matrixMultiplicationMKL(float _r[4], float m[4][4], float v[4])
{
    // _r = 1 * m * v + 0 * _r, row-major 4x4 matrix, stride-1 vectors
    cblas_sgemv(CblasRowMajor, CblasNoTrans, 4, 4, 1, (float *)m, 4, v, 1, 0, _r, 1);
}
int main()
{
    __declspec(align(32)) float outNormal[4] = { 0 };
    __declspec(align(32)) float outMKL[4] = { 0 };
    __declspec(align(32)) float outsse3[4] = { 0 };
    __declspec(align(32)) float in[4] = { 1.0, 2.3, 5.5, 4.4 };
    __declspec(align(32)) float matrix[4][4] = { 1, 0, 0, 2,
                                                 0, 1, 0, 2,
                                                 0, 0, 1, 3,
                                                 0, 0, 0, 1 };
    int s1, s2, s3;   // elapsed times in clock ticks (milliseconds on Windows)

    int t = clock();
    for (int i = 0; i < 5000000; i++)
        matrixMultiplicationNormal(outNormal, matrix, in);
    s1 = clock() - t;

    t = clock();
    for (int i = 0; i < 5000000; i++)
        matrixMultiplicationsse3(outsse3, matrix, in);
    s2 = clock() - t;

    t = clock();
    for (int i = 0; i < 5000000; i++)
        matrixMultiplicationMKL(outMKL, matrix, in);
    s3 = clock() - t;

    printf("use normal time=%dms\nuse sse3 time=%dms\nuse intel mkl time=%dms", s1, s2, s3);
    return 0;
}
use normal time = 218 ms
use sse3 time = 156 ms
use intel mkl time = 374 ms
Is Intel MKL slower for small matrix-vector multiplication?
Dear customer,
For small matrices, the library's initialization time can be substantial when timing matrix-matrix or matrix-vector multiplication. Using a small matrix for the first call will not initialize the threads, since Intel MKL executes multi-threaded code only for sufficiently large matrices. Your post is valuable, and I will pass the feedback on to the development team. If you plan to work with very small matrices (smaller than about 20 x 20), I recommend using plain C code. Another point: in your C version, a significant share of the time is actually spent in the memory-to-memory copy, not in the matrix calculation itself.
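For illustration, here is a sketch of a plain C variant that writes directly into the output and skips the copy (the name matrixMultiplicationNormalNoCopy is only an example; it assumes _r does not alias m or v, which is the case in your test):

void matrixMultiplicationNormalNoCopy(float _r[4], float m[4][4], float v[4])
{
    /* write each dot product straight into _r; no temporary buffer, no memcpy */
    for (int i = 0; i < 4; i++)
        _r[i] = m[i][0] * v[0] + m[i][1] * v[1]
              + m[i][2] * v[2] + m[i][3] * v[3];
}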
Best regards,
Fiona
Thanks.
Hi Sun,
As you can expect, there are overheads for calling a library; these are more apparent when the amount of computation done is small. In Intel MKL 11.2, a new feature (direct call) was introduced, designed to help address this problem (see https://software.intel.com/en-us/articles/improve-intel-mkl-performance-for-small-problems-the-use-of-mkl-direct-call for more details). While sgemv is not currently supported in direct call, you can replace the gemv call with one to gemm, which is supported:
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 4, 1, 4, 1, (float *)m, 4, v, 1, 0, _r, 1);
You’ll also need to define MKL_DIRECT_CALL_SEQ when compiling to enable direct call.
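For example, the wrapper from your test could become something like the sketch below (the name matrixMultiplicationMKLDirect is only illustrative; MKL_DIRECT_CALL_SEQ is assumed to be defined on the compiler command line, e.g. /DMKL_DIRECT_CALL_SEQ, before "mkl.h" is included):

/* sketch only: the same 4x4 * 4x1 product expressed as sgemm,
   so that MKL direct call (MKL_DIRECT_CALL_SEQ) can apply to it */
#include "mkl.h"

void matrixMultiplicationMKLDirect(float _r[4], float m[4][4], float v[4])
{
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                4, 1, 4,              /* M = 4, N = 1, K = 4             */
                1.0f, (float *)m, 4,  /* alpha, A (row-major), lda       */
                v, 1,                 /* B = v viewed as 4x1 matrix, ldb */
                0.0f, _r, 1);         /* beta, C = _r viewed as 4x1, ldc */
}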
Thank you,
Sarah