
I am trying to write, in high-level code, a matrix multiply function with tiling at the register level and let the Intel compiler vectorize it. I have tested two different versions. The first one defines the matrices as float** (pointer to pointer to float) so that they can be indexed in the way most intuitive for the programmer:


[cpp]
void multiply(const float **restrict A, const float **restrict B,
              float **restrict C, int dim1, int dim2, int dim3)
{
    int i, j, k;
    float C1[4], C2[4], B1[4], A1, A2;

    for (j = 0; j < dim3-7; j+=4) {
        #pragma vector aligned
        #pragma vector always
        for (i = 0; i < dim1-5; i+=2) {
            /* Load a 2x4 tile of C into the local buffers. */
            C1[0] = C[i][j];
            C1[1] = C[i][j+1];
            C1[2] = C[i][j+2];
            C1[3] = C[i][j+3];
            C2[0] = C[i+1][j];
            C2[1] = C[i+1][j+1];
            C2[2] = C[i+1][j+2];
            C2[3] = C[i+1][j+3];

            #pragma vector aligned
            #pragma vector always
            for (k = 0; k < dim2; k++) {
                B1[0] = B[k][j];
                B1[1] = B[k][j+1];
                B1[2] = B[k][j+2];
                B1[3] = B[k][j+3];

                A1 = A[i][k];
                C1[0] += A1*B1[0];
                C1[1] += A1*B1[1];
                C1[2] += A1*B1[2];
                C1[3] += A1*B1[3];

                A2 = A[i+1][k];
                C2[0] += A2*B1[0];
                C2[1] += A2*B1[1];
                C2[2] += A2*B1[2];
                C2[3] += A2*B1[3];
            }

            /* Store the tile back: C1 holds row i, C2 holds row i+1. */
            C[i][j]   = C1[0];
            C[i][j+1] = C1[1];
            C[i][j+2] = C1[2];
            C[i][j+3] = C1[3];
            C[i+1][j]   = C2[0];
            C[i+1][j+1] = C2[1];
            C[i+1][j+2] = C2[2];
            C[i+1][j+3] = C2[3];
        }
    }
}
[/cpp]

The other version defines the matrices as float* (pointer to float) and indexes the matrix positions manually:

[cpp]
void multiply(const float *restrict A, const float *restrict B,
              float *restrict C, int dim1, int dim2, int dim3)
{
    int i, j, k;
    float C1[4], C2[4], B1[4], A1, A2;

    /* Note: the row strides used here (dim1 for A and C, dim2 for B) are
       only correct for square matrices, i.e. dim1 == dim2 == dim3. */
    for (j = 0; j < dim3-7; j+=4) {
        #pragma vector aligned
        #pragma vector always
        for (i = 0; i < dim1-5; i+=2) {
            /* Load a 2x4 tile of C into the local buffers. */
            C1[0] = C[i*dim1 + j];
            C1[1] = C[i*dim1 + j+1];
            C1[2] = C[i*dim1 + j+2];
            C1[3] = C[i*dim1 + j+3];
            C2[0] = C[(i+1)*dim1 + j];
            C2[1] = C[(i+1)*dim1 + j+1];
            C2[2] = C[(i+1)*dim1 + j+2];
            C2[3] = C[(i+1)*dim1 + j+3];

            #pragma vector aligned
            #pragma vector always
            for (k = 0; k < dim2; k++) {
                B1[0] = B[k*dim2 + j];
                B1[1] = B[k*dim2 + j+1];
                B1[2] = B[k*dim2 + j+2];
                B1[3] = B[k*dim2 + j+3];

                A1 = A[i*dim1 + k];
                C1[0] += A1*B1[0];
                C1[1] += A1*B1[1];
                C1[2] += A1*B1[2];
                C1[3] += A1*B1[3];

                A2 = A[(i+1)*dim1 + k];
                C2[0] += A2*B1[0];
                C2[1] += A2*B1[1];
                C2[2] += A2*B1[2];
                C2[3] += A2*B1[3];
            }

            /* Store the tile back to C. */
            C[i*dim1 + j]   = C1[0];
            C[i*dim1 + j+1] = C1[1];
            C[i*dim1 + j+2] = C1[2];
            C[i*dim1 + j+3] = C1[3];
            C[(i+1)*dim1 + j]   = C2[0];
            C[(i+1)*dim1 + j+1] = C2[1];
            C[(i+1)*dim1 + j+2] = C2[2];
            C[(i+1)*dim1 + j+3] = C2[3];
        }
    }
}
[/cpp]

I compile with the following flags:

[bash]icc -fast -fno-alias -restrict -msse3 -vec-report3[/bash]

The compiler vectorizes the multiplication and the addition in the inner loop with SIMD registers in both versions. The problem is in the copies from C and B into the buffers C1, C2 and B1. For example, the compiler translates the assignment of C[i][j:j+3] to C1[0:3] as:

[cpp]
movss 4(%rdx,%rcx,4), %xmm12
movss 8(%rdx,%rcx,4), %xmm10
movss 12(%rdx,%rcx,4), %xmm11
unpcklps %xmm10, %xmm1
unpcklps %xmm11, %xmm12
unpcklps %xmm12, %xmm1
[/cpp]

where rdx + rcx*4 is the base address of C[i][j].

As the matrices are all 16-byte aligned, the compiler should translate it as:

[cpp]
movaps (%rdx,%rcx,4), %xmm1
[/cpp]

which produces the same result, C[i][j:j+3], in register xmm1.

The output of icc's -vec-report3 flag for these blocks is:

[bash]
remark: loop was not vectorized: unsupported data type.
[/bash]

I have tried different solutions, but all of them give the same result. Is there any way to write the code so that the moves between the matrix elements and the vector registers are vectorized as well? Am I writing it wrong? I hope somebody can help me.

Thank you in advance,

Alejandro
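For comparison, the single aligned load described in the question can be written explicitly with SSE intrinsics. This is not the compiler-vectorized high-level code the question asks for, only a minimal sketch of what the desired movaps corresponds to in source form. The helper name tile_1x4 and the parameter ldb (row stride of B in floats) are assumptions made for illustration; the sketch also assumes that c_row and every row start of B are 16-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_load_ps, _mm_set1_ps, ... */

/* Hypothetical helper, not from the thread: computes
   c_row[0..3] += sum over k of a_row[k] * b[k*ldb + 0..3].
   Assumes c_row and every b + k*ldb are 16-byte aligned. */
void tile_1x4(const float *a_row, const float *b, float *c_row,
              int dim2, int ldb)
{
    assert(((uintptr_t)c_row & 15) == 0);
    assert(((uintptr_t)b & 15) == 0 && ldb % 4 == 0);

    /* One aligned 128-bit load instead of scalar loads plus unpacks. */
    __m128 c = _mm_load_ps(c_row);

    for (int k = 0; k < dim2; k++) {
        __m128 bv = _mm_load_ps(b + (size_t)k * ldb); /* B[k][j..j+3] */
        __m128 av = _mm_set1_ps(a_row[k]);            /* broadcast A[i][k] */
        c = _mm_add_ps(c, _mm_mul_ps(av, bv));
    }

    /* Aligned store back to C[i][j..j+3]. */
    _mm_store_ps(c_row, c);
}
```

Calling this once per row of a tile, with c_row pointing at C[i][j] and b pointing at B[0][j], would cover the same work as the C1 buffer in the code above. The alignment asserts make the 16-byte assumption explicit instead of leaving it to a pragma.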

1 Solution


[cpp]
/* Two #include lines preceded the function in the post; their header names were lost. */
void multiply(const float **restrict A, const float **restrict B,
              float **restrict C, int dim1, int dim2, int dim3)
{
    int i, j, k;
    float C1[4], C2[4], B1[4], A1, A2;

    for (j = 0; j < dim3-7; j+=4) {
        #pragma vector aligned
        #pragma vector always
        for (i = 0; i < dim1-5; i+=2) {
            int dalei;

            /* Load a 2x4 tile of C with short loops instead of scalar assignments. */
            for (dalei=0; dalei<4; dalei++)
                C1[dalei] = C[i][j+dalei];
            for (dalei=0; dalei<4; dalei++)
                C2[dalei] = C[i+1][j+dalei];

            #pragma vector aligned
            #pragma vector always
            for (k = 0; k < dim2; k++) {
                for (dalei=0; dalei<4; dalei++)
                    B1[dalei] = B[k][j+dalei];

                A1 = A[i][k];
                for (dalei=0; dalei<4; dalei++)
                    C1[dalei] += A1*B1[dalei];

                A2 = A[i+1][k];
                for (dalei=0; dalei<4; dalei++)
                    C2[dalei] += A2*B1[dalei];
            }

            /* Store the tile back: C1 holds row i, C2 holds row i+1. */
            #pragma vector always
            for (dalei=0; dalei<4; dalei++)
                C[i][j+dalei] = C1[dalei];
            #pragma vector always
            for (dalei=0; dalei<4; dalei++)
                C[i+1][j+dalei] = C2[dalei];
        }
    }
}
[/cpp]

When I compile this I get:

[bash]
$ icc -g -O2 -fast -fno-alias -restrict -xSSE3 -vec-report3 -S v.c
v.c(9): (col. 9) remark: loop was not vectorized: not inner loop.
v.c(13): (col. 13) remark: loop was not vectorized: not inner loop.
v.c(17): (col. 3) remark: LOOP WAS VECTORIZED.
v.c(20): (col. 3) remark: LOOP WAS VECTORIZED.
v.c(25): (col. 17) remark: loop was not vectorized: not inner loop.
v.c(27): (col. 7) remark: LOOP WAS VECTORIZED.
v.c(33): (col. 7) remark: LOOP WAS VECTORIZED.
v.c(38): (col. 7) remark: LOOP WAS VECTORIZED.
v.c(42): (col. 3) remark: LOOP WAS VECTORIZED.
v.c(46): (col. 3) remark: LOOP WAS VECTORIZED.
[/bash]

So it looks like the vectorizer is missing some of the block code, but if you convert it back to loops it gets it. Does that do what you are looking for?
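The same loop-based rewrite could plausibly be applied to the float* version from the question. The following is a minimal sketch under stated assumptions: the name multiply_flat and the index t are illustrative, the row strides are the general row-major ones (dim2 for A, dim3 for B and C), and dim1 is assumed even and dim3 a multiple of 4 so that no remainder handling is needed. The Intel-specific pragmas are left out, and whether icc emits aligned vector moves for the four-element loops here has not been checked.

```c
#include <assert.h>

/* Sketch: C (dim1 x dim3) += A (dim1 x dim2) * B (dim2 x dim3), row-major.
   Assumes dim1 is even and dim3 is a multiple of 4; these restrictions
   are illustrative and not taken from the thread. */
void multiply_flat(const float *restrict A, const float *restrict B,
                   float *restrict C, int dim1, int dim2, int dim3)
{
    float C1[4], C2[4], B1[4];
    int i, j, k, t;

    assert(dim1 % 2 == 0 && dim3 % 4 == 0);

    for (j = 0; j < dim3; j += 4) {
        for (i = 0; i < dim1; i += 2) {
            /* Load a 2x4 tile of C into the local buffers. */
            for (t = 0; t < 4; t++) C1[t] = C[i * dim3 + j + t];
            for (t = 0; t < 4; t++) C2[t] = C[(i + 1) * dim3 + j + t];

            for (k = 0; k < dim2; k++) {
                const float A1 = A[i * dim2 + k];
                const float A2 = A[(i + 1) * dim2 + k];

                /* Four-element loops in place of unrolled scalar statements. */
                for (t = 0; t < 4; t++) B1[t] = B[k * dim3 + j + t];
                for (t = 0; t < 4; t++) C1[t] += A1 * B1[t];
                for (t = 0; t < 4; t++) C2[t] += A2 * B1[t];
            }

            /* Store the tile back: C1 is row i, C2 is row i+1. */
            for (t = 0; t < 4; t++) C[i * dim3 + j + t] = C1[t];
            for (t = 0; t < 4; t++) C[(i + 1) * dim3 + j + t] = C2[t];
        }
    }
}
```

Because the function accumulates into C, the caller is expected to zero C first when a plain product is wanted. Checking the result against a simple triple loop on a small non-square case is a cheap way to catch stride mistakes.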


4 Replies


Most people are satisfied when they can reach 80% efficiency with compiled code, and they often beat pre-compiled libraries such as ACML or a Linux distribution's BLAS by compiling the public source code. Are you expecting to beat the MKL that comes with icc?


Thanks.



Yes Dale, this is what I was looking for, thanks for the help.
