Solved: Problems with vectorization for the matrix multiply function with tiling at register level

windigo84 · ‎10-01-2010

Hi all,

I am trying to write in high level code the matrix multiply function with tiling at the register level and let the intel compiler to vectorize it. I have tested two different versions. The first one defining the matrices as float** (pointer to pointer to float) in order to index them in the most intuitive way for the programmer:

[bash]
[/bash]

[bash]void multiply(const float **restrict A,const float **restrict B,float **restrict C, int dim1, int dim2, int dim3){
	int i, j, k;
	 float C1[4], C2[4], B1[4], A1, A2;
		for (j = 0; j < dim3-7; j+=4)
		{
			#pragma vector aligned
			#pragma vector always
			for (i = 0; i < dim1-5; i+=2)
			{
				C1[0] = C;
				C1[1] = C[j+1];
				C1[2] = C[j+2];
				C1[3] = C[j+3];
				
				C2[0] = C[i+1];
				C2[1] = C[i+1][j+1];
				C2[2] = C[i+1][j+2];
				C2[3] = C[i+1][j+3];
				
				#pragma vector aligned
				#pragma vector always
				for (k = 0; k < dim2; k++)
				{
					B1[0] = B;
					B1[1] = B[j+1];
					B1[2] = B[j+2];
					B1[3] = B[j+3];
					
						A1 = A;
					
					
							C1[0] += A1*B1[0];
							C1[1] += A1*B1[1];
							C1[2] += A1*B1[2];
							C1[3] += A1*B1[3];
							
						A2= A[i+1];
						
							C2[0] += A2*B1[0];
							C2[1] += A2*B1[1];
							C2[2] += A2*B1[2];
							C2[3] += A2*B1[3];
				}
				C = C2[0];
				C[j+1] = C2[1];
				C[j+2] = C2[2];
				C[j+3] = C2[3];
				
				C[i+1] = C1[0];
				C[i+1][j+1] = C1[1];
				C[i+1][j+2] = C1[2];
				C[i+1][j+3] = C1[3];
			}
		}
}[/bash]

The other version is defining the matrices as float* (pointer to float), indexing manually the positions of the matrices:

[cpp]void multiply(const float *restrict A,const float *restrict B,float *restrict C, int dim1, int dim2, int dim3){
	int i, j, k, ii, kk;
	 float C1[4], C2[4], B1[4], A1, A2;

		for (j = 0; j < dim3-7; j+=4)
		{
			#pragma vector aligned
			#pragma vector always
			for (i = 0; i < dim1-5; i+=2)
			{
				C1[0] = C[i*dim1 +j];
				C1[1] = C[i*dim1 +j+1];
				C1[2] = C[i*dim1 +j+2];
				C1[3] = C[i*dim1 +j+3];
				
				C2[0] = C[(i+1)*dim1 +j];
				C2[1] = C[(i+1)*dim1 +j+1];
				C2[2] = C[(i+1)*dim1 +j+2];
				C2[3] = C[(i+1)*dim1 +j+3];
				
				#pragma vector aligned
				#pragma vector always
				for (k = 0; k < dim2; k++)
				{
					B1[0] = B[k*dim2 + j];
					B1[1] = B[k*dim2 + j+1];
					B1[2] = B[k*dim2 + j+2];
					B1[3] = B[k*dim2 + j+3];
				
						A1 = A[(i)*dim1 + k];
					
							C1[0] += A1*B1[0];
							C1[1] += A1*B1[1];
							C1[2] += A1*B1[2];
							C1[3] += A1*B1[3];
							
						A2 = A[(i+1)*dim1 + k];
						
							C2[0] += A2*B1[0];
							C2[1] += A2*B1[1];
							C2[2] += A2*B1[2];
							C2[3] += A2*B1[3];
				}
				C[i*dim1 +j] = C1[0];
				C[i*dim1 +j+1] = C1[1];
				C[i*dim1 +j+2] = C1[2];
				C[i*dim1 +j+3] = C1[3];
				
				C[(i+1)*dim1 +j] = C2[0];
				C[(i+1)*dim1 +j+1] = C2[1];
				C[(i+1)*dim1 +j+2] = C2[2];
				C[(i+1)*dim1 +j+3] = C2[3];
			}
		}
}[/cpp]

I compile with the next flags:

[bash]icc -fast -fno-alias -restrict -msse3 -vec-report3[/bash]

The compiler vectorizes the product and the addition done in the inner loop with SIMD registers for both versions. The problem is in the assignments of C and B to the buffers C1, C2 and B1 respectively. For example, the compiler translates the assignment of C to C1[0:3] as:

[cpp] 
movss     4(%rdx,%rcx,4), %xmm12   
movss     8(%rdx,%rcx,4), %xmm10
movss     12(%rdx,%rcx,4), %xmm11    
unpcklps  %xmm10, %xmm1
unpcklps  %xmm11, %xmm12 
unpcklps  %xmm12, %xmm1[/cpp]

where rdx + rcx*4 are the base address of C.

As the matrices are all 16 bytes aligned the compiler should transate it as:

[cpp]movaps     (%rdx,%rcx,4), %xmm1[/cpp]

which obtains the same result in xmm1 register, C.

The output of the icc for this blocks of the vec-report3 flag is:

[bash]remark: loop was not vectorized: unsupported data type.[/bash]

I have tried different solutions but all of them obtain the same result. Is there any way to write the code in order to vectorize also the movements between the matrix positions and the vector registers? Am I writing it wrong? I hope somebody can help me,

Thank you in advance,

Alejandro

Dale_S_Intel · ‎10-20-2010

Sorry for missing this post earlier. I've been playing around with this and it looks like if you rewrite all the unrolled bits, i.e. replace the blocks of 4 assignments with a loop then the vectorizer will vectorize all of them, though in some cases it thinks it might be inefficient and needs encouragement from "#pragma vector always". Here's the code I came up with, based on your first version above:

[bash]#include 
#include 


void multiply(const float **restrict A,const float **restrict B,float **restrict C, int dim1, int dim2, int dim3)
{
    int i, j, k;  
     float C1[4], C2[4], B1[4], A1, A2;  
        for (j = 0; j < dim3-7; j+=4)  
        {  
            #pragma vector aligned  
            #pragma vector always  
            for (i = 0; i < dim1-5; i+=2)  
            {  
		int dalei;

		for (dalei=0; dalei<4; dalei++)
			C1[dalei] = C[j+dalei];

		for (dalei=0; dalei<4; dalei++)
			C2[dalei] = C[i+1][j+dalei];

                #pragma vector aligned  
                #pragma vector always  
                for (k = 0; k < dim2; k++)  
                {  
		    for (dalei=0; dalei<4; dalei++)
			B1[dalei] = B[j+dalei];

                    A1 = A;  
                      
                      
		    for (dalei=0; dalei<4; dalei++)
                            C1[dalei] += A1*B1[dalei];  
                              
                    A2= A[i+1];  
                          
		    for (dalei=0; dalei<4; dalei++)
                            C2[dalei] += A2*B1[dalei];  
                }  
                #pragma vector always
		for (dalei=0; dalei<4; dalei++)
                    C[j+dalei] = C2[dalei];  
                  
                #pragma vector always
		for (dalei=0; dalei<4; dalei++)
                    C[i+1][j+dalei] = C1[dalei];  
            }  
        }  
}  
[/bash]

When I compile this I get:

[bash]$ icc -g -O2 -fast -fno-alias -restrict -xSSE3 -vec-report3 -S v.c
v.c(9): (col. 9) remark: loop was not vectorized: not inner loop.
v.c(13): (col. 13) remark: loop was not vectorized: not inner loop.
v.c(17): (col. 3) remark: LOOP WAS VECTORIZED.
v.c(20): (col. 3) remark: LOOP WAS VECTORIZED.
v.c(25): (col. 17) remark: loop was not vectorized: not inner loop.
v.c(27): (col. 7) remark: LOOP WAS VECTORIZED.
v.c(33): (col. 7) remark: LOOP WAS VECTORIZED.
v.c(38): (col. 7) remark: LOOP WAS VECTORIZED.
v.c(42): (col. 3) remark: LOOP WAS VECTORIZED.
v.c(46): (col. 3) remark: LOOP WAS VECTORIZED.
[/bash]

So it looks like the vectorize is missing some of the block code, but if you convert it back to loops it gets it. Does that do what you are looking for?

View solution in original post

TimP · ‎10-01-2010

If you believe it is valuable to use parallel moves outside the inner loop, you may need intrinsics.
Most people are satisfied when they can reach 80% efficiency with compiled code, and they often beat pre-compiled libraries such as ACML or linux distro blas by compiling the public source code. Are you expecting to beat the MKL which comes with icc?

windigo84 · ‎10-04-2010

My objective is not to beat the MKL library, just to get close to the performance it achieves with transformations that can be automated in a compiler. I am performing a study on how tiling and vectorization optimizations can be merged in a compiler. To do it first of all I wrote all the code in assembler by hand but now I am trying to write it in high level code without intrinsics and letting the compiler to do the rest job (such as vectorization). The idea is to create a rewritter program that takes a nested loop and rewrites it automatically (without any help of the programmer) applying tiling, vectorization and copy of the matrices in order to permit the icc to compile it achieving high performance rates. To do that I need to know how I can write the code in high level c code in order to force the compiler to do what I want. I did it with the intrinsics directives but this implies to take into account for example the number of available SIMD register and this can be done by the compiler. This is the reason I have to try to use only high level c code without intrinsics. Do you have any idea on how I can get that without any intrinsic directive?

Thanks.

Dale_S_Intel · ‎10-20-2010

Sorry for missing this post earlier. I've been playing around with this and it looks like if you rewrite all the unrolled bits, i.e. replace the blocks of 4 assignments with a loop then the vectorizer will vectorize all of them, though in some cases it thinks it might be inefficient and needs encouragement from "#pragma vector always". Here's the code I came up with, based on your first version above:

[bash]#include 
#include 


void multiply(const float **restrict A,const float **restrict B,float **restrict C, int dim1, int dim2, int dim3)
{
    int i, j, k;  
     float C1[4], C2[4], B1[4], A1, A2;  
        for (j = 0; j < dim3-7; j+=4)  
        {  
            #pragma vector aligned  
            #pragma vector always  
            for (i = 0; i < dim1-5; i+=2)  
            {  
		int dalei;

		for (dalei=0; dalei<4; dalei++)
			C1[dalei] = C[j+dalei];

		for (dalei=0; dalei<4; dalei++)
			C2[dalei] = C[i+1][j+dalei];

                #pragma vector aligned  
                #pragma vector always  
                for (k = 0; k < dim2; k++)  
                {  
		    for (dalei=0; dalei<4; dalei++)
			B1[dalei] = B[j+dalei];

                    A1 = A;  
                      
                      
		    for (dalei=0; dalei<4; dalei++)
                            C1[dalei] += A1*B1[dalei];  
                              
                    A2= A[i+1];  
                          
		    for (dalei=0; dalei<4; dalei++)
                            C2[dalei] += A2*B1[dalei];  
                }  
                #pragma vector always
		for (dalei=0; dalei<4; dalei++)
                    C[j+dalei] = C2[dalei];  
                  
                #pragma vector always
		for (dalei=0; dalei<4; dalei++)
                    C[i+1][j+dalei] = C1[dalei];  
            }  
        }  
}  
[/bash]

When I compile this I get:

[bash]$ icc -g -O2 -fast -fno-alias -restrict -xSSE3 -vec-report3 -S v.c
v.c(9): (col. 9) remark: loop was not vectorized: not inner loop.
v.c(13): (col. 13) remark: loop was not vectorized: not inner loop.
v.c(17): (col. 3) remark: LOOP WAS VECTORIZED.
v.c(20): (col. 3) remark: LOOP WAS VECTORIZED.
v.c(25): (col. 17) remark: loop was not vectorized: not inner loop.
v.c(27): (col. 7) remark: LOOP WAS VECTORIZED.
v.c(33): (col. 7) remark: LOOP WAS VECTORIZED.
v.c(38): (col. 7) remark: LOOP WAS VECTORIZED.
v.c(42): (col. 3) remark: LOOP WAS VECTORIZED.
v.c(46): (col. 3) remark: LOOP WAS VECTORIZED.
[/bash]

So it looks like the vectorize is missing some of the block code, but if you convert it back to loops it gets it. Does that do what you are looking for?

windigo84 · ‎10-21-2010

Yes Dale, this is what I was looking for, thanks for the help.