Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Problems with loop interchange optimization

windigo84
Beginner
Hi all,
I'm testing the vectorization capabilities of the icc compiler on a matrix-multiply kernel. Depending on the order of the loops, the compiler either vectorizes the inner loop or it doesn't, so I'm compiling with -O3 to get icc to perform a loop interchange and thus always be able to vectorize the inner loop of the matrix multiply. I declare the matrices inside main and perform the multiplication there too (to avoid any need for interprocedural information). Here is the code:
[cpp]#include <stdlib.h>

int main()
{
	int dimi = 148;
	int dimj = 148;
	int dimk = 148;

	int i, j, k;

	/* each matrix is an array of row pointers into one contiguous block */
	float **c = (float **)malloc(sizeof(float *) * dimi);
	c[0] = (float *)malloc(dimi * dimj * sizeof(float));
	for (i = 1; i < dimi; i++)
		c[i] = c[0] + i * dimj;

	float **b = (float **)malloc(sizeof(float *) * dimk);
	b[0] = (float *)malloc(dimk * dimj * sizeof(float));
	for (i = 1; i < dimk; i++)
		b[i] = b[0] + i * dimj;

	float **a = (float **)malloc(sizeof(float *) * dimi);
	a[0] = (float *)malloc(dimi * dimk * sizeof(float));
	for (i = 1; i < dimi; i++)
		a[i] = a[0] + i * dimk;

	inic_m(a, dimi, dimk);
	inic_m2(b, 2.0, dimk, dimj);
	inic_m3(c, 3.1, dimi, dimj);

	for (i = 0; i < dimi; i++)
	{
		for (k = 0; k < dimk; k++)
		{
			for (j = 0; j < dimj; j++)
			{
				c[i][j] += a[i][k] * b[k][j];
			}
		}
	}

	return 0;
}
[/cpp]
But the compiler doesn't perform the loop interchange: even with the -O3 optimization option, icc leaves the loops in the order I wrote them. Changing the order of the loops by hand, I get different output from -vec-report3. For the orders JIK, JKI, KJI and IJK, the -vec-report3 option returns (line 116 is the inner loop):
[bash]
matrizXmatrizTests.c(112) : (col. 2) remark: loop was not vectorized: not inner loop.
matrizXmatrizTests.c(114) : (col. 3) remark: loop was not vectorized: not inner loop.
matrizXmatrizTests.c(116) : (col. 4) remark: vector dependence: assumed FLOW dependence between reference at line 118 and reference at line 118.
matrizXmatrizTests.c(116) : (col. 4) remark: vector dependence: assumed FLOW dependence between reference at line 118 and reference at line 118.
matrizXmatrizTests.c(116) : (col. 4) remark: vector dependence: assumed FLOW dependence between reference at line 118 and reference at line 118.
matrizXmatrizTests.c(116) : (col. 4) remark: vector dependence: assumed FLOW dependence between reference at line 118 and reference at line 118.
matrizXmatrizTests.c(116) : (col. 4) remark: vector dependence: assumed FLOW dependence between reference at line 118 and reference at line 118.
matrizXmatrizTests.c(116) : (col. 4) remark: vector dependence: assumed FLOW dependence between reference at line 118 and reference at line 118.
matrizXmatrizTests.c(116) : (col. 4) remark: vector dependence: assumed FLOW dependence between reference at line 118 and reference at line 118.
matrizXmatrizTests.c(116) : (col. 4) remark: vector dependence: assumed FLOW dependence between reference at line 118 and reference at line 118.
matrizXmatrizTests.c(116) : (col. 4) remark: loop was not vectorized: existence of vector dependence.
[/bash]

For the orders KIJ and IKJ I get:
[bash]matrizXmatrizTests.c(112) : (col. 2) remark: loop was not vectorized: not inner loop.
matrizXmatrizTests.c(114) : (col. 3) remark: loop was not vectorized: not inner loop.
matrizXmatrizTests.c(116) : (col. 4) remark: LOOP WAS VECTORIZED.
[/bash]
Why is icc not able to do the loop interchange? Is icc unable to perform loop interchange when the inner loop works through pointers? I'm lost: I've searched the icc documentation for some answer about why icc doesn't interchange the loops, but I found nothing. I need to test the capabilities of the icc compiler for a performance study of compilers and nested loops. I'm working on a Linux server with icc (ICC) 9.1, but soon I will have Intel C++ Compiler Professional Edition 11.1. Thanks in advance.
Quoc-An_L_Intel
Moderator

>>> I'm working on a Linux server with icc (ICC) 9.1

Have you tried the latest 11.1 compiler? Product evaluation is free for 30 days.

Dale_S_Intel
Employee
I haven't been able to reproduce your results with the 9.1 compiler, but I do get good results with the latest 11.1 compiler, with a few caveats. First, I had to add "#pragma ivdep" on at least the inner loop, because the compiler couldn't tell that the pointers were distinct (a, b and c are all pointers to pointers, so the "inic_*" functions might have changed some of them). Second, I could only get it working on IA32; for Intel64 the compiler seems stymied by the pointer arithmetic mixing 32-bit ints and 64-bit pointers. Having said that, if I add the pragma and compile for IA32, it vectorizes the inner loop for the original IKJ case, and if I flip the loops around, it interchanges them so that J is the inner loop, which I believe is what you want. Since the 9.1 compiler, the loop-restructuring optimizations have become much more integrated with the vectorizer, so that's not surprising. I'll look into what it takes to get it working for Intel64, but other than that, does that answer your question? (See the sketch below for where the pragma goes.)
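
To be concrete, a minimal sketch of the pragma placement, reusing the array accesses and index variables from your original IKJ kernel:
[cpp]for (i = 0; i < dimi; i++) {
	for (k = 0; k < dimk; k++) {
		/* ivdep: tell the compiler to ignore its assumed
		   (unproven) dependences in the loop that follows */
		#pragma ivdep
		for (j = 0; j < dimj; j++)
			c[i][j] += a[i][k] * b[k][j];
	}
}
[/cpp]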

Thanks for the concise test case!

Dale
windigo84
Beginner

Thanks Dale, that answers my question. I will wait until my lab has the new Linux server with icc 11.1 installed among the testing servers and then try your solution. Meanwhile I keep trying with the 9.1 version, because it is a little strange that it does not interchange the loops of the matrix multiplication (interchange is also useful for data locality, not only for vectorization).
About the inic_* functions: I put them there to simplify the code, but then I realized the problem you mentioned: they could change the pointers, so the compiler cannot know whether they point to the same memory. I therefore tested two fixes: compiling with the -ip option, and inlining the init loops in main; icc didn't interchange in either case. I also tried your #pragma ivdep suggestion (again with the init loops inlined in main) and it still didn't interchange (with icc 9.1):
[cpp]#include <stdlib.h>

int main()
{
	int dimi = 148;
	int dimj = 148;
	int dimk = 148;

	int i, j, k;

	float **c = (float **)malloc(sizeof(float *) * dimi);
	c[0] = (float *)malloc(dimi * dimj * sizeof(float));
	for (i = 1; i < dimi; i++)
		c[i] = c[0] + i * dimj;

	float **b = (float **)malloc(sizeof(float *) * dimk);
	b[0] = (float *)malloc(dimk * dimj * sizeof(float));
	for (i = 1; i < dimk; i++)
		b[i] = b[0] + i * dimj;

	float **a = (float **)malloc(sizeof(float *) * dimi);
	a[0] = (float *)malloc(dimi * dimk * sizeof(float));
	for (i = 1; i < dimi; i++)
		a[i] = a[0] + i * dimk;

	/* init loops inlined from the inic_* functions */
	for (i = 0; i < dimi; i++)
		for (j = 0; j < dimk; j++)
			a[i][j] = (i + j) / 2.0;

	for (i = 0; i < dimk; i++)
		for (j = 0; j < dimj; j++)
			b[i][j] = 2.0;

	for (i = 0; i < dimi; i++)
		for (j = 0; j < dimj; j++)
			c[i][j] = 3.1;

	for (j = 0; j < dimj; j++)
	{
		for (k = 0; k < dimk; k++)
		{
			#pragma ivdep
			for (i = 0; i < dimi; i++)
			{
				c[i][j] += a[i][k] * b[k][j];
			}
		}
	}

	return 0;
}[/cpp]
For all the variants where the J loop is not the inner loop, the compiler responds (line 130 is the inner loop):
[bash]matrizXmatrizTests.c(122) : (col. 2) remark: loop was not vectorized: not inner loop.
matrizXmatrizTests.c(125) : (col. 3) remark: loop was not vectorized: not inner loop.
matrizXmatrizTests.c(130) : (col. 5) remark: loop was not vectorized: dereference too complex.
[/bash]
I'm still confused; as I said, the loops can be interchanged to exploit data locality better, not only for vectorization. As far as I know, icc was able to perform this kind of optimization before the 9.1 version. I understand that the 11.1 version has better optimizations than 9.1, but matrix multiply is one of the nested-loop kernels that is easy to optimize at compile time, so I still believe I'm doing something wrong ;).
On the other hand, I have to try Intel64 (I'm also trying some solutions with inline asm and SSE3 instructions on the new Nehalem servers; they have 8 extra SIMD registers that can only be used in Intel64 mode), so if you find the solution please let me know; if I find it, I will add it to the thread too ;).
Thanks,
Alejandro
PS: sorry for my bad English ;), this stuff is hard to explain...
Dale_S_Intel
Employee

>>> I'm still confused; as I said, the loops can be interchanged to exploit data locality better, not only for vectorization. As far as I know, icc was able to perform this kind of optimization before the 9.1 version.


That's consistent with my understanding. The integration between our loop restructuring and vectorization was enhanced considerably (I think in 10.0), so it doesn't surprise me that before that point it failed to interchange strictly for vectorization purposes.

>>> On the other hand, I have to try Intel64 (I'm also trying some solutions with inline asm and SSE3 instructions on the new Nehalem servers; they have 8 extra SIMD registers that can only be used in Intel64 mode), so if you find the solution please let me know; if I find it, I will add it to the thread too ;).

I would strongly encourage you to use SSE intrinsic functions rather than inline asm: inline asm is hard to maintain and wreaks havoc with the compiler's ability to optimize in and around the asm code, while intrinsics leave the compiler free to work (see the sketch below). I'll keep looking into the Intel64 problem.
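
For instance, the inner update of your kernel might look something like this with intrinsics; just a sketch, assuming dimj is a multiple of 4 and using unaligned loads/stores:
[cpp]#include <xmmintrin.h>	/* SSE intrinsics */

/* one (i, k) step of the IKJ product, 4 floats at a time */
__m128 aik = _mm_set1_ps(a[i][k]);		/* broadcast a[i][k] to all 4 lanes */
for (j = 0; j < dimj; j += 4) {
	__m128 bv = _mm_loadu_ps(&b[k][j]);	/* load 4 floats of b's row k */
	__m128 cv = _mm_loadu_ps(&c[i][j]);	/* load 4 floats of c's row i */
	cv = _mm_add_ps(cv, _mm_mul_ps(aik, bv));
	_mm_storeu_ps(&c[i][j], cv);		/* write the 4 results back */
}
[/cpp]
Unlike an asm block, the compiler can still schedule and optimize around code like this.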

Dale

Dale_S_Intel
Employee
I've had some luck getting the MM loop to vectorize by using the option "-opt-subscript-in-range". According to the help page (icc -help):

-[no-]opt-subscript-in-range
assumes no overflows in the intermediate computation of the subscripts

This allows the compiler to convert the 32-bit index variable to 64 bits for the pointer calculation without worrying about overflow/underflow problems (or something like that, anyway). But even with this option, the simpler init loops ("c[i] = c[0] + i * dimj;" etc.) don't get vectorized.
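
For reference, I'm compiling with something along these lines (a hypothetical command line; the source file name is taken from your vec-report output):
[bash]icc -O3 -opt-subscript-in-range -vec-report3 matrizXmatrizTests.c
[/bash]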

If you change all the 'int's to 'long int's, then everything seems to vectorize appropriately, so in the worst case you could select those types with an ifdef (on "__x86_64__", I think) and it should vectorize. It's a lot easier than inline asm, or even intrinsics.
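
Something like this, say (index_t is just an illustrative name, not anything the compiler requires):
[cpp]/* use a pointer-sized index on Intel64 so the subscript math
   involves no 32-to-64-bit conversions */
#ifdef __x86_64__
typedef long int index_t;	/* 64-bit on Intel64 Linux */
#else
typedef int index_t;		/* plain int is fine on IA32 */
#endif

index_t i, j, k;
index_t dimi = 148, dimj = 148, dimk = 148;
[/cpp]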

Does that help?

Thanks!
Dale
windigo84
Beginner
Yes Dale,
Thanks a lot. I'm just starting the tests, but this will be useful for my work.
Thanks again,
Alejandro
Dale_S_Intel
Employee
Great! Let me know if you find any other odd cases that you think should vectorize but don't. We're always eager for good test cases.

Thanks!

Dale
windigo84
Beginner
I have had some more problems with the new compiler.
As Dale explained, after changing the indices to long int, all the cases interchange the loop order (if needed) and vectorize:
[bash]./archivosC/iccOptimized/matrizXmatrizJIK.c(20): (col. 2) remark: PERMUTED LOOP WAS VECTORIZED.
[/bash]
All the cases except one: when the order of the loops is JKI:
[cpp]#include <stdlib.h>

int main(int argc, char *argv[])
{
	long int dimi, dimj, dimk;

	dimi = 144;
	dimj = 144;
	dimk = 144;

	long int i, j, k;

	float **c = (float **)malloc(sizeof(float *) * dimi);
	c[0] = (float *)malloc(dimi * dimj * sizeof(float));
	for (i = 1; i < dimi; i++)
		c[i] = c[0] + i * dimj;

	float **b = (float **)malloc(sizeof(float *) * dimk);
	b[0] = (float *)malloc(dimk * dimj * sizeof(float));
	for (i = 1; i < dimk; i++)
		b[i] = b[0] + i * dimj;

	float **a = (float **)malloc(sizeof(float *) * dimi);
	a[0] = (float *)malloc(dimi * dimk * sizeof(float));
	for (i = 1; i < dimi; i++)
		a[i] = a[0] + i * dimk;

	/* init loops inlined from the inic_* functions */
	for (i = 0; i < dimi; i++)
		for (j = 0; j < dimk; j++)
			a[i][j] = (i + j) / 2.0;

	for (i = 0; i < dimk; i++)
		for (j = 0; j < dimj; j++)
			b[i][j] = 2.0;

	for (i = 0; i < dimi; i++)
		for (j = 0; j < dimj; j++)
			c[i][j] = 3.1;

	for (j = 0; j < dimj; j++)
	{
		for (k = 0; k < dimk; k++)
		{
			#pragma ivdep
			for (i = 0; i < dimi; i++)
			{
				c[i][j] += a[i][k] * b[k][j];
			}
		}
	}

	return 0;
}[/cpp]
The icc compiler's -vec-report3 returns (line 171 is the inner "i" loop):
[bash]./archivosC/iccOptimized/matrizXmatrizJKI.c(171): (col. 4) remark: loop was not vectorized: not inner loop.
./archivosC/iccOptimized/matrizXmatrizJKI.c(171): (col. 4) remark: loop was not vectorized: not inner loop.
./archivosC/iccOptimized/matrizXmatrizJKI.c(171): (col. 4) remark: loop was not vectorized: not inner loop.
./archivosC/iccOptimized/matrizXmatrizJKI.c(171): (col. 4) remark: loop was not vectorized: not inner loop.
./archivosC/iccOptimized/matrizXmatrizJKI.c(166): (col. 2) remark: loop was not vectorized: not inner loop.
./archivosC/iccOptimized/matrizXmatrizJKI.c(168): (col. 3) remark: loop was not vectorized: unsupported data type.
[/bash]
I can see that icc is interchanging the loops, because the loop at line 171 is no longer the inner loop. But I don't understand why this order, JKI, produces a message about an unsupported data type. With any other loop order icc manages to vectorize the matrix product, but with JKI it doesn't. Does anybody know why this happens?
Thanks in advance,
Alejandro