Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Optimization Questions - Analysis Remarks

Yolanda_Maxwell
Beginner
520 Views
I'm using the Intel XE Compiler 12.0 for C++ with Visual Studio 2008 and am trying very hard to optimize my application. At this point I'm working on vectorization and read a lot of information before starting the effort. I have also been using the guided optimization tool. My question concerns the remarks that the tool outputs.
Example:
1>H:\\Projects\\Repository-nonSVN\\PPI_InQuarters\\PPI-90\\Build_LUTables.cpp(94): (col. 3) remark: LOOP WAS VECTORIZED.
1>H:\\Projects\\Repository-nonSVN\\PPI_InQuarters\\PPI-90\\Build_LUTables.cpp(94): (col. 3) remark: loop skipped: multiversioned.
OR
1>H:\\Projects\\Repository-nonSVN\\PPI_InQuarters\\PPI-90\\Build_LUTables.cpp(301): (col. 3) remark: LOOP WAS VECTORIZED.
1>H:\\Projects\\Repository-nonSVN\\PPI_InQuarters\\PPI-90\\Build_LUTables.cpp(301): (col. 3) remark: PARTIAL LOOP WAS VECTORIZED.
Please notice that there are two examples, and in each the remarks indicate two different things happening to the same line of code. I'm having a hard time figuring out what they are really saying. Was the loop vectorized or not? And in the second case, was it only partially vectorized or completely vectorized?
0 Kudos
12 Replies
jimdempseyatthecove
Honored Contributor III
520 Views
Without seeing the source code of line 301 (+/- a few lines) it is hard to say.
Does line 301 contain a MACRO or template use where nested loops are used?

Jim Dempsey
0 Kudos
Yolanda_Maxwell
Beginner
520 Views
I have since made several changes to the code and no longer seem to get the two different messages at line 301. I'm still getting the messages at line 94 and receive them at several other locations. If I recall correctly, the messages at 301 were at the start of the first of two inner loops, which are not nested in each other but run one after the other. They were something like this:
double LocationSecondary[16][1024] ;
double LocationPrim[1024] ; // Init was done earlier
double Delta[8] ;// Init was done earlier
for(int ii= 0 ; ii< 1024; ii++, LocationPrim++)
{
    for(int jj = 0 ; jj< 8; jj++, LocationSecondary++ )
    {
        *LocationSecondary = *LocationPrim + Delta[jj] ;
    }
    for(int jj = 0 ; jj< 8 ; jj++,LocationSecondary++)
    {
        *LocationSecondary = *LocationPrim - Delta[jj] ;
    }
}
Line 301 would be at the first of the inner for loops.
For the other messages, am I understanding correctly that the loop was compiled in both a vectorized and a serial version, and that the vectorized one was then used?
0 Kudos
Om_S_Intel
Employee
520 Views
I could not compile the code segment:

c:\>type tstcase.cpp

void foo()
{
    double LocationSecondary[16][1024] ;
    double LocationPrim[1024] ; // Init was done earlier
    double Delta[8] ; // Init was done earlier
    for(int ii= 0 ; ii< 1024; ii++, LocationPrim++)
    {
        for(int jj = 0 ; jj< 8; jj++, LocationSecondary++ )
        {
            *LocationSecondary = *LocationPrim + Delta[jj] ;
        }
        for(int jj = 0 ; jj< 8 ; jj++, LocationSecondary++)
        {
            *LocationSecondary = *LocationPrim - Delta[jj] ;
        }
    }
}

c:\>icl -c /Qvec-report3 tstcase.cpp
Intel C++ Intel 64 Compiler XE for applications running on Intel 64, Version 12.1.0.233 Build 20110811
Copyright (C) 1985-2011 Intel Corporation. All rights reserved.

tstcase.cpp
tstcase.cpp(7): error: expression must be a modifiable lvalue
  for(int ii= 0 ; ii< 1024; ii++, LocationPrim++)
  ^

tstcase.cpp(9): error: expression must be a modifiable lvalue
  for(int jj = 0 ; jj< 8; jj++, LocationSecondary++ )
  ^

tstcase.cpp(11): error: expression must be a modifiable lvalue
  *LocationSecondary = *LocationPrim + Delta[jj] ;
  ^

tstcase.cpp(13): error: expression must be a modifiable lvalue
  for(int jj = 0 ; jj< 8 ; jj++, LocationSecondary++)
  ^

tstcase.cpp(15): error: expression must be a modifiable lvalue
  *LocationSecondary = *LocationPrim - Delta[jj] ;
  ^

compilation aborted for tstcase.cpp (code 2)

0 Kudos
jimdempseyatthecove
Honored Contributor III
520 Views
double LocationSecondary[16][1024] ;

is an array, and an array name is not a modifiable lvalue, so it cannot be incremented. Use a pointer to the array type instead:

typedef double d16x1024[16][1024];

d16x1024* LocationSecondary = new d16x1024[n]; // allocate n copies of d16x1024
...
for(int jj = 0 ; jj< 8; jj++, LocationSecondary++ )

Now the above is valid.
***
However, LocationSecondary++ advances to the 2nd d16x1024 in the array
(and the original pointer is modified).

Jim Dempsey
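
A minimal sketch of the distinction described above, using a smaller stand-in type. The names `Block`, `kRows`, `kCols`, `block_pointer_step` and `sum_rows` are hypothetical and not from the original code. It illustrates that a pointer to an array type can be incremented, that each increment steps over one whole array, and that a pointer to a row is the way to walk a 2D array one row at a time.

```cpp
#include <cstddef>

// Hypothetical, smaller stand-in for the 16x1024 array in the thread.
const std::size_t kRows = 4;
const std::size_t kCols = 8;
typedef double Block[kRows][kCols];

// Returns how many bytes a Block* advances when it is incremented once.
std::size_t block_pointer_step() {
    Block* base = new Block[2];   // two Blocks stored back to back
    Block* p = base;
    ++p;                          // legal: p is a pointer, not an array name
    std::size_t step = static_cast<std::size_t>(
        reinterpret_cast<char*>(p) - reinterpret_cast<char*>(base));
    delete[] base;
    return step;
}

// Walks one Block row by row with a pointer-to-row, which may be incremented.
double sum_rows(Block& b) {
    double total = 0.0;
    double (*row)[kCols] = b;     // b decays to a pointer to its first row
    for (std::size_t r = 0; r < kRows; ++r, ++row) {
        for (std::size_t c = 0; c < kCols; ++c) {
            total += (*row)[c];
        }
    }
    return total;
}
```

Writing `++b` on the `Block` itself would be rejected with the same "modifiable lvalue" error shown in the compiler output earlier in the thread.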
0 Kudos
levicki
Valued Contributor I
520 Views
I wonder if that code can be rewritten like this:
[cpp]double LocationSecondary[16][1024];
double LocationPrimary[1024];
double Delta[8];

for (int i = 0; i < 1024; i++) {
	for (int j = 0; j < 8; j++) {
		LocationSecondary[j][i] = LocationPrimary[i] + Delta[j];
	}
	for (int j = 0; j < 8; j++) {
		LocationSecondary[8 + j][i] = LocationPrimary[i] - Delta[j];
	}
}
[/cpp]
What do you think Jim?

If what I did is correct, then I would also suggest the following change:
[cpp]double LocationSecondary[16][1024];
double LocationPrim[1024];
double Delta[16];	// copy delta[0-7] to delta[8-15] negated
			// so that you can use + operator for both

for (int i = 0; i < 1024; i++) {
	for (int j = 0; j < 16; j++) {
		LocationSecondary[j][i] = LocationPrim[i] + Delta[j];
	}
}
[/cpp]
Moreover, the loop order should be swapped, because stepping through LocationSecondary with a large stride is inefficient.

Finally, the answer to Yolanda's question about the remarks is that the compiler sometimes applies loop transformations, so that a loop gets split into fully vectorizable and partially vectorizable code if the compiler determines it can do that safely -- that is why you get two remarks for the same loop.
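
As a rough sketch of the extended-Delta suggestion in this post, the two functions below compare the two-loop form with the single-loop form. It assumes the intended indexing is `LocationSecondary[j][i] = LocationPrim[i] + Delta[j]`, stores the 2D array flattened in a `std::vector`, and uses hypothetical names (`fill_two_loops`, `fill_extended`, `kN`, `kD`) that are not from the thread.

```cpp
#include <cstddef>
#include <vector>

const std::size_t kN = 1024;  // number of primary locations
const std::size_t kD = 8;     // number of deltas

// Two-loop form: rows 0..7 add Delta, rows 8..15 subtract it.
// out is a flattened [16][kN] array: element (row, i) is out[row * kN + i].
void fill_two_loops(std::vector<double>& out,
                    const std::vector<double>& prim,
                    const double (&delta)[kD]) {
    for (std::size_t i = 0; i < kN; ++i) {
        for (std::size_t j = 0; j < kD; ++j) {
            out[j * kN + i] = prim[i] + delta[j];
        }
        for (std::size_t j = 0; j < kD; ++j) {
            out[(kD + j) * kN + i] = prim[i] - delta[j];
        }
    }
}

// Single-loop form: Delta is extended to 16 entries, the second half negated,
// so one '+' covers both cases. The loops are also swapped so that the inner
// loop walks memory with unit stride.
void fill_extended(std::vector<double>& out,
                   const std::vector<double>& prim,
                   const double (&delta)[kD]) {
    double ext[2 * kD];
    for (std::size_t j = 0; j < kD; ++j) {
        ext[j] = delta[j];
        ext[kD + j] = -delta[j];
    }
    for (std::size_t j = 0; j < 2 * kD; ++j) {
        for (std::size_t i = 0; i < kN; ++i) {
            out[j * kN + i] = prim[i] + ext[j];
        }
    }
}
```

Both functions should produce identical arrays, since subtracting a value and adding its negation give the same floating-point result.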
0 Kudos
TimP
Honored Contributor III
520 Views
As Igor suggests, at -O3, the compiler is attempting to find a more efficient arrangement of these loops, accounting for the optimization remarks about multiple versions. In principle, if you can write source code which corresponds more closely with efficient vectorization, you may achieve even better results.
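
To illustrate what "multiple versions" of a loop can mean, here is a hand-written sketch of one common reason for versioning: a run-time check for overlapping pointers that selects between two copies of the same loop. This is not actual compiler output, and the checks the compiler really generates may differ; `no_overlap` and `add_delta` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true when [a, a+n) and [b, b+n) do not overlap in memory.
bool no_overlap(const double* a, const double* b, std::size_t n) {
    std::uintptr_t pa = reinterpret_cast<std::uintptr_t>(a);
    std::uintptr_t pb = reinterpret_cast<std::uintptr_t>(b);
    std::uintptr_t bytes = n * sizeof(double);
    return pa + bytes <= pb || pb + bytes <= pa;
}

// Computes dst[i] = src[i] + delta and reports which version of the loop ran:
// 1 = iterations known to be independent, 2 = conservative serial order.
int add_delta(double* dst, const double* src, double delta, std::size_t n) {
    if (no_overlap(dst, src, n)) {
        // Version 1: no iteration can affect another, so this copy of the
        // loop is a candidate for vectorization.
        for (std::size_t i = 0; i < n; ++i) {
            dst[i] = src[i] + delta;
        }
        return 1;
    }
    // Version 2: the ranges overlap, so iteration order must be preserved.
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i] + delta;
    }
    return 2;
}
```

With a scheme like this, a vectorization report can carry one remark per version of the same source line, which is consistent with seeing both "LOOP WAS VECTORIZED" and "loop skipped: multiversioned" for line 94.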
0 Kudos
jimdempseyatthecove
Honored Contributor III
520 Views
Igor,

I think it would be more xmm register efficient to use:

[cpp]double LocationSecondary[16][1024];
double LocationPrimary[1024];
double Delta[8];

for (int i = 0; i < 1024; i++) {
	for (int j = 0; j < 8; j++) {
		LocationSecondary[j][i] = LocationPrimary[i] + Delta[j];
		LocationSecondary[8 + j][i] = LocationPrimary[i] - Delta[j];
	}
}[/cpp]

With SSE, all of Delta (8 doubles) could be held in 4 xmm registers. In 32-bit mode, which has 8 xmm registers, this would leave 4 available for scratch. I agree with you that the user should consider:

double LocationSecondary[1024][16]; // swap indexes
double LocationPrimary[1024];
double Delta[8];

for (int i = 0; i < 1024; i++) {
    for (int j = 0; j < 8; j++) {
        LocationSecondary[i][j] = LocationPrimary[i] + Delta[j];
        LocationSecondary[i][8 + j] = LocationPrimary[i] - Delta[j];
    }
}

provided that swapping the indexes does not introduce a performance penalty elsewhere.

** run a test to confirm performance change **

Jim Dempsey
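
One possible way to run the suggested test is sketched below. It assumes flattened `std::vector` storage for both layouts and uses `std::chrono::steady_clock` for timing; all names (`fill_slot_major`, `fill_point_major`, `time_seconds`, `kPoints`, `kSlots`) are hypothetical, and real measurements depend on compiler options and hardware.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

const std::size_t kPoints = 1024;
const std::size_t kSlots = 16;

// Original layout [16][1024]: element (slot, i) is out[slot * kPoints + i].
void fill_slot_major(std::vector<double>& out,
                     const std::vector<double>& prim, const double* delta) {
    for (std::size_t i = 0; i < kPoints; ++i) {
        for (std::size_t j = 0; j < 8; ++j) {
            out[j * kPoints + i] = prim[i] + delta[j];
            out[(8 + j) * kPoints + i] = prim[i] - delta[j];
        }
    }
}

// Swapped layout [1024][16]: element (i, slot) is out[i * kSlots + slot].
void fill_point_major(std::vector<double>& out,
                      const std::vector<double>& prim, const double* delta) {
    for (std::size_t i = 0; i < kPoints; ++i) {
        for (std::size_t j = 0; j < 8; ++j) {
            out[i * kSlots + j] = prim[i] + delta[j];
            out[i * kSlots + 8 + j] = prim[i] - delta[j];
        }
    }
}

// Calls fn `reps` times and returns the elapsed wall-clock time in seconds.
template <typename Fn>
double time_seconds(Fn fn, int reps) {
    std::chrono::steady_clock::time_point start =
        std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) {
        fn();
    }
    std::chrono::steady_clock::time_point stop =
        std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

A caller would time each layout with something like `time_seconds([&] { fill_point_major(out, prim, delta); }, 10000)` and compare the two figures, repeating the runs to smooth out noise.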
0 Kudos
levicki
Valued Contributor I
520 Views

This is a typical loop that would benefit from 3-operand syntax (since LocationPrimary could be reused without copying), and from AVX.

0 Kudos
Yolanda_Maxwell
Beginner
520 Views
Thank you to Jim, Igor and Tim for the suggestions and help.

I will have to look into AVX a little more; I'm not sure that it will benefit this project. I need to be able to run this application on existing Pentium D class hardware.
So, back to the question of analysis remarks: if, as Tim and Igor stated, the compiler is attempting to find a more efficient arrangement for the loops, and that accounts for the multiple remarks, which remark describes what really happens?
0 Kudos
TimP
Honored Contributor III
520 Views
When the compiler generates multiple versions of the loop, you probably have to resort to run-time profiling to find out which version is executed with your data set. This may give you an incentive to write a source code version which doesn't require as much modification to optimize, as the others suggested.
Pentium D excludes use of architecture options more recent than SSE3.
0 Kudos
levicki
Valued Contributor I
520 Views
All remarks apply. The loop has been split into fully and partially vectorizable statements, and they will be run sequentially because the compiler believes (based on cost modeling) that doing so will be faster than mixing the two.
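
The kind of split described here is usually called loop distribution. A hand-written sketch follows, assuming `a`, `acc` and `x` are distinct vectors of the same length; the function names are hypothetical and this is not the compiler's actual output. The first statement has independent iterations, while the second carries a dependence from one iteration to the next.

```cpp
#include <cstddef>
#include <vector>

// Fused form: both statements share one loop, as written in the source.
void fused(std::vector<double>& a, std::vector<double>& acc,
           const std::vector<double>& x) {
    for (std::size_t i = 1; i < x.size(); ++i) {
        a[i] = x[i] * 2.0;            // independent across iterations
        acc[i] = acc[i - 1] + x[i];   // depends on the previous iteration
    }
}

// Distributed form: the same work split into two loops run one after the
// other. The first loop is a straightforward vectorization candidate; the
// second keeps its serial dependence.
void distributed(std::vector<double>& a, std::vector<double>& acc,
                 const std::vector<double>& x) {
    for (std::size_t i = 1; i < x.size(); ++i) {
        a[i] = x[i] * 2.0;
    }
    for (std::size_t i = 1; i < x.size(); ++i) {
        acc[i] = acc[i - 1] + x[i];
    }
}
```

Because neither statement reads what the other writes, both forms compute the same values, and a report on the distributed form can carry a separate remark for each resulting loop while pointing at the same source line.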
0 Kudos