Loops were both vectorized and not vectorized?

Matt_Thompson · ‎01-05-2010

I'm currently working on speeding up some code I have. In doing so, I've found that my modifications run faster with PGI but slower with ifort. In the end, this might not be that important, but I thought I'd see if I could figure out why this is happening. My first guess (probably incorrect) was that PGI might be vectorizing the code more aggressively, so I turned on vec-report with ifort to see. In doing so, though, I encountered something...odd.

To wit, the options sent to the compiler are:

-O3 -ftz -align all -fno-alias -vec-report3 -convert big_endian -fp-model precise -align dcommons

where most of these are due to the rather large make structure of the complete codebase, but perhaps one of them is causing my oddity. When I compile the file in question and look at the vec-report for the loops I've played with (lines 2594 and 2601), I see this:

GC.F90(2595): (col. 10) remark: loop was not vectorized: not inner loop.
GC.F90(2594): (col. 13) remark: LOOP WAS VECTORIZED.
GC.F90(2595): (col. 10) remark: loop was not vectorized: not inner loop.
GC.F90(2594): (col. 13) remark: loop was not vectorized: vectorization possible but seems inefficient.
GC.F90(2600): (col. 42) remark: loop was not vectorized: not inner loop.
GC.F90(2602): (col. 13) remark: loop was not vectorized: not inner loop.
GC.F90(2601): (col. 16) remark: LOOP WAS VECTORIZED.
GC.F90(2602): (col. 13) remark: loop was not vectorized: not inner loop.
GC.F90(2601): (col. 16) remark: loop was not vectorized: vectorization possible but seems inefficient.

The parts that confuse me are the remarks for lines 2594 and 2601, each of which gets both a "LOOP WAS VECTORIZED" and a "loop was not vectorized" message. Were these loops vectorized or not? Are there two codepaths being generated? FYI, this code is being compiled on a Nehalem system, so am I seeing both Nehalem-specific and generic code generation?

Thanks for any help with this,
Matt

TimP · ‎01-05-2010

fp-model precise suppresses vectorization which may involve changes in numerical roundoff, including sum and dot product reductions, as well as most vector math library calls. Depending on your source code, additional vectorizations may be deemed "efficient" if you permit use of all instructions available on your CPU (options -xhost, -msse3, -msse4.1). Perhaps you set some equivalent option in your PGI build.
I would think that setting -fp-model precise would cancel the preceding -ftz, but that would not bear on your question.
In order to get the correctness features of fp-model precise without disabling optimizations, you would set
-assume protect_parens -prec-div -prec-sqrt.
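
To illustrate why a vectorized sum reduction can change roundoff, here is a minimal sketch. It is not taken from GC.F90; the program name, the array contents, and the 4-lane width are all assumptions chosen for illustration. It compares a running sum in strict source order with the interleaved partial sums that a 4-wide vector reduction would effectively compute.

```fortran
program sum_order
  implicit none
  integer, parameter :: n = 1000000   ! hypothetical size, divisible by 4
  real :: x(n), s_seq, part(4)
  integer :: i

  ! Values of mixed magnitude make rounding differences easier to see.
  do i = 1, n
     x(i) = 1.0 / real(i)
  end do

  ! Strict source order: a single running sum.
  s_seq = 0.0
  do i = 1, n
     s_seq = s_seq + x(i)
  end do

  ! Effect of a 4-wide vectorized reduction:
  ! four interleaved partial sums, combined at the end.
  part = 0.0
  do i = 1, n, 4
     part = part + x(i:i+3)
  end do

  print *, 'sequential sum :', s_seq
  print *, '4-lane sum     :', sum(part)
end program sum_order
```

Built without optimization, or with -fp-model precise, the first loop should keep source order, and the two printed values may then differ in the last digits. That difference is the kind of change the precise model is meant to rule out.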

Ron_Green · ‎01-05-2010

-fp-model precise does, in the absence of -ftz, set -no-ftz. However, if you explicitly specify -ftz along with -fp-model precise, it overrides that aspect of the precise floating-point model and flush-to-zero is enabled. FTZ is NOT part of the precise model by default, but as in this case one can override that.

As Tim mentioned, -O3 is good but I'd throw in -xhost also.

ron
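
One way to check which setting actually ends up in effect for a given set of flags is a small run-time test. This is only a sketch: the program name is made up, and it assumes the flush-to-zero mode is set when the main program starts, so the flags under test have to be used when compiling this file.

```fortran
program ftz_check
  implicit none
  real, volatile :: x   ! volatile so the division is done at run time

  x = tiny(1.0)         ! smallest normal single-precision number
  x = x / 2.0           ! subnormal under IEEE gradual underflow

  if (x == 0.0) then
     print *, 'subnormal result was flushed to zero'
  else
     print *, 'subnormal result kept: ', x
  end if
end program ftz_check
```

Building it once with -fp-model precise alone and once with -ftz added should show whether the explicit -ftz takes precedence on a given compiler version.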

Ron_Green · ‎01-05-2010

One other note: -fp-model precise on Intel compilers is equivalent to -Kieee on PGI. So if you are not using -Kieee on PGI you're comparing apples to oranges.

Matt_Thompson · ‎01-05-2010

Quoting - Ronald W. Green (Intel)

fp-model precise does, in the absence of -ftz, set -no-ftz. However, if you explicitly call out -ftz along with -fp-model precise it will override that aspect of floating point precise model and do the ftz. FTZ is NOT precise by default, but as in this case one can override that.

As Tim mentioned, -O3 is good but I'd throw in -xhost also.

ron

I'm not the final arbiter of the compile options, so I'm loath to remove -fp-model precise, as I'm sure there was a good reason for adding it. But I'm not averse to adding flags, so I decided to use all of your suggestions (well, only -xhost from the instruction-set options, since it seems to conflict with or supersede the -m flags):

-O3 -ftz -align all -fno-alias -vec-report3 -xhost -assume protect_parens -prec-div -prec-sqrt -convert big_endian -fp-model precise -align dcommons

ETA: Oops. It was pointed out to me that Tim was most likely saying to substitute '-assume protect_parens -prec-div -prec-sqrt' for '-fp-model precise'. So instead I'm using:

-O3 -vec-report3 -xHost -assume protect_parens -prec-div -prec-sqrt -ftz -align all -fno-alias -convert big_endian -align dcommons

However, the results are the same. In doing so, I now see:

GC.F90(2595): (col. 10) remark: loop was not vectorized: not inner loop.
GC.F90(2594): (col. 13) remark: LOOP WAS VECTORIZED.
GC.F90(2600): (col. 42) remark: loop was not vectorized: not inner loop.
GC.F90(2602): (col. 13) remark: loop was not vectorized: not inner loop.
GC.F90(2601): (col. 16) remark: LOOP WAS VECTORIZED.
GC.F90(2602): (col. 13) remark: loop was not vectorized: not inner loop.
GC.F90(2601): (col. 16) remark: loop was not vectorized: vectorization possible but seems inefficient.

It looks like one of the loops was vectorized, though the other is conflicted still.

So, if I see both a "was vectorized" and a "was not vectorized" should I assume that the "not" wins out? I just ask because it's not just in this part of the code I see the conflicting statements, but all throughout the report:

GC.F90(1658): (col. 11) remark: loop was not vectorized: not inner loop.
GC.F90(1658): (col. 11) remark: loop was not vectorized: not inner loop.
GC.F90(1658): (col. 11) remark: LOOP WAS VECTORIZED.
GC.F90(1658): (col. 11) remark: loop was not vectorized: not inner loop.
GC.F90(1658): (col. 11) remark: loop was not vectorized: vectorization possible but seems inefficient.

for the loop:

Line 1658: FSWNA = 0.0

where FSWNA is a (pointer to a) 3-D array.

Matt_Thompson · ‎01-05-2010

Quoting - Ronald W. Green (Intel)

One other note: -fp-model precise on Intel compilers is equivalent to -Kieee on PGI. So if you are not using -Kieee on PGI you're comparing apples to oranges.

Oops. Didn't see this until I replied. With PGI I am using:

-fast -Kieee -Ktrap=fp

along with -Minfo and other flags. -fast might be much more aggressive than the ifort options being used:

> pgfortran -fast -help
Reading rcfile /opt/pgi/linux86-64/10.0/bin/.pgfortranrc
-fast Common optimizations; includes -O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline
== -Mvect=sse -Mscalarsse -Mcache_align -Mflushz -Mpre

TimP · ‎01-05-2010

Quoting - thematt

So, if I see both a "was vectorized" and a "was not vectorized" should I assume that the "not" wins out? I just ask because it's not just in this part of the code I see the conflicting statements, but all throughout the report:

GC.F90(1658): (col. 11) remark: loop was not vectorized: not inner loop.
GC.F90(1658): (col. 11) remark: loop was not vectorized: not inner loop.
GC.F90(1658): (col. 11) remark: LOOP WAS VECTORIZED.
GC.F90(1658): (col. 11) remark: loop was not vectorized: not inner loop.
GC.F90(1658): (col. 11) remark: loop was not vectorized: vectorization possible but seems inefficient.

for the loop:

Line 1658: FSWNA = 0.0

where FSWNA is a (pointer to a) 3-D array.

I agree these messages are difficult to interpret with confidence. Apparently, the 3-D assignment has been expanded into nested loops, and the inner loop has been vectorized. If it turns out to be a site where significant time is spent, it may require a closer look, for example examining the asm code.
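
As a rough sketch of that expansion (the extents, the target array, and the program name are made up for illustration; only the name FSWNA and the assignment come from the post), the whole-array assignment corresponds to a loop nest like the one below. Which remark belongs to which loop level is my guess, and the sketch does not explain why the innermost loop is reported twice.

```fortran
program zero_3d
  implicit none
  integer, parameter :: ni = 8, nj = 4, nk = 3   ! hypothetical extents
  real, target  :: storage(ni, nj, nk)
  real, pointer :: fswna(:,:,:)
  integer :: i, j, k

  fswna => storage

  ! Whole-array form, as on the source line reported at GC.F90(1658).
  fswna = 0.0

  ! Roughly the loop nest the compiler has to generate for it:
  do k = 1, nk            ! outer loop: "not inner loop"
     do j = 1, nj         ! middle loop: "not inner loop"
        do i = 1, ni      ! contiguous first index: vectorization candidate
           fswna(i, j, k) = 0.0
        end do
     end do
  end do

  print *, 'all zero: ', all(fswna == 0.0)
end program zero_3d
```

All of the remarks carry the same line number because the nest comes from a single source statement, so only the innermost (first-index) loop can be the one reported as vectorized.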

TimP · ‎01-05-2010

Quoting - thematt

Oops. Didn't see this until I replied. With PGI I am using:

-fast -Kieee -Ktrap=fp

ifort has a -fast option similar to the pgf90 one. It includes -xhost -ipo -O3.