I'm currently working on speeding up some code I have. In doing so, I've found that my modifications run faster with PGI but slower with ifort. In the end, this might not be that important, but I thought I'd see if I could figure out a reason why this is happening. My first guess was (probably incorrectly) that PGI might be more aggressively vectorizing the code, so I turned on vec-report with ifort to see. In doing so, though, I encountered something...odd.
To wit, the options sent to the compiler are:
-O3 -ftz -align all -fno-alias -vec-report3 -convert big_endian -fp-model precise -align dcommons
where most of these are due to the rather large make structure of the complete codebase, but perhaps one of them is causing my oddity. When I compile the file in question and look at the vec-report for the loops I've played with (lines 2594 and 2601), I see this:
GC.F90(2595): (col. 10) remark: loop was not vectorized: not inner loop.
GC.F90(2594): (col. 13) remark: LOOP WAS VECTORIZED.
GC.F90(2595): (col. 10) remark: loop was not vectorized: not inner loop.
GC.F90(2594): (col. 13) remark: loop was not vectorized: vectorization possible but seems inefficient.
GC.F90(2600): (col. 42) remark: loop was not vectorized: not inner loop.
GC.F90(2602): (col. 13) remark: loop was not vectorized: not inner loop.
GC.F90(2601): (col. 16) remark: LOOP WAS VECTORIZED.
GC.F90(2602): (col. 13) remark: loop was not vectorized: not inner loop.
GC.F90(2601): (col. 16) remark: loop was not vectorized: vectorization possible but seems inefficient.
In italics, I've highlighted the parts that are confusing me. Were these loops vectorized or not? Are there two codepaths being generated? FYI, this code is being compiled on a Nehalem system, so am I seeing both Nehalem-specific and generic code generation?
Thanks for any help with this,
Matt
7 Replies
-fp-model precise suppresses any vectorization that could change numerical roundoff, including sum and dot-product reductions as well as most vector math library calls. Depending on your source code, additional vectorizations may be deemed "efficient" if you permit use of all instructions available on your CPU (options -xhost, -msse3, -msse4.1). Perhaps you set some equivalent option in your PGI build.
I would think that setting -fp-model precise would cancel the preceding -ftz, but that would not bear on your question.
In order to get the correctness features of fp-model precise without disabling optimizations, you would set
-assume protect_parens -prec-div -prec-sqrt.
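As an illustration of why a value-safe model declines to vectorize a sum reduction, here is a minimal Python sketch. It is not ifort output, and the function names, the lane count, and the data are made up for the example. A vectorized sum typically keeps one partial sum per SIMD lane and combines them at the end, which reorders the additions; floating-point addition is not associative, so the answer can change.

```python
import math


def sequential_sum(values):
    """Add the values strictly left to right, as scalar code would."""
    total = 0.0
    for v in values:
        total += v
    return total


def lane_sum(values, lanes=4):
    """Mimic a vectorized reduction: one partial sum per lane, combined at the end."""
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    total = 0.0
    for p in partial:
        total += p
    return total


# Illustrative data chosen so that the small terms are lost in different places.
data = [1.0e16, 1.0, -1.0e16, 1.0] * 4

print("sequential:", sequential_sum(data))
print("lane-wise: ", lane_sum(data))
print("exact:     ", math.fsum(data))
```

With IEEE double precision, the left-to-right sum gives 1.0, the four-lane sum gives 4.0, and the correctly rounded result is 8.0. Neither ordering is "right"; they are simply different, which is the kind of change in roundoff a precise floating-point model is meant to prevent.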
Quoting - tim18
-fp-model precise suppresses any vectorization that could change numerical roundoff, including sum and dot-product reductions as well as most vector math library calls. Depending on your source code, additional vectorizations may be deemed "efficient" if you permit use of all instructions available on your CPU (options -xhost, -msse3, -msse4.1). Perhaps you set some equivalent option in your PGI build.
I would think that setting -fp-model precise would cancel the preceding -ftz, but that would not bear on your question.
In order to get the correctness features of fp-model precise without disabling optimizations, you would set
-assume protect_parens -prec-div -prec-sqrt.
-fp-model precise does, in the absence of -ftz, set -no-ftz. However, if you explicitly specify -ftz along with -fp-model precise, it overrides that aspect of the precise floating-point model and flushes to zero. FTZ is NOT precise by default, but as in this case one can override that.
As Tim mentioned, -O3 is good, but I'd throw in -xhost also.
ron
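For readers unfamiliar with flush-to-zero, here is a small Python sketch of what it changes. Python cannot switch the hardware FTZ mode, so `flush_to_zero` below is a hypothetical helper that only imitates the effect; the operand values are arbitrary.

```python
import math
import sys

# Smallest positive normal double; anything smaller in magnitude (and nonzero)
# is a subnormal (denormal) number.
SMALLEST_NORMAL = sys.float_info.min


def flush_to_zero(x):
    """Hypothetical stand-in for FTZ: replace a subnormal result with signed zero."""
    if x != 0.0 and abs(x) < SMALLEST_NORMAL:
        return math.copysign(0.0, x)
    return x


a = 3.0e-308
b = 2.5e-308

gradual = a - b                 # IEEE gradual underflow: tiny but nonzero
flushed = flush_to_zero(a - b)  # what an FTZ-style result would look like

print("a != b:            ", a != b)
print("gradual underflow: ", gradual)
print("flushed to zero:   ", flushed)
```

Under gradual underflow, `a != b` implies `a - b != 0`. With flushing, the difference of two distinct numbers can become exactly zero, which is why FTZ is not part of a value-safe model unless it is requested explicitly.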
Quoting - Ronald W. Green (Intel)
-fp-model precise does, in the absence of -ftz, set -no-ftz. However, if you explicitly specify -ftz along with -fp-model precise, it overrides that aspect of the precise floating-point model and flushes to zero. FTZ is NOT precise by default, but as in this case one can override that.
As Tim mentioned, -O3 is good, but I'd throw in -xhost also.
ron
One other note: -fp-model precise on Intel compilers is equivalent to -Kieee on PGI. So if you are not using -Kieee on PGI you're comparing apples to oranges.
Quoting - Ronald W. Green (Intel)
-fp-model precise does, in the absence of -ftz, set -no-ftz. However, if you explicitly specify -ftz along with -fp-model precise, it overrides that aspect of the precise floating-point model and flushes to zero. FTZ is NOT precise by default, but as in this case one can override that.
As Tim mentioned, -O3 is good, but I'd throw in -xhost also.
ron
I'm not the final arbiter of the compile options, so I'm loath to remove -fp-model precise, as I'm sure there is a good reason for it. But I'm not averse to adding flags, so I decided to use all (well, only -xhost as it seems to conflict/be superseded by -m
-O3 -ftz -align all -fno-alias -vec-report3 -xhost -assume protect_parens -prec-div -prec-sqrt -convert big_endian -fp-model precise -align dcommons
ETA: Oops. It was pointed out to me that Tim was most likely saying to substitute '-assume protect_parens -prec-div -prec-sqrt' for '-fp-model precise'. Instead, using:
-O3 -vec-report3 -xHost -assume protect_parens -prec-div -prec-sqrt -ftz -align all -fno-alias -convert big_endian -align dcommons
However, the results are the same. In doing so, I now see:
GC.F90(2595): (col. 10) remark: loop was not vectorized: not inner loop.
GC.F90(2594): (col. 13) remark: LOOP WAS VECTORIZED.
GC.F90(2600): (col. 42) remark: loop was not vectorized: not inner loop.
GC.F90(2602): (col. 13) remark: loop was not vectorized: not inner loop.
GC.F90(2601): (col. 16) remark: LOOP WAS VECTORIZED.
GC.F90(2602): (col. 13) remark: loop was not vectorized: not inner loop.
GC.F90(2601): (col. 16) remark: loop was not vectorized: vectorization possible but seems inefficient.
It looks like one of the loops was vectorized, though the other is conflicted still.
So, if I see both a "was vectorized" and a "was not vectorized" should I assume that the "not" wins out? I just ask because it's not just in this part of the code I see the conflicting statements, but all throughout the report:
GC.F90(1658): (col. 11) remark: loop was not vectorized: not inner loop.
GC.F90(1658): (col. 11) remark: loop was not vectorized: not inner loop.
GC.F90(1658): (col. 11) remark: LOOP WAS VECTORIZED.
GC.F90(1658): (col. 11) remark: loop was not vectorized: not inner loop.
GC.F90(1658): (col. 11) remark: loop was not vectorized: vectorization possible but seems inefficient.
for the loop:
Line 1658: FSWNA = 0.0
where FSWNA is a (pointer to a) 3-D array.
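When the report is long, it can help to group the remarks by source line so that conflicting messages for the same line stand out. The following Python sketch does that for the remark format shown above. The regular expression and the `tally` function are assumptions written for this example, not part of any Intel tool, and the script only counts messages; it does not say which version of a loop actually runs.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   GC.F90(1658): (col. 11) remark: LOOP WAS VECTORIZED.
REMARK = re.compile(
    r"^(?P<file>[^()]+)\((?P<line>\d+)\): \(col\. (?P<col>\d+)\) remark: (?P<msg>.*)$"
)


def tally(report_text):
    """Count each distinct remark per (file, line); unrecognized lines are skipped."""
    counts = defaultdict(lambda: defaultdict(int))
    for raw in report_text.splitlines():
        m = REMARK.match(raw.strip())
        if m is None:
            continue
        key = (m.group("file"), int(m.group("line")))
        counts[key][m.group("msg")] += 1
    return counts


sample = """\
GC.F90(1658): (col. 11) remark: loop was not vectorized: not inner loop.
GC.F90(1658): (col. 11) remark: loop was not vectorized: not inner loop.
GC.F90(1658): (col. 11) remark: LOOP WAS VECTORIZED.
GC.F90(1658): (col. 11) remark: loop was not vectorized: not inner loop.
GC.F90(1658): (col. 11) remark: loop was not vectorized: vectorization possible but seems inefficient.
"""

for (fname, line), msgs in sorted(tally(sample).items()):
    print(f"{fname}:{line}")
    for msg, n in msgs.items():
        print(f"  {n} x {msg}")
```

For a whole-array assignment to a 3-D array, several remarks per line are plausible because the statement expands into a loop nest, but that reading is an interpretation and not something the report states.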
Quoting - Ronald W. Green (Intel)
One other note: -fp-model precise on Intel compilers is equivalent to -Kieee on PGI. So if you are not using -Kieee on PGI you're comparing apples to oranges.
Oops. Didn't see this until I replied. With PGI I am using:
-fast -Kieee -Ktrap=fp
along with -Minfo and other flags. -fast might be much more aggressive than the ifort options being used:
> pgfortran -fast -help
Reading rcfile /opt/pgi/linux86-64/10.0/bin/.pgfortranrc
-fast Common optimizations; includes -O2 -Munroll=c:1 -Mnoframe -Mlre -Mautoinline
== -Mvect=sse -Mscalarsse -Mcache_align -Mflushz -Mpre