Intel® Fortran Compiler

/Qsave and /Qzero optimization degradation

Ivan_F_
Beginner
2,030 Views

We have noticed a substantial change in optimizer performance starting with the version 13 compiler when targeting the Intel 64 architecture. This change is still present in 2013 SP1 Update 1.

When the options /Qsave and /Qzero are both used, the optimizer appears to be completely turned off for some code (i.e., /O2 or /O3 makes no difference).

This problem does not exist with the V12.0 Fortran compiler, where optimization is not affected by the combination of these options.

The problem can be demonstrated by compiling the WHET8.FOR Whetstone benchmark with the options /O2 /Qsave /Qzero using both the 2011 (u5) and the 2013 SP1 u1 compilers.

In this example, the V13 or V14 executable runs 3 times slower than the V12.0 one.

We see the same performance drop in some other code. Removing /Qsave or /Qzero restores normal optimization, but that cannot be done with all our code.

Is there an additional compiler option to solve this problem, or can it be considered a bug?

This is a big problem for us.

Ivan Fontaine

8 Replies
Steven_L_Intel1
Employee

Would you please attach the particular source you're using? /Qsave and /Qzero will inhibit optimizations, as they prevent the optimizer from keeping variables in registers.
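
To make the effect of the two options concrete: roughly speaking, /Qsave treats local variables as if they carried the SAVE attribute (static storage), and /Qzero gives uninitialized saved local scalars a starting value of zero. Legacy code that depends on them usually looks like a routine whose local variable must survive between calls. The sketch below writes that dependency out explicitly in standard Fortran; the module and function names are made up for illustration and do not come from the code discussed in this thread.

```fortran
module legacy_counter
  implicit none
contains
  ! Returns how many times this function has been called so far.
  ! Legacy code often wrote this with a plain local "integer n" and relied on
  ! /Qsave (static storage) plus /Qzero (zero start value) to make it work.
  function next_call_number() result(n_out)
    integer :: n_out
    ! Explicit equivalent: SAVE keeps the value between calls and "= 0"
    ! supplies the starting value. (An initializer already implies SAVE;
    ! it is spelled out here for clarity.)
    integer, save :: n = 0
    n = n + 1
    n_out = n
  end function next_call_number
end module legacy_counter
```

Because the variable lives in static storage, its value has to be kept consistent in memory across calls, which is the kind of constraint that limits how freely an optimizer can hold it in a register.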

Ivan_F_
Beginner

Hi Steve.

The source is part of the BMDS benchmark.

You'll find it attached.

There is no question that /Qsave and /Qzero make optimization harder to achieve.

But as the V12.0 compiler is able to optimize that kind of code, the compiler upgrade is considered a big regression by some of our clients. We're talking about thousands of systems where we lose several hours per day.

Ivan Fontaine

mecej4
Honored Contributor III

The code in whet8.for does not appear to need /Qsave or /Qzero to work properly.

For the mainframes that this benchmark was run on in the 1970s, whose compilers used SAVE as the default or sole mode of operation, the benchmark may have made sense. I suspect that this artificial benchmark is now obsolete and its results have no bearing on the performance of real codes on hardware with multi-level cache memory.

Ivan_F_
Beginner

mecej4 wrote:

The code in whet8.for does not appear to need /Qsave or /Qzero to work properly.

For the mainframes that this benchmark was run on in the 1970s, whose compilers operated with SAVE as the default or sole mode of operation, the benchmark may have made sense. I suspect that this artificial benchmark is obsolete and its results have no bearing on the performance of real codes.

This "benchmark" code and its compilation parameters are just a way to expose the change in optimization behaviour between the V12 and V13/V14 compilers.

The fact that the V13/V14 "benchmark" executable is 3 times slower than the V12 one is somewhat hard to explain.

We have some other production code where a problem of this kind exists. This code is huge and does not belong to us. We have no permission to transmit it or even to modify it at will. We're just facing a big performance regression (on the real production code, the complete computation is 20% slower than with the V12 compiler). Even if the production code ran correctly without these options, we would have no permission to change the options.

Ivan Fontaine

mecej4
Honored Contributor III

The following timing results, obtained by running your Whet8 code on an i3-2350M laptop with W8 Pro-x64 and the 14.0.1.139 (IA32) compiler, show that there is, indeed, a major impact of using /Qzero. On the other hand, the use of /Qsave alone seems to have a negligible impact.

With ifort /fast alone, the result printed by the program may be paraphrased as 13.83 gWhets/s. When /Qsave is added, the result changes slightly, giving 13.77 gWhets/s. When /Qzero is also added, the result changes drastically, dropping to 4.04 gWhets/s.

I agree that this is a somewhat unexpected slow-down and that it may bear further investigation. However, it is conceivable that the optimizer may remove much of the calculations if it determines that the results of those calculations are not used later in the program, defeating the purpose of running the toy benchmark. If so, what is faulty is the seemingly high speed that is reported when /Qzero is not used. In other words, we may see a misleadingly large speed up when variables are not required to be set to zero and the results are not reused.

I attach a pared-down version of your program which contains no subroutines and no variables that are used without initialization. Therefore, the two options under discussion have no effect on the results calculated; however, the running time can be affected by using these options. With the current 14.0.1.139 (IA32) compiler, the run times are 2 seconds and 82 seconds, the latter when -Qsave and -Qzero are used. With the ancient IFort 7.0 IA32 compiler, the run times are 15 seconds whether or not -Qsave and -Qzero are used. It seems that the current compiler does a great job of optimizing modern Fortran code but outputs slow EXE code when options needed for legacy codes, such as -Qzero, are used.

TimP
Honored Contributor III

If you examine the opt-report, you will see that optimization of loop invariants and collapsing of loops which store the same values repeatedly are performed when /Qzero is omitted. For at least 25 years, most serious benchmarks which rely on repeating the same calculations have taken steps to hide this from the compiler.

There is certainly value in detection of loop invariants but "real" applications don't normally have loops full of nothing but invariants.
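
A minimal sketch of that kind of precaution, not taken from WHET8.FOR: each iteration's input depends on the loop index, and every result feeds a checksum that the caller is expected to print. That makes it much harder for an optimizer to hoist the work out of the loop or discard it as unused. The names bench_kernel and run_kernel are hypothetical.

```fortran
module bench_kernel
  implicit none
contains
  ! Repeats a small calculation n_iter times and returns a checksum.
  ! The input changes with the loop index and every result is accumulated,
  ! so the work is neither loop-invariant nor unused.
  function run_kernel(n_iter) result(checksum)
    integer, intent(in) :: n_iter
    double precision :: checksum
    double precision :: x
    integer :: i
    checksum = 0.0d0
    do i = 1, n_iter
      x = dble(i) * 0.5d0          ! input differs on every iteration
      checksum = checksum + x * x  ! result is consumed, not thrown away
    end do
  end function run_kernel
end module bench_kernel
```

A driver would time the call (for example with the SYSTEM_CLOCK intrinsic) and then print the checksum, so that the timed work has a visible effect on the program's output.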

Ivan_F_
Beginner

Well, at least I've found a portion of real code that the V13 and V14 compilers are unable to optimize, while V12 seems to perform really well. When profiling it, the "L" loop takes 6% of the time and the inner loop lines 2.5, 2.9, 2.9 and 2.7%, of a total of 18.6%.

With the V12 compiler the whole thing is less than 1% of the total computing time.

So I think that there is indeed a major problem with the v13/v14 optimizer when both the /Qsave and /Qzero options are used. With the V14 compiler no optimization at all is done in that portion of code when /Qsave and /Qzero options are enabled.

  DO M=1,SA
    DO N=1,SA_Y
      DO I=1,SFX
        DO K=1,SFX
          DO J=1,SFY
            DO L=1,SFY
              A (I,K,J,L)=A (I,K,J,L)+XX(M,N)*IA(M,N,I,K,J,L)
              B (I,K,J,L)=B (I,K,J,L)+YY(M,N)*IB(M,N,I,K,J,L)
              GA(I,K,J,L)=GA(I,K,J,L)+XY(M,N)*IC(M,N,I,K,J,L)
              GX(I,K,J,L)=GX(I,K,J,L)+XX(M,N)*ID(M,N,I,K,J,L)*G(M,N)
            END DO
          END DO
        END DO
      END DO
    END DO
  END DO
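
For anyone who wants to compile this fragment with and without /Qsave /Qzero, a self-contained wrapper might look like the sketch below. The array shapes, the double precision type, and the names nest_repro and accumulate are all assumptions, since the declarations are not shown above; in particular, under implicit typing IA, IB, IC and ID would be integers unless declared otherwise. It may or may not reproduce the slowdown seen in the production code.

```fortran
module nest_repro
  implicit none
contains
  ! Wraps the loop nest in a subroutine so it can be compiled on its own.
  ! Shapes and types are assumptions; the original declarations are unknown.
  subroutine accumulate(sa, sa_y, sfx, sfy, xx, yy, xy, g, &
                        ia, ib, ic, id, a, b, ga, gx)
    integer, intent(in) :: sa, sa_y, sfx, sfy
    double precision, intent(in) :: xx(sa, sa_y), yy(sa, sa_y)
    double precision, intent(in) :: xy(sa, sa_y), g(sa, sa_y)
    double precision, intent(in) :: ia(sa, sa_y, sfx, sfx, sfy, sfy)
    double precision, intent(in) :: ib(sa, sa_y, sfx, sfx, sfy, sfy)
    double precision, intent(in) :: ic(sa, sa_y, sfx, sfx, sfy, sfy)
    double precision, intent(in) :: id(sa, sa_y, sfx, sfx, sfy, sfy)
    double precision, intent(inout) :: a(sfx, sfx, sfy, sfy)
    double precision, intent(inout) :: b(sfx, sfx, sfy, sfy)
    double precision, intent(inout) :: ga(sfx, sfx, sfy, sfy)
    double precision, intent(inout) :: gx(sfx, sfx, sfy, sfy)
    integer :: m, n, i, k, j, l

    ! Same loop order and statements as the fragment above.
    do m = 1, sa
      do n = 1, sa_y
        do i = 1, sfx
          do k = 1, sfx
            do j = 1, sfy
              do l = 1, sfy
                a (i,k,j,l) = a (i,k,j,l) + xx(m,n)*ia(m,n,i,k,j,l)
                b (i,k,j,l) = b (i,k,j,l) + yy(m,n)*ib(m,n,i,k,j,l)
                ga(i,k,j,l) = ga(i,k,j,l) + xy(m,n)*ic(m,n,i,k,j,l)
                gx(i,k,j,l) = gx(i,k,j,l) + xx(m,n)*id(m,n,i,k,j,l)*g(m,n)
              end do
            end do
          end do
        end do
      end do
    end do
  end subroutine accumulate
end module nest_repro
```

A driver would fill the input arrays, zero the four accumulators, call accumulate many times inside a timed region, and print a sum of the results so the work cannot be discarded.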

As Premier Support seems to be broken, if Intel wants more information, please get in touch with me. I cannot give more details on a public forum. It is really important for us.

Ivan Fontaine

Steven_L_Intel1
Employee

How is Intel Premier Support broken? You can use "Send Author a Message" to send me the details and test cases.
