Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

performance regressions with -O3/-Ofast

Steven_V_
Beginner

Hey, I recently noticed that a small toy program (attached) runs slower when compiled with ifort 14.0.3 at -O3/-Ofast than at -O2, and also slower than when compiled with gfortran at -O2/-O3.

When using gfortran 4.9.0:

gfortran -O2 -o fannkuch_gcc fannkuch.f90 && time ./fannkuch_gcc 11

takes 3.15s, and with -O3 it takes 2.75s.

When using ifort 14.0.3:

ifort -O2 -o fannkuch_intel fannkuch.f90 && time ./fannkuch_intel 11

it takes 3.03s, but with -O3 or -Ofast it goes up to 4.8s. When I replace the array copy with an explicit loop in the source code, performance improves, but it is still worse than -O2 and nowhere near gfortran's -O3. I didn't spot any obvious differences with -vec-report or -opt-report.
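For reference, the two copy variants being compared look roughly like the sketch below. It is not taken from the attached file: the names p, perm, and n are borrowed from the snippet posted later in this thread, and the surrounding program is only scaffolding so that the fragment compiles on its own.

```fortran
program copy_variants
  implicit none
  integer, parameter :: n = 11      ! assumed size, matching the "11" run above
  integer :: p(n), perm(n), i

  p = [(i, i = 1, n)]

  ! Variant 1: array-section assignment.
  ! A compiler may implement this as a call to memcpy.
  perm(1:n) = p(1:n)

  ! Variant 2: explicit element-by-element loop.
  do i = 1, n
    perm(i) = p(i)
  end do

  print '(11(1x,i0))', perm
end program copy_variants
```

Both variants produce the same contents in perm; only the code the compiler generates for them may differ.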

I realize this is just a tiny program, so maybe it's normal to expect some over-optimization problems?

Steven

5 Replies
TimP
Honored Contributor III

I agree it's somewhat strange that you must turn off interprocedural "optimizations" to avoid a slowdown at -O3, yet little shows up in the reports beyond the additional complaints about uncountable loops.  Apparently, the compiler has thought about further optimizations on the inner loop but given up.

Contrary to advertisements, my experience has been that ifort doesn't optimize do while loops as well as a plain counted do loop (which looks like it could be used to streamline the inner loop).  You seem to have remarked yourself on questions about array assignment [and substitution of memcpy].

jimdempseyatthecove
Honored Contributor III

Looking at the innermost loop, inside function flip: it reverses a section of an array that contains at most max_n (12) entries. For this I suggest replacing the two nested while loops with a SELECT CASE or a chain of IF THEN ELSEIF, where each branch performs the swaps for exactly 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, or 12 entries.
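A minimal sketch of that idea, under assumptions: the subroutine name reverse_prefix, the helper swap, and the driver program are hypothetical and not from the attached source. Only the branches for 2 to 5 entries are written out; the branches for 6 to 12 would follow the same pattern, and a generic loop serves as the fallback.

```fortran
program select_case_flip
  implicit none
  integer :: perm(12), k

  perm = [(k, k = 1, 12)]
  call reverse_prefix(perm, 5)
  print '(12(1x,i0))', perm        ! the first five entries are now reversed

contains

  ! Reverse a(1:lead) using a fully unrolled branch per length.
  subroutine reverse_prefix(a, lead)
    integer, intent(inout) :: a(:)
    integer, intent(in)    :: lead
    integer :: i, j

    select case (lead)
    case (:1)
      ! nothing to reverse
    case (2)
      call swap(a(1), a(2))
    case (3)
      call swap(a(1), a(3))        ! middle element stays in place
    case (4)
      call swap(a(1), a(4))
      call swap(a(2), a(3))
    case (5)
      call swap(a(1), a(5))
      call swap(a(2), a(4))
    case default
      ! generic fallback for the lengths not unrolled in this sketch
      i = 1
      j = lead
      do while (i < j)
        call swap(a(i), a(j))
        i = i + 1
        j = j - 1
      end do
    end select
  end subroutine reverse_prefix

  ! Exchange two integers; small enough that a compiler can inline it.
  subroutine swap(x, y)
    integer, intent(inout) :: x, y
    integer :: tmp
    tmp = x
    x = y
    y = tmp
  end subroutine swap

end program select_case_flip
```

Whether the unrolled branches beat the loop depends on the compiler and target, so this is something to measure rather than assume.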

Jim Dempsey

TimP
Honored Contributor III

It looked reasonable to me to replace the inner loop with an old-fashioned DO loop, but I had to add ivdep and loop count max(5) directives and set -xHost, to approach the situation as originally reported.  The compiler reports a max loop count of 12 in some situations so it doesn't fall to a default assumption of 100, but, with vectorization by directive, it doesn't optimize the remainder loop which will be where all the time is spent.
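The exact loop and directive placement are not shown in this thread, so the following is only a sketch of what a counted DO version with those directives might look like. The subroutine name reverse_prefix and the driver program are assumptions; `!DIR$ IVDEP` and `!DIR$ LOOP COUNT` are Intel compiler directives, and other compilers read those lines as plain comments.

```fortran
program counted_do_flip
  implicit none
  integer :: perm(11), k

  perm = [(k, k = 1, 11)]
  call reverse_prefix(perm, 7)
  print '(11(1x,i0))', perm        ! the first seven entries are now reversed

contains

  ! Reverse a(1:lead) with a plain counted DO loop instead of DO WHILE.
  subroutine reverse_prefix(a, lead)
    integer, intent(inout) :: a(:)
    integer, intent(in)    :: lead
    integer :: i, tmp

    ! With at most 11 entries the loop runs at most 11/2 = 5 times,
    ! which is the assumed reason for MAX(5).
    !DIR$ IVDEP
    !DIR$ LOOP COUNT MAX(5)
    do i = 1, lead / 2
      tmp = a(i)
      a(i) = a(lead + 1 - i)
      a(lead + 1 - i) = tmp
    end do
  end subroutine reverse_prefix

end program counted_do_flip
```

With so few iterations, most of the work ends up in the remainder loop, which is the part described above as not being optimized.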

Steven_V_
Beginner

Thanks for the suggestions!

Replacing the innermost while loop with a do loop was indeed not that successful. On the other hand, using select case was better, but the improvement was marginal.

jimdempseyatthecove
Honored Contributor III

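The first iteration of the inner loop can be optimized by fusing the copy of p into perm with the first flip:

! perm(1:n) = p(1:n)            ! full copy no longer needed, see below
! lead is assumed to hold p(1) at this point
i = 1
j = lead
do while (i .lt. j)             ! copy p(1:lead) into perm in reversed order
  perm(j) = p(i)
  perm(i) = p(j)
  i = i + 1
  j = j - 1
end do
if (i .eq. j) perm(i) = p(i)    ! middle element when lead is odd
perm(lead+1:n) = p(lead+1:n)    ! entries past lead must still be copied, later flips may reach them
flip = flip + 1
lead = perm(1)
do while (lead .ne. 1)          ! remaining flips work in place on perm
  i = 1
  j = lead
  do while (i .lt. j)
    tmp = perm(i)
    perm(i) = perm(j)
    i = i + 1
    perm(j) = tmp
    j = j - 1
  end do
  flip = flip + 1
  lead = perm(1)
end do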

Jim Dempsey
