
Steven_V_ · ‎05-14-2014

Hey, I recently experienced slow performance for a small toy program (attached) compiled with ifort 14.0.3 and -O3/-Ofast compared to -O2 and also compared to gfortran -O2/-O3.

When using gfortran 4.9.0:

gfortran -O2 -o fannkuch_gcc fannkuch.f90 && time ./fannkuch_gcc 11

takes 3.15s, and with -O3 it takes 2.75s.

When using ifort 14.0.3:

ifort -O2 -o fannkuch_intel fannkuch.f90 && time ./fannkuch_intel 11

it takes 3.03s, but with -O3 or -Ofast it goes to 4.8s. When replacing the array copy with an explicit loop in the source code, performance is better, but still worse than -O2 and nowhere near gfortran's -O3. I didn't spot any obvious differences with -vec-report or -opt-report.

I realize this is just a tiny program, so maybe it's normal to expect some over-optimization problems?

Steven

TimP · ‎05-14-2014

I agree it's somewhat strange that you must turn off interprocedural "optimizations" to avoid a slowdown at -O3, yet little shows up in the reports beyond the additional complaints about uncountable loops. Apparently, the compiler has thought about further optimizations on the inner loop but given up.

Contrary to advertisements, my experience has been that ifort doesn't optimize do while loops as well as a plain counted do loop (which looks like it could be used to streamline the inner loop). You seem to have remarked yourself on questions about array assignment [and substitution of memcpy].

jimdempseyatthecove · ‎05-14-2014

Looking at the innermost loop, inside function flip: it swaps a section of an array that contains at most max_n (12) entries. For this I suggest replacing the two nested while loops with a SELECT CASE or a chain of IF THEN ELSE IF, where each branch performs the swaps for a section of 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, or 12 entries.
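A minimal sketch of the SELECT CASE idea, not the attached program: the module, the helper names (swap, reverse_prefix, count_flips) and the choice to unroll only lengths 2 to 6 are assumptions made for illustration. Longer sections fall through to a generic counted loop.

```fortran
module flip_select
  implicit none
contains

  ! Exchange two integers.
  pure subroutine swap(a, b)
    integer, intent(inout) :: a, b
    integer :: t
    t = a; a = b; b = t
  end subroutine swap

  ! Reverse perm(1:lead) in place. Lengths 2..6 are written out as
  ! straight-line swaps; any other length uses a generic counted loop.
  subroutine reverse_prefix(perm, lead)
    integer, intent(inout) :: perm(:)
    integer, intent(in)    :: lead
    integer :: i
    select case (lead)
    case (2)
      call swap(perm(1), perm(2))
    case (3)
      call swap(perm(1), perm(3))
    case (4)
      call swap(perm(1), perm(4))
      call swap(perm(2), perm(3))
    case (5)
      call swap(perm(1), perm(5))
      call swap(perm(2), perm(4))
    case (6)
      call swap(perm(1), perm(6))
      call swap(perm(2), perm(5))
      call swap(perm(3), perm(4))
    case default
      do i = 1, lead/2
        call swap(perm(i), perm(lead + 1 - i))
      end do
    end select
  end subroutine reverse_prefix

  ! Count flips until the leading element is 1.
  ! p is assumed to be a permutation of 1..size(p).
  integer function count_flips(p) result(nflips)
    integer, intent(in) :: p(:)
    integer :: perm(size(p))
    integer :: lead
    perm = p
    nflips = 0
    lead = perm(1)
    do while (lead /= 1)
      ! lead is a local copy, so it does not alias perm(1) inside the call
      call reverse_prefix(perm, lead)
      nflips = nflips + 1
      lead = perm(1)
    end do
  end function count_flips

end module flip_select

program demo_select
  use flip_select
  implicit none
  print *, count_flips([4, 2, 1, 5, 3])        ! 5 flips
  print *, count_flips([7, 1, 2, 3, 4, 5, 6])  ! 2 flips, via the default branch
end program demo_select
```

Whether the unrolled branches actually beat the loop depends on the compiler inlining swap and on how well the branch on lead predicts, so it is worth timing both.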

Jim Dempsey

TimP · ‎05-14-2014

It looked reasonable to me to replace the inner loop with an old-fashioned DO loop, but I had to add ivdep and loop count max(5) directives and set -xHost to get close to the performance originally reported. The compiler reports a max loop count of 12 in some situations, so it doesn't fall back to a default assumption of 100, but, with vectorization by directive, it doesn't optimize the remainder loop, which is where all the time will be spent.
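For reference, a counted DO version with those two directives might look like the sketch below. It is an assumption about the shape of the rewrite, not TimP's actual code: the module and function names are made up, and MAX(5) assumes n is at most 11 (as in the runs above), so that lead/2 never exceeds 5. Compilers that do not recognise the !DIR$ lines treat them as ordinary comments.

```fortran
module flip_counted
  implicit none
contains

  ! Count flips using a counted DO loop for the reversal.
  ! p is assumed to be a permutation of 1..size(p).
  integer function count_flips_do(p) result(nflips)
    integer, intent(in) :: p(:)
    integer :: perm(size(p))
    integer :: i, lead, tmp
    perm = p
    nflips = 0
    lead = perm(1)
    do while (lead /= 1)
      ! Each iteration touches a distinct pair (i, lead+1-i), so the
      ! iterations are independent; the directives below only pass that
      ! on as a hint, together with an upper bound on the trip count.
      !DIR$ IVDEP
      !DIR$ LOOP COUNT MAX(5)
      do i = 1, lead/2
        tmp = perm(i)
        perm(i) = perm(lead + 1 - i)
        perm(lead + 1 - i) = tmp
      end do
      nflips = nflips + 1
      lead = perm(1)
    end do
  end function count_flips_do

end module flip_counted

program demo_counted
  use flip_counted
  implicit none
  print *, count_flips_do([4, 2, 1, 5, 3])   ! 5 flips
end program demo_counted
```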

Steven_V_ · ‎05-14-2014

Thanks for the suggestions!

Replacing the innermost while loop with a do loop was indeed not that successful. On the other hand, using select case was better, but the improvement was marginal.

jimdempseyatthecove · ‎05-18-2014

The first flip can be merged with the copy of p into perm, so the separate array assignment goes away:

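! perm(1:n)=p(1:n)   (no longer needed: the copy is folded into the first flip)
i = 1
j = lead
do while (i .lt. j)
  perm(j) = p(i)
  perm(i) = p(j)
  i = i + 1
  j = j - 1
end do
if (i .eq. j) perm(i) = p(i)   ! odd length: the middle entry is copied unchanged
perm(lead+1:n) = p(lead+1:n)   ! entries past lead must be copied too; later flips can reach them
flip = flip + 1
lead = perm(1)
do while (lead .ne. 1)
  i = 1
  j = lead
  do while (i .lt. j)
    tmp = perm(i)
    perm(i) = perm(j)
    i = i + 1
    perm(j) = tmp
    j = j - 1
  end do
  flip = flip + 1
  lead = perm(1)
end do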
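A quick way to check a fused first flip is to compare it against the plain copy-then-flip version. The sketch below does that under a few assumptions: the function names are hypothetical, lead is taken to be p(1), a guard is added for the case where p(1) is already 1, and the entries perm(lead+1:n) are copied explicitly because a later flip can reach past the first lead.

```fortran
module flip_fused
  implicit none
contains

  ! Reference: copy p into perm, then flip until the leading element is 1.
  integer function flips_ref(p) result(nflips)
    integer, intent(in) :: p(:)
    integer :: perm(size(p))
    integer :: i, j, lead, tmp
    perm = p
    nflips = 0
    lead = perm(1)
    do while (lead /= 1)
      i = 1
      j = lead
      do while (i < j)
        tmp = perm(i)
        perm(i) = perm(j)
        perm(j) = tmp
        i = i + 1
        j = j - 1
      end do
      nflips = nflips + 1
      lead = perm(1)
    end do
  end function flips_ref

  ! Fused: the first flip reads from p and writes the reversed prefix to perm.
  integer function flips_fused(p) result(nflips)
    integer, intent(in) :: p(:)
    integer :: perm(size(p))
    integer :: i, j, lead, tmp, n
    n = size(p)
    nflips = 0
    lead = p(1)
    if (lead == 1) return            ! assumed guard: nothing to flip
    i = 1
    j = lead
    do while (i < j)
      perm(j) = p(i)
      perm(i) = p(j)
      i = i + 1
      j = j - 1
    end do
    if (i == j) perm(i) = p(i)       ! middle entry of an odd-length prefix
    perm(lead+1:n) = p(lead+1:n)     ! tail is untouched by the first flip
    nflips = 1
    lead = perm(1)
    do while (lead /= 1)
      i = 1
      j = lead
      do while (i < j)
        tmp = perm(i)
        perm(i) = perm(j)
        perm(j) = tmp
        i = i + 1
        j = j - 1
      end do
      nflips = nflips + 1
      lead = perm(1)
    end do
  end function flips_fused

end module flip_fused

program check_fused
  use flip_fused
  implicit none
  integer :: p(5)
  p = [4, 2, 1, 5, 3]   ! first lead is 4, second is 5, so the tail copy matters
  print *, flips_ref(p), flips_fused(p)   ! both give 5
end program check_fused
```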

Jim Dempsey

performance regressions with -O3/-Ofast