Hey, I recently experienced slow performance for a small toy program (attached) compiled with ifort 14.0.3 and -O3/-Ofast compared to -O2 and also compared to gfortran -O2/-O3.
When using gfortran 4.9.0:
gfortran -O2 -o fannkuch_gcc fannkuch.f90 && time ./fannkuch_gcc 11
takes 3.15s, and with -O3 it takes 2.75s.
When using ifort 14.0.3:
ifort -O2 -o fannkuch_intel fannkuch.f90 && time ./fannkuch_intel 11
it takes 3.03s, but with -O3 or -Ofast it goes to 4.8s. When replacing the array copy with an explicit loop in the source code, performance is better, but still worse than -O2 and nowhere near gfortran's -O3. I didn't spot any obvious differences with -vec-report or -opt-report.
I realize this is just a tiny program, so maybe it's normal to expect some over-optimization problems?
Steven
I agree it's somewhat strange that you must turn off interprocedural "optimizations" to avoid a slowdown at -O3, yet little shows up in the reports beyond the additional complaints about uncountable loops. Apparently, the compiler has thought about further optimizations on the inner loop but given up.
Contrary to advertisements, my experience has been that ifort doesn't optimize do while loops as well as a plain counted do loop (which looks like it could be used to streamline the inner loop). You seem to have remarked yourself on questions about array assignment [and substitution of memcpy].
Looking at the innermost loop, inside function flip: it swaps a section of an array that contains at most max_n (12) entries. For this I suggest replacing the two nested while loops with a SELECT CASE or a chain of IF THEN / ELSE IF, where each branch performs the swaps for a section of 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, or 12 entries.
Jim Dempsey
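A minimal sketch of the SELECT CASE idea suggested above, for readers who want to try it. The names flip_unrolled, reverse_prefix and swap are hypothetical and do not come from the attached program. Only lengths 2 to 5 are unrolled here to keep it short; longer sections fall back to a counted loop, whereas the suggestion as posted would unroll every length up to 12.

```fortran
module flip_unrolled
  implicit none
contains

  ! Reverse perm(1:lead) in place. Each unrolled case is a fixed
  ! sequence of swaps, so no inner loop is needed for short sections.
  subroutine reverse_prefix(perm, lead)
    integer, intent(inout) :: perm(:)
    integer, intent(in)    :: lead
    integer :: k

    select case (lead)
    case (2)
      call swap(perm(1), perm(2))
    case (3)
      call swap(perm(1), perm(3))      ! middle element stays put
    case (4)
      call swap(perm(1), perm(4))
      call swap(perm(2), perm(3))
    case (5)
      call swap(perm(1), perm(5))
      call swap(perm(2), perm(4))
    case default
      ! Assumption for brevity: lengths 6..12 use a generic loop.
      ! For lead <= 1 the loop runs zero times, so nothing changes.
      do k = 1, lead/2
        call swap(perm(k), perm(lead + 1 - k))
      end do
    end select
  end subroutine reverse_prefix

  ! Exchange two integers.
  pure subroutine swap(a, b)
    integer, intent(inout) :: a, b
    integer :: t
    t = a
    a = b
    b = t
  end subroutine swap

end module flip_unrolled
```

In the flip function, a call such as `call reverse_prefix(perm, lead)` would stand in for the inner swap loop. Whether this beats the loop depends on the compiler and flags, so it is worth timing both.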
It looked reasonable to me to replace the inner loop with an old-fashioned DO loop, but I had to add ivdep and loop count max(5) directives and set -xHost to approach the situation as originally reported. The compiler reports a max loop count of 12 in some situations, so it doesn't fall back to a default assumption of 100, but, with vectorization by directive, it doesn't optimize the remainder loop, which is where all the time will be spent.
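For reference, a counted DO loop with the two directives mentioned above might look roughly like the sketch below. The module and routine names (flip_counted, reverse_prefix_do) are hypothetical, and the exact placement of the directives in the real program is an assumption. `!DIR$ IVDEP` and `!DIR$ LOOP COUNT MAX(5)` are Intel Fortran directives; other compilers such as gfortran treat them as ordinary comments, so the code still builds there.

```fortran
module flip_counted
  implicit none
contains

  ! Reverse perm(1:lead) in place with a counted DO loop
  ! instead of a DO WHILE over two moving indices.
  subroutine reverse_prefix_do(perm, lead)
    integer, intent(inout) :: perm(:)
    integer, intent(in)    :: lead
    integer :: k, tmp

    ! Assumption: lead/2 never exceeds 5, which holds for lead <= 11.
    ! The directives are hints only; results are the same without them.
    !DIR$ IVDEP
    !DIR$ LOOP COUNT MAX(5)
    do k = 1, lead/2
      tmp = perm(k)
      perm(k) = perm(lead + 1 - k)
      perm(lead + 1 - k) = tmp
    end do
  end subroutine reverse_prefix_do

end module flip_counted
```

The trip count lead/2 is known on loop entry, which is what lets the compiler treat the loop as countable. With ifort this would be built with something like `-O3 -xHost`, as described in the post.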
Thanks for the suggestions!
Replacing the innermost while loop with a do loop was indeed not that successful. On the other hand, using select case was better, but the improvement was marginal.
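The first iteration of the inner loop can be optimized by fusing the copy from p with the first flip:

    ! perm(1:n)=p(1:n)
    ! The fused copy below only fills perm(1:lead); the tail is copied
    ! here unchanged because later flips may reach past the first lead.
    perm(lead+1:n) = p(lead+1:n)
    i = 1
    j = lead
    do while (i.lt.j)
      perm(j) = p(i)
      perm(i) = p(j)
      i = i + 1
      j = j - 1
    end do
    if (i.eq.j) perm(i) = p(i)
    flip = flip + 1
    lead = perm(1)
    do while (lead.ne.1)
      i = 1
      j = lead
      do while (i.lt.j)
        tmp = perm(i)
        perm(i) = perm(j)
        i = i + 1
        perm(j) = tmp
        j = j - 1
      end do
      flip = flip + 1
      lead = perm(1)
    end do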
Jim Dempsey