Fortran 90 intrinsic functions slow?

Eric3 · ‎09-04-2012

Hello everybody,

I am trying to reduce the execution time of an existing 2D flow simulation program that was originally written in Fortran 77. Since I am quite new to Fortran programming, I first tried to gather some information about efficient coding. I was often advised to use the "new" intrinsic functions, because they tell the compiler what is going on and let it perform optimizations (e.g. vectorization). Although these functions, and the use of array expressions instead of scalar ones, made my code much more compact and elegant, the run time increased, to my surprise.

These are some code snippets in the old and new version, and the respective required cpu time (3 runs each):

===========================================================================
1. Example CSHIFT: performs a shift of the elements in a matrix by one element along all directions (including diagonals)
===========================================================================

old: t = 8.777; 8.553; 8.789 s
-----------------------------------
do j=1,nj
do i=1,ni
    ie = mod(i,ni) + 1
    iw = ni - mod(ni+1-i,ni)
    jn = mod(j,nj) + 1
    js = nj - mod(nj+1-j,nj)

    fn(ie,j ,1) = f(i,j,1)
    fn(i ,jn,2) = f(i,j,2)
    fn(iw,j ,3) = f(i,j,3)
    fn(i ,js,4) = f(i,j,4)
    fn(ie,jn,5) = f(i,j,5)
    fn(iw,jn,6) = f(i,j,6)
    fn(iw,js,7) = f(i,j,7)
    fn(ie,js,8) = f(i,j,8)
    fn(i ,j ,0) = f(i,j,0)
enddo
enddo
----------------------------------------
new: t = 11.009; 11.241; 11.033 s
-----------------------------------------
fn(:,:,0) = f(:,:,0)
fn(:,:,1) = cshift(f(:,:,1),-1,1)
fn(:,:,2) = cshift(f(:,:,2),-1,2)
fn(:,:,3) = cshift(f(:,:,3),1,1)
fn(:,:,4) = cshift(f(:,:,4),1,2)
fn(:,:,5) = cshift(cshift(f(:,:,5),-1,1),-1,2)
fn(:,:,6) = cshift(cshift(f(:,:,6),1,1),-1,2)
fn(:,:,7) = cshift(cshift(f(:,:,7),1,1),1,2)
fn(:,:,8) = cshift(cshift(f(:,:,8),-1,1),1,2)

=====================================================
2. Example WHERE: assigns new values to a matrix where condition (obst) is fulfilled
=====================================================

old: t = 2.488; 2.460; 2.712 s
-----------------------------------
do j=1,nj
do i=1,ni
    if (obst(i,j)) then
      f(i,j,1) = fn(i,j,3)
      f(i,j,2) = fn(i,j,4)
      f(i,j,3) = fn(i,j,1)
      f(i,j,4) = fn(i,j,2)
      f(i,j,5) = fn(i,j,7)
      f(i,j,6) = fn(i,j,8)
      f(i,j,7) = fn(i,j,5)
      f(i,j,8) = fn(i,j,6)
      f(i,j,0) = fn(i,j,0)
    endif
enddo
enddo
------------------------------------
new: t = 5.404; 5.628; 5.456 s
------------------------------------
where(obst)
f(:,:,1) = fn(:,:,3)
f(:,:,2) = fn(:,:,4)
f(:,:,3) = fn(:,:,1)
f(:,:,4) = fn(:,:,2)
f(:,:,5) = fn(:,:,7)
f(:,:,6) = fn(:,:,8)
f(:,:,7) = fn(:,:,5)
f(:,:,8) = fn(:,:,6)
f(:,:,0) = fn(:,:,0)
endwhere

=====================
3. Example WHERE,SUM,various:
=====================

old: t = 6.020; 6.056; 6.160 s
-----------------------------------
do j=1,nj
do i=1,ni
    if(.not.obst(i,j))then
      rho(i,j) = fn(i,j,0)+fn(i,j,1)+fn(i,j,2)+fn(i,j,3)+fn(i,j,4)+fn(i,j,5)+fn(i,j,6)+fn(i,j,7)+fn(i,j,8)
      u(i,j) = (fn(i,j,1)+fn(i,j,5)+fn(i,j,8)-fn(i,j,6)-fn(i,j,3)-fn(i,j,7))/rho(i,j)
      v(i,j) = (fn(i,j,5)+fn(i,j,2)+fn(i,j,6)-fn(i,j,7)-fn(i,j,4)-fn(i,j,8))/rho(i,j)
    else
      rho(i,j) = rho_in
      u(i,j) = 0.d0
      v(i,j) = 0.d0
    endif
enddo
enddo
-----------------------------------------
new: t = 12.421; 11.897; 11.645 s
-----------------------------------------
where (.not.obst)
rho(:,:) = sum(fn,3)
u(:,:) = (fn(:,:,1)+fn(:,:,5)+fn(:,:,8)-fn(:,:,6)-fn(:,:,3)-fn(:,:,7))/rho
v(:,:) = (fn(:,:,5)+fn(:,:,2)+fn(:,:,6)-fn(:,:,7)-fn(:,:,4)-fn(:,:,8))/rho
elsewhere
rho = rho_in
u(:,:) = 0.d0
v(:,:) = 0.d0
endwhere
-------------------------------------------

I did not expect a significant performance boost, but I did not expect a drop either. Has anybody made similar observations, or can anybody explain the results? Any help would be greatly appreciated.

With best regards,
Eric

My setup:

OS: Kubuntu 11.10
Compiler Version: Fortran Intel(R) 64 Compiler XE, Version 12.1.5.339
Compilation flags: none
CPU: Intel(R) Pentium(R) Dual CPU E2180 @ 2.00GHz (cache size: 1024 KB)
Memory: 2 GB
Time measurement: using subroutine CPU_TIME; the code is looped 20000 times (variables change on every pass)

Used variables:

real*8, dimension(1:100,1:100,0:8) :: f, fn
logical, dimension(100,100) :: obst (10% of the elements are true, arranged as a sphere)
integer :: ni, nj (= 100)
real*8, dimension(100,100) :: u, v, rho
real*8 :: rho_in
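
For reference, a minimal sketch of the kind of CPU_TIME harness described above. The program name, the choice of kernel (a single cshift) and the way the input is perturbed on each pass are assumptions made for illustration, not the code that was actually timed.

```fortran
program time_kernel
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: ni = 100, nj = 100, nrep = 20000
  real(dp) :: f(ni,nj,0:8), fn(ni,nj,0:8)
  real :: t0, t1
  integer :: n

  call random_number(f)

  call cpu_time(t0)
  do n = 1, nrep
    ! Kernel under test: one circular shift along the first dimension.
    fn(:,:,1) = cshift(f(:,:,1), -1, 1)
    ! Change the input on every pass so the compiler cannot hoist the work
    ! out of the repetition loop (assumed stand-in for "variables change").
    f(1,1,1) = fn(1,1,1) + real(n, dp)
  end do
  call cpu_time(t1)

  ! Printing a checksum keeps the result live, so the kernel is not removed.
  print *, 'cpu seconds:', t1 - t0, ' checksum:', sum(fn(:,:,1))
end program time_kernel
```

Timings from a harness like this depend strongly on the optimization level, so it is worth recording the compiler flags alongside each result.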

Steven_L_Intel1 · ‎09-04-2012

The semantics of the "old" and "new" code are not the same, and some of your cases, such as the nested cshift, create array temporaries. Using WHERE will create a temporary for the mask. You're right that sometimes the newer usages may be more compact and readable but have hidden performance issues. Generally they work well, but CSHIFT and WHERE are probably not as highly optimized as some other aspects of the language. I would also discourage you from using (:,:) to mean a whole array. Most of the time this is harmless, but it can sometimes cause unwanted side effects.
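
To make the point about temporaries concrete, here is a small sketch (not taken from the original program; the program name and array sizes are assumptions) that computes the diagonal shift of example 1 both ways and checks that the results agree. The loop writes each element directly to its destination, whereas the nested cshift has to produce the inner result before the outer shift can be applied, which a compiler will typically do through an array temporary.

```fortran
program cshift_vs_loop
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: ni = 6, nj = 5        ! small illustrative sizes
  real(dp) :: f(ni,nj), fn_loop(ni,nj), fn_intr(ni,nj)
  integer :: i, j, ie, jn

  call random_number(f)

  ! Explicit loop: every element moves one step in +i and +j with wrap-around.
  do j = 1, nj
    jn = mod(j, nj) + 1
    do i = 1, ni
      ie = mod(i, ni) + 1
      fn_loop(ie, jn) = f(i, j)
    end do
  end do

  ! Array form: the inner cshift result must exist before the outer one runs.
  fn_intr = cshift(cshift(f, -1, 1), -1, 2)

  ! Both forms only copy values, so exact comparison is safe here.
  if (all(fn_loop == fn_intr)) then
    print *, 'loop and nested cshift agree'
  else
    error stop 'loop and nested cshift differ'
  end if
end program cshift_vs_loop
```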

TimP · ‎09-04-2012

In addition to what Steve said, where..elsewhere often performs relatively poorly, e.g. in comparison to two separate blocks:

    where (.not.obst) ... end where
    where (obst) ... end where

You could check your compiler reports; I would not be surprised to find that the compiler has difficulty fusing or otherwise optimizing multiple cshift operations.
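
As a sketch of that suggestion, the program below fills a density array once with where..elsewhere and once with two separate where statements, then checks that both give the same result. The obstacle shape, the array sizes and the names rho_a, rho_b and s are hypothetical. It only shows that the two forms are interchangeable here; whether the split form is faster has to be measured with the compiler in question.

```fortran
program split_where
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: ni = 8, nj = 8         ! small illustrative sizes
  real(dp), parameter :: rho_in = 1.0_dp
  real(dp) :: fn(ni,nj,0:8), s(ni,nj), rho_a(ni,nj), rho_b(ni,nj)
  logical :: obst(ni,nj)
  integer :: i, j

  call random_number(fn)

  ! Hypothetical obstacle: a small disc around the point (4,4).
  do j = 1, nj
    do i = 1, ni
      obst(i,j) = (i - 4)**2 + (j - 4)**2 <= 4
    end do
  end do

  s = sum(fn, dim=3)            ! density, summed over the 9 directions

  ! Form A: one construct with an ELSEWHERE branch.
  where (.not. obst)
    rho_a = s
  elsewhere
    rho_a = rho_in
  end where

  ! Form B: two independent masked assignments.
  where (.not. obst) rho_b = s
  where (obst)       rho_b = rho_in

  if (all(rho_a == rho_b)) then
    print *, 'both forms give the same rho'
  else
    error stop 'forms differ'
  end if
end program split_where
```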

Eric3 · ‎09-06-2012

Thank you for your comments. Since I never found any performance gain from where, cshift, sum or vector operations in my code, I will stick to the "traditional" programming style using ifs and nested loops, which the compiler seems to auto-vectorize quite well.

@ Steve: I just used the (:,:) to indicate an array, but thanks for that hint.

@ TimP: Even a single cshift operation does not perform faster in my case.

With best regards,
Eric

TimP · ‎09-06-2012

If you consider the use of !dir$ simd firstprivate and the like to be part of traditional programming style, it's true in my experience that performance of cshift and maxval/minval can be improved upon. Unfortunately, those directives have involved changes in usage with each major ifort version, and I haven't found adequate documentation for the just-released 13.0. So it's a stretch to call f77 plus directives "traditional."

I don't recall ever seeing where..elsewhere advocated for best performance. A more commonly cited situation from the f95 array syntax add-ons is the performance issues associated with the standard definition of forall. People are still hoping to see fantastic performance with forall in spite of 15 years of discussion and the addition of do concurrent for such situations. Do concurrent also requires directives such as VECTOR ALIGNED for full performance in ifort for cases where they would be needed in f77 code.

Upon re-reading your post, I don't know whether you refer to worst-case situations or if you mean to generalize from the specific cases you pose. For example, ifort has had to improve optimization over the years for the most commonly encountered cshift usage (which isn't much like yours), but if you shift to a different brand of compiler, you have a different performance result. Another example is that ifort eoshift remains relatively poor in the 64-bit implementations, not so bad in the 32-bit ones.
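
For readers who have not met do concurrent, here is a minimal sketch of example 3 written that way. The array sizes, obstacle layout and program name are assumptions, and no compiler directives are shown because, as noted above, their usage differs between ifort versions. Whether this vectorizes better than the plain nested loops is compiler dependent and needs to be measured.

```fortran
program do_concurrent_moments
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: ni = 8, nj = 8         ! small illustrative sizes
  real(dp), parameter :: rho_in = 1.0_dp
  real(dp) :: fn(ni,nj,0:8), rho(ni,nj), u(ni,nj), v(ni,nj)
  logical :: obst(ni,nj)
  integer :: i, j

  call random_number(fn)
  fn = fn + 0.1_dp              ! keep the density strictly positive
  obst = .false.
  obst(3:5,3:5) = .true.        ! hypothetical square obstacle

  ! Iterations are declared independent, so they may run in any order.
  do concurrent (j = 1:nj, i = 1:ni)
    if (.not. obst(i,j)) then
      rho(i,j) = sum(fn(i,j,:))
      u(i,j) = (fn(i,j,1)+fn(i,j,5)+fn(i,j,8)-fn(i,j,6)-fn(i,j,3)-fn(i,j,7))/rho(i,j)
      v(i,j) = (fn(i,j,5)+fn(i,j,2)+fn(i,j,6)-fn(i,j,7)-fn(i,j,4)-fn(i,j,8))/rho(i,j)
    else
      rho(i,j) = rho_in
      u(i,j) = 0.0_dp
      v(i,j) = 0.0_dp
    end if
  end do

  print *, 'rho inside obstacle: ', rho(4,4)
  print *, 'rho outside obstacle:', rho(1,1)
end program do_concurrent_moments
```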

holmz · ‎04-28-2014

Steve Lionel (Intel) wrote:

The semantics of the "old" and "new" code are not the same, and some of your cases, such as the nested cshift, create array temporaries. Using WHERE will create a temporary for the mask. You're right that sometimes the newer usages may be more compact and readable but have hidden performance issues. Generally they work well, but CSHIFT and WHERE are probably not as highly optimized as some other aspects of the language.

I would also discourage you from using (:,:) to mean a whole array. Most of the time this is harmless, but it can sometimes cause unwanted side effects.

This is somewhat disappointing, but writing WHERE/ENDWHERE followed by a second WHERE/ENDWHERE is easy enough compared to WHERE/ELSEWHERE/ENDWHERE.

As for (:,:), (:) <...etc>, I have always found that they make code much more readable, because they show that a variable is an array and what its rank is.

Nevertheless, the information is useful.
Thanks, RH

TimP · ‎04-30-2014

holmz wrote:

This is somewhat disappointing, but writing WHERE/ENDWHERE followed by a second WHERE/ENDWHERE is easy enough compared to WHERE/ELSEWHERE/ENDWHERE.

I submitted a problem report on a case where the 15.0 beta compiler loses optimization of where(condition)/where(.not. condition). None of the compilers optimize that case with elsewhere. Other such cases continue to perform well. Over the range of my test cases, where varies from slightly better performance than f77 code down to 50% of the performance of f77 with vector directives added (possibly due to lack of similar directives for where). I don't entirely understand the effort to obsolete where, but it makes me wonder if there is a consensus against attempting to optimize elsewhere.

Compared with cshift, all my Xeon examples show a way to get at least double the performance, sometimes requiring use of legacy ifort directives. It seems that peeling ought to be able to achieve full performance for cases with a fixed small shift, so it may be a case of not wanting to devote resources to syntax which doesn't have a C counterpart. I've seen references to something similar in cilk, but Intel(r) Cilk(tm) Plus doesn't appear to work with it, even without optimization. Itanium seemed better suited for cshift, but compilers didn't begin to optimize it until the platform was already doomed.
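
One way to picture the peeling idea for a fixed shift of one is to write it by hand with array sections: a bulk copy for the interior plus a single peeled wrap-around row. The sketch below is an illustration under assumed sizes and names, not what any compiler is known to generate; it checks the hand-written form against cshift.

```fortran
program peeled_shift
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: ni = 7, nj = 4         ! small illustrative sizes
  real(dp) :: f(ni,nj), fn(ni,nj)

  call random_number(f)

  ! Hand-peeled equivalent of fn = cshift(f, -1, 1):
  fn(2:ni, :) = f(1:ni-1, :)    ! bulk: contiguous copy, no index arithmetic
  fn(1, :)    = f(ni, :)        ! peeled part: the wrap-around row

  ! Only copies are involved, so exact comparison is safe here.
  if (all(fn == cshift(f, -1, 1))) then
    print *, 'peeled copy matches cshift'
  else
    error stop 'peeled copy differs from cshift'
  end if
end program peeled_shift
```

The same pattern extends to the other seven directions of example 1 by choosing the matching source and destination sections.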