Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Slowdown with OpenMP

Matt_S_
Beginner

I'm getting some pretty unusual results from using OpenMP on a fractional differential equations code written in Fortran. No matter where I use OpenMP in the code, whether it be on an initialization loop or on a computational loop, I get a slowdown across the entire code. I can put OpenMP in one loop and it will slow down an unrelated one (timed separately)! The code is a bit unusual in that it declares arrays starting at index 0 (and some even at negative bounds). For example,

   real*8 :: gx(0:Nx)
   real*8 :: AxLh(1-Nx:Nx-1), AxRh(1-Nx:Nx-1), AxL0(1-Nx:Nx-1), AxR0(1-Nx:Nx-1)

where Nx is, say, 512. Could that possibly have anything to do with the across-the-board slowdown with OpenMP? Also, any ideas on reducing the "pow" overhead in the following snippet would be greatly appreciated:

   do k = 1, 5
      hgck = foo_c(k)
      hgpk = foo_p(k)
      do j = 1, 100
         vx = vx + hgck * ux(x, t, foo(j) + hgpk)
      end do 
   end do 

where ux is a function defined by

      function ux(x,t,xi)
      implicit none
   
      real*8 :: x,t,xi, ux
   
      ux = 1.0/pi * exp(cospa*(t+0.5)*xi**alpha)*cos(xi*x)

      return
      end function ux

This is the "fractional" part of the code since 0<alpha<1. pi, cospa, and alpha are parametrized in a module. I'm hoping to get the code optimized and parallelized with OpenMP and then see how it performs on the MIC. I'm currently testing it on NICS's Beacon system.

Thanks in advance!

11 Replies
TimP
Honored Contributor III

If you started out by permitting this to vectorize, there may not be sufficient opportunity to gain by parallelism, even if you maintain vectorization. This applies even more so on MIC.

jimdempseyatthecove
Honored Contributor III

Attribute your function ux as a VECTOR function and place it in a module so that the compiler can see it when it compiles the caller. By doing so, the do j loop might vectorize, provided you also use the new simd reduction feature. As Tim stated, if you can vectorize this you might see up to a 4x or 8x improvement in serial code... then you can look at parallelization. Vector inner, parallel outer.
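
Roughly, on the caller side, that could look like the following (an untested sketch, assuming ifort and the names from your snippet; ux itself needs !DIR$ ATTRIBUTES VECTOR or !$omp declare simd so a vector version exists to call):

   do k = 1, 5
      hgck = foo_c(k)
      hgpk = foo_p(k)
      ! vectorize the inner sum; vx is the reduction variable
   !$omp simd reduction(+:vx)
      do j = 1, 100
         vx = vx + hgck * ux(x, t, foo(j) + hgpk)
      end do
   end do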

Jim Dempsey

TimP
Honored Contributor III
Inlining may be as effective as, and less esoteric than, vector function syntax.
jimdempseyatthecove
Honored Contributor III

Tim,

In your experience, is inlining as effective as explicit vectorization directives? In this situation, the function contains a loop that, IMHO, should have vectorized in place. The suggestion of lifting the problematic code into an outlined function is to aid the compiler in inlining this function as a vector function, since IPO will do this.

 I am interested in Matt's response.

Jim Dempsey

TimP
Honored Contributor III

omp simd reduction could be tested both with auto-inlining and with vector or (Intel) elemental function syntax.

I haven't found the simd reduction to be of much use outside of rare cases where it might overcome a "seems inefficient" decision by the compiler.  In that case, !dir$ vector directives might also be considered.

The slogan "explicit vectorization" doesn't seem well carried out.  In the XE2013 compiler, it was feasible to use the various simd directives to over-ride compile options such as /fp:source which would prevent optimization of sum reduction.  In spite of the slogan, this was made ineffective in XE2015.  I don't imagine that anyone would carry the slogan to the point of putting NO VECTOR or simd directives on every loop.

I have plenty of cases where the usual moderately aggressive compile options produce good vectorized code with ifort and gfortran, but applying directives makes it worse for one or the other. As MIC was mentioned in this post, I sometimes need

   #if __MIC__
   !$omp simd ...
   #endif

in order to optimize for both MIC and host.  I'd hesitate to call this explicit MIC vectorization.

omp simd directives ought to offer more portability than the older ones, if Intel would remove the occasional requirement to drop back to the legacy forms of the simd directive (even where other compilers optimize with the standard one). But this is tangential to the present case.

dot_product looks like a more readable choice than what was presented, and the consequent ruling out of simd directives is no loss. I'm assuming that vx has a local declaration (not shown) so that it doesn't inhibit optimization. It's important to check the compiler diagnostics, such as /Qopt-report4.
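
One way to read the dot_product suggestion, using hypothetical temporaries uxvals and ksum, and assuming ux can be inlined or vectorized as discussed above:

   real*8 :: uxvals(100), ksum(5)
   do k = 1, 5
      hgpk = foo_p(k)
      ! evaluate ux at the 100 shifted points for this k
      do j = 1, 100
         uxvals(j) = ux(x, t, foo(j) + hgpk)
      end do
      ksum(k) = sum(uxvals)
   end do
   ! weighted sum over k replaces the explicit accumulation of hgck terms
   vx = vx + dot_product(foo_c(1:5), ksum)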

As you said, if the situation isn't more complicated than presented, one would hope for easy vectorization. Poor decisions during OpenMP parallelization, such as making vx a shared variable, would ruin performance.

Matt_S_
Beginner

Tim and Jim, sorry for the delayed response! SC14 is keeping me busy!

Jim, I have u_x defined in a module as follows

module functions

   real*8, parameter :: D = 0.005, alpha = 1.8, pi = 3.14159265358979324, ...
   real*8, parameter :: cospa = -2.0*D*abs(cos(0.5*pi*alpha))
   ...

   contains
   !-----------------------------------------------------------------------------!

   ...
   
   !-----------------------------------------------------------------------------!
      function u_x(x,t,xi)
      implicit none

      real*8 :: x,t,xi, u_x

      u_x = 1.0/pi * exp(cospa*(t+0.5)*xi**alpha)*cos(xi*x)

      return
      end function u_x
   !-----------------------------------------------------------------------------!

   ...
   
end module functions

I am unsure of what you mean by declaring it as a VECTOR function. Is there a new syntax for doing this? I used the SIMD directive over the summer in my work at LANL (I posted some of it on these forums, I believe), so I will test that out and report on how it works. A "do reduction" slowed the code down.

TimP
Honored Contributor III

A simd directive on a reduction loop with an undeclared reduction may give a wrong result. A parallel loop with an undeclared reduction is almost certain to be wrong. A simd reduction inside a parallel region can be effective.
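
For example, something along these lines (a sketch only; the outer ix loop over gx and the vxout array are hypothetical stand-ins for whatever loop in your code owns vx):

   !$omp parallel do private(vx, hgck, hgpk)
   do ix = 0, Nx
      vx = 0.0d0
      do k = 1, 5
         hgck = foo_c(k)
         hgpk = foo_p(k)
   !$omp simd reduction(+:vx)
         do j = 1, 100
            vx = vx + hgck * ux(gx(ix), t, foo(j) + hgpk)
         end do
      end do
      vxout(ix) = vx
   end do
   !$omp end parallel do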

jimdempseyatthecove
Honored Contributor III

Tim mentioned elsewhere that it is as effective to use inlined functions as it is to use "!DIR$ ATTRIBUTES VECTOR :: YourFunction". When inlining works (vectorizes), that would be the preferred route. Otherwise, experiment with !DIR$...
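
To answer the syntax question: with ifort, the directive goes in the specification part of the function, for example (a sketch; !$omp declare simd(u_x) is the OpenMP 4.0 spelling of roughly the same thing):

      function u_x(x,t,xi)
      ! tell ifort to generate a vector (SIMD) version of u_x in addition
      ! to the scalar one, so calls from vectorized loops can be vectorized
      !dir$ attributes vector :: u_x
      implicit none

      real*8 :: x,t,xi, u_x

      u_x = 1.0/pi * exp(cospa*(t+0.5)*xi**alpha)*cos(xi*x)

      return
      end function u_x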

*** As a requirement for vectorization, see Tim's comment above: a simd directive on a reduction loop with an undeclared reduction may give a wrong result.

Not everything can be vectorized.

**** Unless the newer compiler has changed, declare your constants in the precision of the variable type. Use:

   real*8, parameter :: D = 0.005_8, alpha = 1.8_8, pi = 3.14159265358979324_8, ...
   real*8, parameter :: cospa = -2.0_8*D*abs(cos(0.5_8*pi*alpha))

Jim Dempsey

Matt_S_
Beginner

Guys,

Hope you've had a good holiday. I got some good results (3x speedup with no OpenMP yet) on the u_x part of my problem. I used Intel's v?Pow and v?Exp functions to vectorize the whole thing. Here's what it looks like now

subroutine u_x_v(x,t,xi,alphav,m,ux)
...
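! vdpow and vdexp are MKL VML routines operating on m elements:
! vdpow(m, xi, alphav, xip) sets xip = xi**alphav elementwise,
! vdexp(m, xipm, xipe) sets xipe = exp(xipm) elementwise.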
cospat = cospa*(t+0.5)
call vdpow(m,xi,alphav,xip)
xipm = xip*cospat
call vdexp(m,xipm,xipe)
ux = invpi*xipe*cos(xi*x)

return
end subroutine u_x_v

My issue now is kind of strange. Do any of you know why something as simple as this would not be vectorizing?

program vectest
   implicit none

   integer, parameter :: m = 100
   integer :: i
   real*8 :: t(0:m), dt

   dt = 1.0/m
   t(0) = 0.0
!DIR$ IVDEP
   do i = 0, m-1
      t(i+1) = t(i) + dt
   end do

   stop
end program vectest

The vectorization report from -qopt-report still tells me that there is a vector dependence between t and t (the body of the do loop). Specifically: "vector dependence: assumed FLOW dependence between t line 12 and t line 12". I dealt with this issue over the summer at LANL and the IVDEP directive solved the problem, but it is not cooperating with me now. As a side note, t(0:m) is not my usual allocation style, but that is what is used in the current code I am playing with.

Thanks in advance!

McCalpinJohn
Honored Contributor III

That looks like a true vector dependence to me, so the compiler is probably just ignoring your IVDEP pragma.

Matt_S_
Beginner

John,

I've come to this conclusion as well. I'm going to switch to something like t(i) = i*dt to bypass the recursive nature of the initialization. Thanks for the reply!
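
For reference, a minimal dependence-free version of that initialization (each iteration is independent, so no IVDEP is needed and the loop should vectorize on its own):

   do i = 0, m
      t(i) = i*dt
   end do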
