Beginner

## Slowdown with OpenMP

I'm getting some pretty unusual results from using OpenMP on a fractional differential equations code written in Fortran. No matter where I use OpenMP in the code, whether on an initialization loop or on a computational loop, I get a slowdown across the entire code. I can put OpenMP in one loop and it will slow down an unrelated one (timed separately)! The code is a bit unusual in that it declares arrays starting at 0 (and some even at negative indices). For example,

```
real*8 :: gx(0:Nx)
real*8 :: AxLh(1-Nx:Nx-1), AxRh(1-Nx:Nx-1), AxL0(1-Nx:Nx-1), AxR0(1-Nx:Nx-1)
```

where Nx is, say, 512. Could that have anything to do with the ubiquitous slowdown under OpenMP? Also, any ideas on reducing "pow" overhead in the following snippet would be greatly appreciated:

```
do k = 1, 5
   hgck = foo_c(k)
   hgpk = foo_p(k)
   do j = 1, 100
      vx = vx + hgck * ux(x, t, foo(j) + hgpk)
   end do
end do
```

where ux is a function defined by

```
function ux(x, t, xi)
   implicit none

   real*8 :: x, t, xi, ux

   ux = 1.0/pi * exp(cospa*(t+0.5)*xi**alpha)*cos(xi*x)

   return
end function ux
```

This is the "fractional" part of the code, since 0 < alpha < 1. pi, cospa, and alpha are parameters defined in a module. I'm hoping to get the code optimized and parallelized with OpenMP and then see how it performs on the MIC. I'm currently testing it on NICS's Beacon system.
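On the pow question: not the original code, but a hedged sketch of one common rewrite. Since alpha is fixed, xi**alpha is mathematically exp(alpha*log(xi)), and the factor cospa*(t+0.5) is loop-invariant and can be hoisted out once. All constants and the 0.1d0*j node values below are placeholder assumptions, not values from the post.

```fortran
! Sketch only: comparing the original xi**alpha form with a hoisted
! exp(alpha*log(xi)) form.  All constants below are made-up placeholders.
program pow_hoist
  implicit none
  real*8, parameter :: pi = 3.14159265358979324d0
  real*8, parameter :: alpha = 0.5d0, cospa = -0.01d0
  real*8 :: x, t, ct, xi, vx1, vx2
  integer :: j

  x = 1.0d0
  t = 0.25d0
  ct = cospa*(t + 0.5d0)          ! loop-invariant factor, hoisted once

  vx1 = 0.0d0
  vx2 = 0.0d0
  do j = 1, 100
     xi = 0.1d0*j                 ! stand-in for foo(j) + hgpk; must be > 0 for log
     ! original form: one pow, one exp, one cos per iteration
     vx1 = vx1 + 1.0d0/pi * exp(cospa*(t + 0.5d0)*xi**alpha)*cos(xi*x)
     ! rewritten form: pow expressed as exp(alpha*log(xi))
     vx2 = vx2 + 1.0d0/pi * exp(ct*exp(alpha*log(xi)))*cos(xi*x)
  end do

  if (abs(vx1 - vx2) > 1.0d-10) stop 1
  print *, 'forms agree: vx =', vx1
end program pow_hoist
```

Whether this wins depends on how the compiler's pow is implemented; it is worth timing both.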

11 Replies
Black Belt

If you started out by permitting this to vectorize, there may not be sufficient opportunity to gain from parallelism, even if you maintain vectorization. This applies to MIC even more so.

Black Belt

Attribute your function UX as a VECTOR function and place it in a module so that the compiler can see it when it compiles the caller's code. By doing so, the do j loop might possibly vectorize, provided you also use the new simd reduction feature. As Tim stated, if you can vectorize this you might see up to a 4x or 8x improvement in serial code... then you can look at parallelization. Vector inner, parallel outer.

Jim Dempsey
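A minimal sketch of the arrangement Jim describes, with placeholder constants and node values (nothing below is from the post except the shape of the loop): the VECTOR attribute lets the compiler create a vector version of ux, and the simd reduction clause covers the sum into vx.

```fortran
! Sketch of Jim's suggestion (constants and node values are placeholders):
! a module-hosted function with the VECTOR attribute, called from a
! simd-reduction loop.
module ux_mod
  implicit none
  real*8, parameter :: pi = 3.14159265358979324d0
  real*8, parameter :: alpha = 0.5d0, cospa = -0.01d0
contains
  function ux(x, t, xi)
!DIR$ ATTRIBUTES VECTOR :: ux
    real*8 :: x, t, xi, ux
    ux = 1.0d0/pi * exp(cospa*(t + 0.5d0)*xi**alpha)*cos(xi*x)
  end function ux
end module ux_mod

program simd_reduction
  use ux_mod
  implicit none
  real*8 :: vx, x, t, hgck, hgpk
  integer :: j

  x = 1.0d0; t = 0.25d0
  hgck = 1.0d0; hgpk = 0.0d0      ! stand-ins for foo_c(k), foo_p(k)
  vx = 0.0d0
!$omp simd reduction(+:vx)
  do j = 1, 100
     vx = vx + hgck*ux(x, t, 0.1d0*j + hgpk)   ! 0.1d0*j stands in for foo(j)
  end do
  print *, 'vx =', vx
end program simd_reduction
```

Compilers that do not recognize `!DIR$` or (without OpenMP enabled) `!$omp` simply treat both lines as comments, so the sketch still runs serially.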

Black Belt
Inlining may be as effective as, and less esoteric than, vector syntax.
Black Belt

Tim,

In your experience, is inlining as effective as explicit vectorization directives? In this situation, the function contains a loop that, IMHO, should have vectorized in place. The suggestion of lifting the problematic code into an outlined function is to aid the compiler in inlining this function, as vector IPO will do this.

I am interested in Matt's response.

Jim Dempsey

Black Belt

omp simd reduction could be tested for both auto-inlining or for vector or (Intel) elemental function syntax.

I haven't found the simd reduction to be of much use outside of rare cases where it might overcome a "seems inefficient" decision by the compiler.  In that case, !dir\$ vector directives might also be considered.

The slogan "explicit vectorization" doesn't seem well carried out. In the XE2013 compiler, it was feasible to use the various simd directives to override compile options such as /fp:source, which would otherwise prevent optimization of a sum reduction. In spite of the slogan, this was made ineffective in XE2015. I don't imagine that anyone would carry the slogan to the point of putting NO VECTOR or simd directives on every loop.

I have plenty of cases where the usual moderately aggressive compile options produce good vectorized code with ifort and gfortran, but applying directives makes it worse for one or the other. As MIC was mentioned in this post, I sometimes need

```
#if __MIC__
!$omp simd ...
#endif
```

in order to optimize for both MIC and host.  I'd hesitate to call this explicit MIC vectorization.

omp simd directives ought to offer more portability than the older ones, if Intel would remove the occasional requirement to drop back to the legacy forms of simd directive (even where other compilers optimize with the standard one).  But this is tangential to the present case.

dot_product looks like a more readable choice than what was presented, and the consequent ruling out of simd directives is no loss.  I'm assuming that vx has local declaration (not shown) so that it doesn't inhibit optimization.  It's important to check the compile diagnostics such as /Qopt-report4.

As you said, if the situation isn't more complicated than presented, one would hope for easy vectorization.  Poor decisions during openmp parallelization, such as making vx a shared variable, would ruin performance.
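Tim's dot_product suggestion might look like the following sketch (the weight array and the 0.1d0*j node values are placeholders, not from the post): evaluate the integrand into a temporary array, then let the array intrinsic do the reduction, which compilers typically vectorize well.

```fortran
! Sketch of the dot_product form (names and values are placeholders).
program dot_form
  implicit none
  real*8, parameter :: pi = 3.14159265358979324d0
  real*8, parameter :: alpha = 0.5d0, cospa = -0.01d0
  real*8 :: x, t, vx_loop, vx_dot
  real*8 :: uvals(100), w(100)
  integer :: j

  x = 1.0d0; t = 0.25d0
  w = 2.0d0                      ! stand-in for the hgck quadrature weight
  ! evaluate the integrand into a temporary (0.1d0*j stands in for foo(j)+hgpk)
  do j = 1, 100
     uvals(j) = 1.0d0/pi * exp(cospa*(t + 0.5d0)*(0.1d0*j)**alpha)*cos(0.1d0*j*x)
  end do

  ! scalar-loop form as in the post
  vx_loop = 0.0d0
  do j = 1, 100
     vx_loop = vx_loop + w(j)*uvals(j)
  end do

  ! array-intrinsic form
  vx_dot = dot_product(w, uvals)

  if (abs(vx_loop - vx_dot) > 1.0d-10) stop 1
  print *, 'vx =', vx_dot
end program dot_form
```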

Beginner

Tim and Jim, sorry for the delayed response! SC14 is keeping me busy!

Jim, I have u_x defined in a module as follows

```module functions

real*8, parameter :: D = 0.005, alpha = 1.8, pi = 3.14159265358979324, ...
real*8, parameter :: cospa = -2.0*D*abs(cos(0.5*pi*alpha))
...

contains
!-----------------------------------------------------------------------------!

...

!-----------------------------------------------------------------------------!
function u_x(x,t,xi)
implicit none

real*8 :: x,t,xi, u_x

u_x = 1.0/pi * exp(cospa*(t+0.5)*xi**alpha)*cos(xi*x)

return
end function u_x
!-----------------------------------------------------------------------------!

...

end module functions
```

I am unsure of what you mean by declaring it as a VECTOR function. Is there new syntax for this? I used the SIMD directive over the summer in my work at LANL (I posted some of it on these forums, I believe), so I will test that out and report on how it works. A "do reduction" slowed the code down.

Black Belt

A simd directive on a reduction loop with an undeclared reduction may give a wrong result. A parallel loop with an undeclared reduction is almost certain to be wrong. Simd reduction inside parallel can be effective.
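Tim's warning in sketch form, with placeholder loop bounds and a made-up summand: the reduction variable must appear in a reduction clause on the construct, otherwise concurrent updates of vx race and the result is wrong.

```fortran
! Sketch with placeholder values: the reduction clause is what makes the
! concurrent updates of vx correct; without it the sum would race.
program reduction_clause
  implicit none
  real*8 :: vx
  integer :: j

  vx = 0.0d0
!$omp parallel do simd reduction(+:vx)
  do j = 1, 1000
     vx = vx + 1.0d0/j**2
  end do
  print *, 'vx =', vx            ! partial sum of the Basel series, ~1.6439
end program reduction_clause
```

Without OpenMP enabled the directive is a comment and the loop runs serially, which makes it easy to compare serial and parallel results.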

Black Belt

Tim mentioned elsewhere that it is as effective to use inlined functions as it is to use "!DIR\$ ATTRIBUTES VECTOR :: YourFunction". When inlining works (vectorizes), that would be the preferred route. Otherwise, experiment with !DIR\$...

*** As a prerequisite for vectorization, see Tim's comment: a simd directive on a reduction loop with an undeclared reduction may give a wrong result.

Not everything can be vectorized.

**** Unless the newer compiler has changed this, declare your constants in the precision of the variable type. Use:

```
real*8, parameter :: D = 0.005_8, alpha = 1.8_8, pi = 3.14159265358979324_8, ...
real*8, parameter :: cospa = -2.0_8*D*abs(cos(0.5_8*pi*alpha))
```

Jim Dempsey

Beginner

Guys,

Hope you've had a good holiday. I got some good results (a 3x speedup, with no OpenMP yet) on the u_x part of my problem. I used Intel's v?Pow and v?Exp functions to vectorize the whole thing. Here's what it looks like now:

```
subroutine u_x_v(x, t, xi, alphav, m, ux)
...
   cospat = cospa*(t + 0.5)
   call vdpow(m, xi, alphav, xip)
   xipm = xip*cospat
   call vdexp(m, xipm, xipe)
   ux = invpi*xipe*cos(xi*x)

   return
end subroutine u_x_v
```

My issue now is kind of strange. Do any of you know why something as simple as this would not be vectorizing?

```
program vectest
   implicit none

   integer, parameter :: m = 100
   integer :: i
   real*8 :: t(0:m), dt

   dt = 1.0/m
   t(0) = 0.0
!DIR$ IVDEP
   do i = 0, m-1
      t(i+1) = t(i) + dt
   end do

   stop
end program vectest
```

The vectorization report from -qopt-report still tells me that there is a vector dependence between t and t (the body of the do loop); specifically, "vector dependence: assumed FLOW dependence between t line 12 and t line 12". I dealt with this issue over the summer at LANL and the IVDEP directive solved the problem, but it is not cooperating with me now. As a side note, t(0:m) is not my usual allocation style, but that is what the current code I am playing with uses.
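For what it's worth, the report is arguably correct here: t(i+1) = t(i) + dt is a true recurrence (each iteration reads the value the previous iteration wrote), and IVDEP only asserts that *assumed* dependences are absent; it cannot remove a real one. A dependence-free formulation computes each entry directly; a minimal sketch:

```fortran
! Sketch: the same table built without a loop-carried recurrence,
! so there is no real dependence left for the compiler to assume.
program vectest2
  implicit none
  integer, parameter :: m = 100
  integer :: i
  real*8 :: t(0:m), dt

  dt = 1.0d0/m
  do i = 0, m                    ! each iteration independent: vectorizable
     t(i) = i*dt
  end do

  print *, 't(m) =', t(m)
end program vectest2
```

As a side benefit, i*dt avoids the rounding drift that repeated addition accumulates.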