I'm getting some pretty unusual results from using OpenMP on a fractional differential equations code written in Fortran. No matter where I use OpenMP in the code, whether on an initialization loop or on a computational loop, I get a slowdown across the entire code. I can put OpenMP in one loop and it will slow down an unrelated one (timed separately)! The code is a bit unusual, as it declares arrays with lower bounds of 0 (and some even negative). For example,
    real*8 :: gx(0:Nx)
    real*8 :: AxLh(1-Nx:Nx-1), AxRh(1-Nx:Nx-1), AxL0(1-Nx:Nx-1), AxR0(1-Nx:Nx-1)
where Nx is, let's say, 512. Could that have anything to do with the ubiquitous slowdown with OpenMP? Also, any ideas on reducing "pow" overhead in the following snippet would be greatly appreciated:
    do k = 1, 5
        hgck = foo_c(k)
        hgpk = foo_p(k)
        do j = 1, 100
            vx = vx + hgck * ux(x, t, foo(j) + hgpk)
        end do
    end do
where ux is a function defined by
    function ux(x,t,xi)
        implicit none
        real*8 :: x,t,xi, ux
        ux = 1.0/pi * exp(cospa*(t+0.5)*xi**alpha)*cos(xi*x)
        return
    end function ux
This is the "fractional" part of the code since 0<alpha<1. pi, cospa, and alpha are defined as parameters in a module. I'm hoping to get the code optimized and parallelized with OpenMP and then see how it performs on the MIC. I'm currently testing it on NICS's Beacon system.
Thanks in advance!
If you started out by permitting this to vectorize, there may not be sufficient opportunity to gain from parallelism, even if you maintain vectorization. This applies even more so to MIC.
Attribute your function UX as a VECTOR function and place it in a module so that the compiler can see it when it compiles the caller. By doing so, the do j loop may vectorize, provided you also use the new simd reduction feature. As Tim stated, if you can vectorize this you might see up to 4x or 8x improvement in the serial code... then you can look at parallelization. Vector inner, parallel outer.
In your experience, is inlining as effective as explicit vectorization directives? In this situation, the function is called from a loop that, IMHO, should have vectorized in place. The suggestion of lifting the problematic code into an out-of-line function is meant to help the compiler inline that function, as vector IPO will do this.
I am interested in Matt's response.
omp simd reduction could be tested with auto-inlining, or with the vector (Intel) or elemental function syntax.
I haven't found the simd reduction to be of much use outside of rare cases where it might overcome a "seems inefficient" decision by the compiler. In that case, !dir$ vector directives might also be considered.
The slogan "explicit vectorization" doesn't seem well carried out. In the XE2013 compiler, it was feasible to use the various simd directives to override compile options such as /fp:source, which would otherwise prevent optimization of sum reduction. In spite of the slogan, this was made ineffective in XE2015. I don't imagine that anyone would carry the slogan to the point of putting NO VECTOR or simd directives on every loop.
I have plenty of cases where the usual moderately aggressive compile options produce good vectorized code with ifort and gfortran, but applying directives makes it worse for one or the other. As MIC was mentioned in this post, I sometimes need
!$omp simd ...
in order to optimize for both MIC and host. I'd hesitate to call this explicit MIC vectorization.
omp simd directives ought to offer more portability than the older ones, if Intel would remove the occasional requirement to drop back to the legacy forms of simd directive (even where other compilers optimize with the standard one). But this is tangential to the present case.
dot_product looks like a more readable choice than what was presented, and the consequent ruling out of simd directives is no loss. I'm assuming that vx has local declaration (not shown) so that it doesn't inhibit optimization. It's important to check the compile diagnostics such as /Qopt-report4.
As you said, if the situation isn't more complicated than presented, one would hope for easy vectorization. Poor decisions during OpenMP parallelization, such as making vx a shared variable, would ruin performance.
Tim and Jim, sorry for the delayed response! SC14 is keeping me busy!
Jim, I have u_x defined in a module as follows
    module functions
        real*8, parameter :: D = 0.005, alpha = 1.8, pi = 3.14159265358979324, ...
        real*8, parameter :: cospa = -2.0*D*abs(cos(0.5*pi*alpha))
        ...
    contains
    !-----------------------------------------------------------------------------!
        ...
    !-----------------------------------------------------------------------------!
        function u_x(x,t,xi)
            implicit none
            real*8 :: x,t,xi, u_x
            u_x = 1.0/pi * exp(cospa*(t+0.5)*xi**alpha)*cos(xi*x)
            return
        end function u_x
    !-----------------------------------------------------------------------------!
        ...
    end module functions
I am unsure of what you mean by declaring it as a VECTOR function. Is there a new syntax to do this? I used the SIMD directive over the summer in my work at LANL (I posted some of it on these forums, I believe), so I will test that out and report on how it works. A "do reduction" slowed the code down.
A simd directive on a reduction loop with an undeclared reduction may give a wrong result. A parallel loop with an undeclared reduction is almost certain to be wrong. A simd reduction inside a parallel loop can be effective.
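The point about declared reductions can be sketched on the loop nest from the original question. This is a minimal sketch, not the poster's code: the contents of foo, foo_c and foo_p are made up, and ux is replaced by a simplified stand-in (exp(-xi)*cos(xi*x)) so that the program is self-contained. The accumulator vx is declared as a reduction on both the parallel outer loop and the simd inner loop.

```fortran
program reduction_sketch
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: nk = 5, nj = 100
    real(dp) :: foo(nj), foo_c(nk), foo_p(nk)   ! hypothetical data, not from the thread
    real(dp) :: x, vx, hgck, hgpk, xi
    integer  :: j, k

    x = 0.5_dp
    do j = 1, nj
        foo(j) = real(j, dp) / real(nj, dp)
    end do
    do k = 1, nk
        foo_c(k) = 1.0_dp / real(k, dp)
        foo_p(k) = 0.1_dp * real(k, dp)
    end do

    vx = 0.0_dp
    ! Parallel outer loop: vx is declared as a reduction so that each thread
    ! accumulates into its own copy; hgck, hgpk, xi and j are private.
    !$omp parallel do reduction(+:vx) private(hgck, hgpk, xi, j)
    do k = 1, nk
        hgck = foo_c(k)
        hgpk = foo_p(k)
        ! Vector inner loop: the same reduction is declared for simd.
        !$omp simd reduction(+:vx) private(xi)
        do j = 1, nj
            xi = foo(j) + hgpk
            vx = vx + hgck * exp(-xi) * cos(xi * x)   ! stand-in for ux(x, t, xi)
        end do
    end do
    !$omp end parallel do

    print *, 'vx =', vx
end program reduction_sketch
```

Without OpenMP enabled the directives are plain comments and the program runs serially, which gives a reference value to compare against. Because a reduction changes the order of the additions, the parallel result may differ from the serial one in the last few digits.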
Tim mentioned elsewhere that it is as effective to use inline functions as it is to use "!DIR$ ATTRIBUTES VECTOR :: YourFunction". When inlining works (vectorizes) that would be the preferred route. Otherwise experiment with !DIR$...
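For the question about syntax, here is a hedged sketch of what a vector function declaration could look like using the OpenMP 4.0 declare simd directive, with the Intel-specific directive named above shown only as a comment. The parameter values are copied from the module posted earlier in the thread; the dp kind parameter, the module name functions_sketch, the driver program and its sample points are assumptions made for illustration. Whether the compiler actually generates and uses a vector version should be checked in the optimization report.

```fortran
module functions_sketch
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    ! Values copied from the module shown earlier in the thread.
    real(dp), parameter :: D = 0.005_dp, alpha = 1.8_dp
    real(dp), parameter :: pi = 3.14159265358979324_dp
    real(dp), parameter :: cospa = -2.0_dp * D * abs(cos(0.5_dp * pi * alpha))
contains
    function u_x(x, t, xi)
        ! OpenMP 4.0 form: request a SIMD version of u_x in which x and t
        ! are the same for all lanes and only xi varies.
        !$omp declare simd(u_x) uniform(x, t)
        ! Intel-specific alternative mentioned in the thread (left commented out):
        ! !DIR$ ATTRIBUTES VECTOR :: u_x
        real(dp), intent(in) :: x, t, xi
        real(dp) :: u_x
        u_x = 1.0_dp / pi * exp(cospa * (t + 0.5_dp) * xi**alpha) * cos(xi * x)
    end function u_x
end module functions_sketch

program call_u_x
    use functions_sketch
    implicit none
    integer, parameter :: n = 100
    real(dp) :: xi(n), vx
    integer  :: j

    do j = 1, n
        xi(j) = real(j, dp) / real(n, dp)   ! hypothetical sample points
    end do

    vx = 0.0_dp
    ! The reduction on vx is declared, as required for a correct result.
    !$omp simd reduction(+:vx)
    do j = 1, n
        vx = vx + u_x(0.5_dp, 0.0_dp, xi(j))
    end do

    print *, 'vx =', vx
end program call_u_x
```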
*** As a requirement for vectorization, see Tim's comment: "Simd directive on a reduction loop with undeclared reduction may give wrong result."
Not everything can be vectorized.
**** Unless the newer compiler has changed, declare your constants in the precision of the variable type. Use:
    real*8, parameter :: D = 0.005_8, alpha = 1.8_8, pi = 3.14159265358979324_8, ...
    real*8, parameter :: cospa = -2.0_8*D*abs(cos(0.5_8*pi*alpha))
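The effect of the literal's precision can be seen with a small test. This is a minimal sketch: the dp kind parameter and the variable names are assumptions for illustration. A literal such as 1.8 is of default real kind (normally single precision) and is rounded at that precision before it is widened to the double-precision parameter.

```fortran
program literal_precision
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    real(dp), parameter :: alpha_single = 1.8      ! default-real literal, widened afterwards
    real(dp), parameter :: alpha_double = 1.8_dp   ! double-precision literal

    print *, 'alpha from 1.8    :', alpha_single
    print *, 'alpha from 1.8_dp :', alpha_double
    print *, 'difference        :', abs(alpha_single - alpha_double)
end program literal_precision
```

With default compiler settings the two values would be expected to differ at about the level of single-precision rounding. Compiler options that promote default reals to double precision would hide the difference.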
Hope you've had a good holiday. I got some good results (3x speedup with no OpenMP yet) on the u_x part of my problem. I used Intel's v?Pow and v?Exp functions to vectorize the whole thing. Here's what it looks like now
    subroutine u_x_v(x,t,xi,alphav,m,ux)
        ...
        cospat = cospa*(t+0.5)
        call vdpow(m,xi,alphav,xip)
        xipm = xip*cospat
        call vdexp(m,xipm,xipe)
        ux = invpi*xipe*cos(xi*x)
        return
    end subroutine u_x_v
My issue now is kind of strange. Do any of you know why something as simple as this would not be vectorizing?
    program vectest
        implicit none
        integer, parameter :: m = 100
        integer :: i
        real*8 :: t(0:m), dt
        dt = 1.0/m
        t(0) = 0.0
        !DIR$ IVDEP
        do i = 0, m-1
            t(i+1) = t(i) + dt
        end do
        stop
    end program vectest
The vector report from -qopt-report still tells me that there is a vector dependence between t and t (the innards of the do loop). Specifically, "vector dependence: assumed FLOW dependence between t line 12 and t line 12". I dealt with this issue over the summer at LANL and the IVDEP directive solved the problem, but it is not cooperating with me now. As a side note, t(0:m) is not my usual allocation style, but that is what is used in the current code I am playing with.
Thanks in advance!
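One possible reading of the vectest report: t(i+1) is computed from t(i), which was written in the previous iteration, so the loop carries a genuine flow dependence (a recurrence) rather than one the compiler merely assumes. IVDEP only tells the compiler to ignore assumed dependences, so it would not be expected to help here. A minimal sketch, assuming a uniform grid is all that is needed, computes each element from the loop index instead; the dp kind parameter and the program name are assumptions for illustration.

```fortran
program vectest_sketch
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: m = 100
    integer  :: i
    real(dp) :: t(0:m), dt

    dt = 1.0_dp / real(m, dp)

    ! Each element depends only on the loop index, not on t(i-1),
    ! so no loop-carried dependence remains to block vectorization.
    do i = 0, m
        t(i) = real(i, dp) * dt
    end do

    print *, 't(0) =', t(0), ' t(m) =', t(m)
end program vectest_sketch
```

The values can differ from the accumulated sum in the last digits, since i*dt involves a single rounding while the running sum accumulates rounding error over m additions.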