
Parallelising this algorithm for MIC: Stuck!

james_B_8
Beginner

Hello.

This is also an OpenMP question.

I'm trying to parallelise this algorithm for the MIC. It does cubic spline interpolation of SourceNum*Ni vectors of length Nj, so the interpolation is done along the j direction. There are Ni of these vectors per source, all independent. Due to a compromise over data layout elsewhere in the program I have to arrange it so that i is the least significant (fastest-varying) index, not j. This makes the algorithm a bit more complicated.

Basically the k loop and the i loop are both embarrassingly parallel. The middle j loop has a flow dependence and hence is serial. The trip counts are about 3 for k, ~180 for j, and ~500 for i. What I want is nested parallelism: there are enough threads on the Phi that I reckon it'll be worth it. I'm reluctant to implement it as nested parallel do's just because of the overhead of starting those inner parallel regions.

What I tried already was to make the k loop a parallel do and to have the j loop spawn tasks over chunks of the i loop. However, this resulted in horrible bugs and race conditions that I could not figure out, likely due to my ignorance of how tasks work. Could someone take a fresh look at it and give me some suggestions on how to parallelise it with OpenMP, please?

NB Evolve_q, TimeSteps, Src, and ddSrc are static global vars inside the containing module.

Cheers,
James

[fortran]

subroutine InitSourceInterpolation_MIC
    use omp_lib, only: omp_get_wtime
    integer i,j,k,Ni,Nj
    real(dl) :: t0
    real(dl), allocatable :: u(:,:)

    Ni = TimeSteps%npoints
    Nj = Evolve_q%npoints          

    allocate(u(Ni,Nj))

    t0 = omp_get_wtime()

    do k=1,SourceNum

        ! Set the upper and lower boundary conditions to be natural (0)
        do i=1,Ni
            ddSrc(i,1,k)    = 0._dl
            ddSrc(i,Nj,k)   = 0._dl
            u(i,1)        = 0._dl
            u(i,Nj)       = 0._dl
        end do

        ! do the decomposition loop of the tridiagonal algorithm for all the systems
        ! of equations
        do j=2,Nj-1
            do i=1,Ni
                call spline_j(Evolve_q%points(j-1), Evolve_q%points(j), &
                                Evolve_q%points(j+1),                    &
                                Src(i,j-1,k), Src(i,j,k), Src(i,j+1,k), &
                                u(i,j-1), ddSrc(i,j-1,k),               &
                                u(i,j), ddSrc(i,j,k) )
            end do
        end do

        ! do the backsubstitution loop of the tridiagonal algorithm
        do j=Nj-1, 1, -1
            do i=1,Ni
                ddSrc(i,j,k) = ddSrc(i,j,k)*ddSrc(i,j+1,k) + u(i,j)
            end do
        end do

    end do
    ! clean up
    deallocate(u)

 

contains

   pure subroutine spline_j( xMin1, x, xAdd1, &
                              yMin1, y, yAdd1, &
                              uMin1, y2Min1,   &
                              u, y2 )

        real(dl), intent(in) :: xMin1, x, xAdd1
        real(dl), intent(in) :: yMin1, y, yAdd1
        real(dl), intent(in) :: uMin1, y2Min1
        real(dl), intent(out) :: u, y2

        real(dl) :: sig, p, pp

        sig = (x-xMin1) / (xAdd1-xMin1)
        p   = sig * y2Min1 + 2
        pp = (yAdd1-y)/(xAdd1-x) - (y-yMin1)/(x-xMin1)

        y2 = (sig-1._dl)/p
        u = ( 6._dl * pp / (xAdd1-xMin1) - sig*uMin1)/p
    end subroutine spline_j

end subroutine InitSourceInterpolation_MIC

[/fortran]

James_C_Intel2
Employee

Before getting too subtle, have you tried doing something simple and measuring it?

I'd try

  1. An omp parallel do on the first i loop (the boundary-condition zeroing). This is just to ensure that the initializations are done in the threads that then read the same data, to try to avoid cache misses. It may not actually gain anything. (So try with and without.)
  2. An omp parallel just before the decomposition j loop, with the matching end parallel after the backsubstitution j loop (still inside the k loop), and add private(j).
  3. An omp do on each of the two i loops inside that region (the decomposition and backsubstitution inner loops).

I think that should be correct (the barriers at the end of the work-sharing loops prevent the outer loops from getting out of step in the different threads), and there's only one fork/join for each of the outermost loops, so the overhead shouldn't be too bad.
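
Roughly, applied to your loop nest, that looks like the following. This is an untested sketch that drops into the existing subroutine (same declarations and the same spline_j); u can stay shared here because only one k is processed at a time:

[fortran]
    do k=1,SourceNum

        ! Set the upper and lower boundary conditions to be natural (0)
        !$OMP PARALLEL DO PRIVATE(i)
        do i=1,Ni
            ddSrc(i,1,k)    = 0._dl
            ddSrc(i,Nj,k)   = 0._dl
            u(i,1)        = 0._dl
            u(i,Nj)       = 0._dl
        end do
        !$OMP END PARALLEL DO

        ! one parallel region for both tridiagonal sweeps; the implicit
        ! barrier at each END DO keeps the serial j recurrences in order
        !$OMP PARALLEL PRIVATE(i,j)
        do j=2,Nj-1
            !$OMP DO
            do i=1,Ni
                call spline_j(Evolve_q%points(j-1), Evolve_q%points(j), &
                                Evolve_q%points(j+1),                    &
                                Src(i,j-1,k), Src(i,j,k), Src(i,j+1,k), &
                                u(i,j-1), ddSrc(i,j-1,k),               &
                                u(i,j), ddSrc(i,j,k) )
            end do
            !$OMP END DO
        end do

        do j=Nj-1, 1, -1
            !$OMP DO
            do i=1,Ni
                ddSrc(i,j,k) = ddSrc(i,j,k)*ddSrc(i,j+1,k) + u(i,j)
            end do
            !$OMP END DO
        end do
        !$OMP END PARALLEL

    end do
[/fortran]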

jimdempseyatthecove
Honored Contributor III

The following, untested, may be a good starting point:

[fortran]
subroutine InitSourceInterpolation_MIC
    use omp_lib, only: omp_get_wtime, omp_get_max_threads, omp_set_num_threads
    integer i,j,k,Ni,Nj
    integer :: nThreads
    real(dl) :: t0
    real(dl), allocatable :: u(:,:)
    Ni = TimeSteps%npoints
    Nj = Evolve_q%npoints          
    t0 = omp_get_wtime()
    nThreads = omp_get_max_threads()
    call omp_set_num_threads(SourceNum) ! 3
!$OMP PARALLEL PRIVATE(u,k,i,j)
    allocate(u(Ni,Nj))
!$OMP DO SCHEDULE(STATIC,1)
    do k=1,SourceNum
        ! Set the upper and lower boundary conditions to be natural (0)
        call omp_set_num_threads(Ni/SourceNum) ! ~500/3
!$OMP PARALLEL DO
        do i=1,Ni
            ddSrc(i,1,k)    = 0._dl
            ddSrc(i,Nj,k)   = 0._dl
            u(i,1)        = 0._dl
            u(i,Nj)       = 0._dl
        end do
!$OMP END PARALLEL DO
        ! do the decomposition loop of the tridiagonal algorithm for all the systems
        ! of equations
        ! N.B. Due to spline_J being called with j-1 and j+1 we cannot parallelize j
        ! except in differing k sections (see above)
        do j=2,Nj-1
            call omp_set_num_threads(Ni/SourceNum) ! ~500/3
!$OMP PARALLEL DO
            do i=1,Ni
                call spline_j(Evolve_q%points(j-1), Evolve_q%points(j), &
                                Evolve_q%points(j+1),                    &
                                Src(i,j-1,k), Src(i,j,k), Src(i,j+1,k), &
                                u(i,j-1), ddSrc(i,j-1,k),               &
                                u(i,j), ddSrc(i,j,k) )
            end do
!$OMP END PARALLEL DO
        end do
        ! do the backsubstitution loop of the tridiagonal algorithm
        do j=Nj-1, 1, -1
            do i=1,Ni
                ddSrc(i,j,k) = ddSrc(i,j,k)*ddSrc(i,j+1,k) + u(i,j)
            end do
        end do
    end do
!$OMP END DO
    ! clean up
    deallocate(u)
!$OMP END PARALLEL
    t0 = omp_get_wtime() - t0 ! elapse time
contains
   pure subroutine spline_j( xMin1, x, xAdd1, &
                              yMin1, y, yAdd1, &
                              uMin1, y2Min1,   &
                              u, y2 )
        real(dl), intent(in) :: xMin1, x, xAdd1
        real(dl), intent(in) :: yMin1, y, yAdd1
        real(dl), intent(in) :: uMin1, y2Min1
        real(dl), intent(out) :: u, y2
        real(dl) :: sig, p, pp
        sig = (x-xMin1) / (xAdd1-xMin1)
        p   = sig * y2Min1 + 2
        pp = (yAdd1-y)/(xAdd1-x) - (y-yMin1)/(x-xMin1)
        y2 = (sig-1._dl)/p
        u = ( 6._dl * pp / (xAdd1-xMin1) - sig*uMin1)/p
    end subroutine spline_j
end subroutine InitSourceInterpolation_MIC
[/fortran]

This will require:

a) nested parallelism to be enabled
b) InitSourceInterpolation_MIC to be called while NOT already inside a parallel region
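
For (a), for example (untested; setting OMP_NESTED=true in the environment works as well):

[fortran]
! enable nested parallel regions before calling InitSourceInterpolation_MIC
call omp_set_nested(.true.)
[/fortran]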

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

Note, the second i loop may not benefit from parallelization (or from too much parallelization) due to the small amount of computation per iteration. You could experiment with varying the number of threads (no parallel do, a 2-thread parallel do, 3 threads, ...).
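
One convenient way to run that experiment is a NUM_THREADS clause on the inner region instead of calling omp_set_num_threads each time; a sketch, using the zeroing loop as the example:

[fortran]
!$OMP PARALLEL DO NUM_THREADS(2)   ! try 1, 2, 3, ... threads and time each
        do i=1,Ni
            ddSrc(i,1,k)    = 0._dl
            ddSrc(i,Nj,k)   = 0._dl
            u(i,1)        = 0._dl
            u(i,Nj)       = 0._dl
        end do
!$OMP END PARALLEL DO
[/fortran]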

Jim Dempsey

james_B_8
Beginner

Thanks Jim, I'll give it a try. I managed to get my tasking version (where tasks handled chunks of 100 iterations of the i loop) working, but it turned out to be slightly slower when run on 120 Xeon Phi threads than when run serially on 1 Xeon core. So even with the additional data transfers it's still cheaper to leave this computation on the host side...
Do you think tasks have a higher overhead than nested parallel do's here?

NB: there is a race condition in the original code I wrote. The temporary array 'u' is shared by all threads, which is broken once the k loop is parallel; u needs to be changed to a 3-d array u(:,:,:) with dims (Ni,Nj,SourceNum).
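
Something like this, as a sketch of that fix (the alternative, as in Jim's version, is to make u private per thread):

[fortran]
    ! one work-array slice per source, so concurrent k iterations don't collide
    real(dl), allocatable :: u(:,:,:)
    allocate(u(Ni,Nj,SourceNum))
    ! then index u(i,j,k) wherever the 2-d version used u(i,j)
[/fortran]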

Cheers,
James

jimdempseyatthecove
Honored Contributor III

RE: u

In the code suggestion I offered, u is private to (and allocated by) each thread running the k loop, and that private copy is then shared within the thread's inner team inside the k loop.

Jim Dempsey

james_B_8
Beginner

Hi Jim

I tried your suggestion and it did improve things: fairly nice 90x scaling with 116 threads on the MIC. However, the first loop has a race condition somewhere and the answer comes out slightly wrong. Taking the nested parallelism off that loop fixes it, but kills performance for large Ni. I cannot find what I'm doing wrong, though. Can you see anything?

[fortran]

  subroutine InitSourceInterpolation_new
    use omp_lib, only: omp_get_wtime, omp_get_max_threads, omp_set_num_threads
    integer i,j,k,Ni,Nj,ii
    real(dl) :: t0
    real(dl), allocatable :: u(:,:)
    !DIR$ ATTRIBUTES ALIGN : 64 :: u
    integer :: nThreads

    Ni = TimeSteps_npoints
    Nj = Evolve_q_npoints
    t0 = omp_get_wtime()
    nThreads = omp_get_max_threads()

    call omp_set_num_threads(SourceNum)

    !$OMP PARALLEL DEFAULT(none) PRIVATE(u,k,i,j) &
    !$OMP & SHARED(nThreads,SourceNum,Ni,Nj,pEvolve_q_points,TimeSteps_npoints,Evolve_q_npoints,Src,ddSrc)

    allocate(u(Ni,Nj))

    !$OMP DO SCHEDULE(STATIC,1)
    do k=1,SourceNum

        ! Set the upper and lower boundary conditions to be natural (0)
        call omp_set_num_threads(nThreads/SourceNum)
        !$OMP PARALLEL DO DEFAULT(shared) PRIVATE(i) SHARED(ddSrc,u,Ni,Nj,TimeSteps_npoints,Evolve_q_npoints,k)
        do i=1,Ni
            ddSrc(i,1,k)    = 0._dl
            ddSrc(i,Nj,k)   = 0._dl
            u(i,1)        = 0._dl
            u(i,Nj)       = 0._dl
        end do
        !$OMP END PARALLEL DO

      ! do the decomposition loop of the tridiagonal algorithm for all the systems
        ! of equations
        do j=2,Nj-1
            call omp_set_num_threads(nThreads/SourceNum)
            !$OMP PARALLEL DO DEFAULT(none) PRIVATE(i) &
            !$OMP & SHARED(pEvolve_q_points,Src,ddSrc,u,Ni,TimeSteps_npoints,j,k)
            do i=1,Ni
                call spline_j(pEvolve_q_points(j-1), pEvolve_q_points(j), &
                    pEvolve_q_points(j+1),                    &
                    Src(i,j-1,k), Src(i,j,k), Src(i,j+1,k), &
                    u(i,j-1), ddSrc(i,j-1,k),               &
                    u(i,j), ddSrc(i,j,k) )
            end do
            !$OMP END PARALLEL DO
        end do

        ! do the backsubstitution loop of the tridiagonal algorithm
        do j=Nj-1, 1, -1
            do i=1,Ni
                ddSrc(i,j,k) = ddSrc(i,j,k)*ddSrc(i,j+1,k) + u(i,j)
            end do
        end do

    end do
    !$OMP END DO
   ! clean up
    deallocate(u)

    !$OMP END PARALLEL

    ! set num threads back to max
    call omp_set_num_threads(nThreads)

    print *, "New InitSourceInterpolation time = ", omp_get_wtime() - t0

    contains

    !DIR$ ATTRIBUTES OFFLOAD : mic :: spline_j
    pure subroutine spline_j( xMin1, x, xAdd1, &
                              yMin1, y, yAdd1, &
                              uMin1, y2Min1,   &
                              u, y2 )

        real(dl), intent(in) :: xMin1, x, xAdd1
        real(dl), intent(in) :: yMin1, y, yAdd1
        real(dl), intent(in) :: uMin1, y2Min1
        real(dl), intent(out) :: u, y2

        real(dl) :: sig, p, pp

        sig = (x-xMin1) / (xAdd1-xMin1)
        p   = sig * y2Min1 + 2
        pp = (yAdd1-y)/(xAdd1-x) - (y-yMin1)/(x-xMin1)

        y2 = (sig-1._dl)/p
        u = ( 6._dl * pp / (xAdd1-xMin1) - sig*uMin1)/p
    end subroutine spline_j

    end subroutine InitSourceInterpolation_new

 

[/fortran]

Thanks,

James

jimdempseyatthecove
Honored Contributor III

Could you show the entire snippet of the code as you did in your first post? This will eliminate any differences in the assumptions we may be making.

Until I see the code, the only thing I can think of is that the compiler may be confused as to whether the SHARED(u) of the nested region references a) the PRIVATE(u) of the enclosing region, or b) the outer (non-region) u. Scoping-wise it should reference the PRIVATE(u) of the enclosing region; however, this would be a good candidate for a compiler bug. You could test this, and construct a work-around, by extracting the body of the do k loop into a subroutine, thus causing the compiler to see only the u of the dummy argument passed via the call.
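
Something along these lines is what I mean; an untested sketch, with do_one_source as an illustrative name (the nested PARALLEL DO directives go back around the i loops exactly as before):

[fortran]
    ! inside the !$OMP DO, the body of the k loop reduces to
    !     call do_one_source(k, Ni, Nj, u)
    ! and the following goes after the CONTAINS:

    subroutine do_one_source(k, Ni, Nj, u)
        integer, intent(in)     :: k, Ni, Nj
        real(dl), intent(inout) :: u(Ni,Nj)   ! unambiguously the caller's private copy
        integer :: i, j

        do i=1,Ni
            ddSrc(i,1,k)  = 0._dl
            ddSrc(i,Nj,k) = 0._dl
            u(i,1)        = 0._dl
            u(i,Nj)       = 0._dl
        end do

        do j=2,Nj-1
            do i=1,Ni
                call spline_j(pEvolve_q_points(j-1), pEvolve_q_points(j), &
                              pEvolve_q_points(j+1),                      &
                              Src(i,j-1,k), Src(i,j,k), Src(i,j+1,k),     &
                              u(i,j-1), ddSrc(i,j-1,k),                   &
                              u(i,j), ddSrc(i,j,k) )
            end do
        end do

        do j=Nj-1, 1, -1
            do i=1,Ni
                ddSrc(i,j,k) = ddSrc(i,j,k)*ddSrc(i,j+1,k) + u(i,j)
            end do
        end do
    end subroutine do_one_source
[/fortran]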

90x scaling is quite remarkable considering that you have ?60? cores

Jim Dempsey

james_B_8
Beginner

Yeah, actually upon re-evaluation I botched the timing! :P The speed-up is more like 55x, lol!

james_B_8
Beginner

Actually, another bizarre thing: if I call omp_set_nested(.TRUE.) then all the performance goes away... Why is it fast when nested parallelism is apparently disabled (by default)?

jimdempseyatthecove
Honored Contributor III

Is your application run as a native application on the MIC, or as an offloaded subroutine (procedure)?

For offload, the example on page 45 of "heterogeneous-programming-model.pdf" shows "call omp_set_nested(.true.)" being issued on the host prior to the sections, with an offloaded parallel do issued inside a section. It is unclear whether the "nested" property migrates into the MIC.

If you offload the entire subroutine, then "call omp_set_nested(.true.)", if issued within the subroutine, should apply to the process on the MIC running the offloaded subroutine (assuming a MIC environment variable does not override it).

Jim Dempsey

 

james_B_8
Beginner

OK. I stuck the body of the k loop in a subroutine and the race condition went away, taking all the performance with it. It now runs in pretty much the same time as it does serially on 1 Xeon core on the host. It must have been some weird bug that made it skip a load of work and give a fast time. I might just leave it on the host and optimise it there using cache blocking, for example.

Thanks for your help.

James

James_C_Intel2
Employee

James B. wrote:

Actually another bizarre thing is if I set 'omp_set_nested(.TRUE.)' then all the performance goes away... Why is it fast when nested parallelism is apparently disabled (by default?)?

Nesting in OpenMP is dangerous and potentially expensive. Unless you are very careful you will have exponential over-subscription (in your case, with two levels, n**2 threads). Asking the kernel to multiplex ~240 threads onto a single hardware thread can be very bad indeed for performance.

This is why my suggested parallelization did not use any nested parallelism...

TimP
Honored Contributor III

If you can keep a sufficient number of threads busy at the outer level, that will be more efficient than nested parallelism. Nested parallelism is worth considering when insufficient parallelism is available at the outer level, and it may require facilities for controlling locality which haven't existed in any high-level fashion.

OpenMP 4.0 purports to offer facilities to support data locality under nested parallelism, but we haven't been offered any working examples.

james_B_8
Beginner

TimP, the problem here is that the outer loop can only be 2 or 3 iterations; that's why I wanted nested parallelism. I got it to work with about 5-10x strong scaling on the MIC, but it was still slower than just doing it serially on the host side and transferring the data back, so I've just left it.

Yes, I tried to use OMP4's thread teams here with version 14 of ifort, but apparently you can only use them immediately following the OMP4 equivalent of an offload region (a target construct). Still waiting on those examples to come out... :/

James_C_Intel2
Employee

> TimP, the problem here is that the outer loop can only be 2 or 3 iterations; that's why I wanted nested parallelism.

If you look at my suggestion above, I was suggesting distributing the "i" loops, which (you said) is where the parallelism is, but only using one level of parallelism.

james_B_8
Beginner

Hi James,

Yeah, I tried that too. It made it faster, but the serial host version is still faster.

James

Ravi_V_Intel
Employee

Have you considered using Intel(R) Advisor XE 2013 to model parallelism in the code above? With this tool you can estimate the scalability of that piece of code without actually parallelising it (you model the intended parallelism with Advisor XE annotations), and you can also run a correctness analysis to discover potential race conditions, again without having to parallelize the code first. Please check out the tool to see if it helps, and let us know if you need any help. Intel(R) Advisor XE 2013 supports C, C++ and Fortran code on both Microsoft Windows* and Linux*, and C# on Microsoft Windows*. Here is how a typical scalability graph looks in Intel(R) Advisor XE 2013:

[Image: 1-suitability-SP1-n.png - Advisor XE suitability (scalability) graph]
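
For Fortran, the annotations are calls into the advisor_annotate module that ships with the tool; roughly, to model parallelizing the inner i loop it would look like this (a sketch only, with illustrative site/task names):

[fortran]
    use advisor_annotate   ! Advisor XE annotation module (add its include dir to the build)

    call annotate_site_begin("spline_i_loop")       ! the region you intend to parallelize
    do i=1,Ni
        call annotate_task_begin("spline_i_task")   ! one task per iteration (or per chunk)
        call spline_j(Evolve_q%points(j-1), Evolve_q%points(j), Evolve_q%points(j+1), &
                      Src(i,j-1,k), Src(i,j,k), Src(i,j+1,k),                         &
                      u(i,j-1), ddSrc(i,j-1,k), u(i,j), ddSrc(i,j,k) )
        call annotate_task_end
    end do
    call annotate_site_end
[/fortran]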

 

 

Ravi_V_Intel
Employee

I also meant to mention that the latest version, Intel(R) Advisor XE 2013 Update 4, supports scalability modelling for up to 8192 CPUs. This can help you gauge whether it is worth your while to parallelize a piece of code for Intel(R) Xeon Phi(TM) processors or other many-core processors. Please check out this article that explains this in more detail:

    Discover scalability of your software for upto 8192 CPUs using Update 4 version of Intel(R) Advisor XE 2013
