Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Coarray sync problems

Jan_W_2
Beginner
753 Views

Hi,

 

Is there a way to resolve coarray sync issues with VTune? Is there maybe a tutorial for this?

 

 

Thank you

Jan

 

0 Kudos
7 Replies
Steven_L_Intel1
Employee
684 Views

Please describe the problem in more detail. VTune Amplifier XE won't help with coarray issues, since coarrays are implemented on top of MPI. Intel Trace Analyzer and Collector can help with this.

0 Kudos
Jan_W_2
Beginner
684 Views

Dear Steve,

It is a transpose-free QMR solver. I want to use coarrays to parallelize the matrix-vector product.

Unfortunately, if I use more than one image the calculation slows down.

This is the matrix-vector product routine:

subroutine matvec(a,x,y)
      type(coo_matrix), intent(in)                   :: a
      complex(dp), dimension(:), intent(in)          :: x
      complex(dp), dimension(size(x,1)), intent(out) :: y
      complex(dp), allocatable, dimension(:)         :: tmp[:]
      integer :: i, me, numi

      me   = this_image()
      numi = num_images()
!allocate
      allocate(tmp(a%n)[*])
      y   = cmplx(0.0_dp, 0.0_dp, dp)
      tmp = cmplx(0.0_dp, 0.0_dp, dp)
!sync
      sync all
!Add locally all
      do i = 1, a%loc_nnz
         tmp(a%ir(i)) = tmp(a%ir(i)) + (a%val(i) * x(a%jc(i)))
      end do
!sync
      sync all
!Sum all coarrays together
      y = globalSum_serial(tmp, a%n)
!deallocate
      deallocate(tmp)
end subroutine
And the sum function is:

function globalSum_serial(vec,n) result(this)
      complex(dp), dimension(:), intent(inout) :: vec
      integer, intent(in)       :: n
      complex(dp), dimension(n) :: this
      integer :: i, me, numi

      me   = this_image()
      numi = num_images()
      sync all
      if (me == 1) then
         do i = 2, numi
            vec(:)[1] = vec(:)[1] + vec(:)
         end do
         this(:) = vec(:)
      end if
      sync all
      if (me /= 1) this(:) = vec(:)[1]
      sync all
end function
I compile this with ifort 15.0.1, using only -coarray.

With a big matrix, the matvec routine slows down the more images I use.

I did a basic hotspot analysis with VTune, and it says that ICAF_BARRIER and ICAF_UNLOC are the code segments that need the most time.

Thank you,

Jan

     

0 Kudos
Steven_L_Intel1
Employee
684 Views

What happens if you replace:

vec(:)[1] = vec(:)[1] + vec(:)

with:

vec(:) = vec(:) + vec(:)

?

0 Kudos
Jan_W_2
Beginner
684 Views

I tried it, but it doesn't help ... the calculation speed is the same as before.

Thank you

Jan

0 Kudos
Steven_L_Intel1
Employee
684 Views

Is VTune telling you where those synchronization calls are coming from in your code? Trace Analyzer and Collector's timeline display can be helpful in understanding what is happening.

0 Kudos
Jan_W_2
Beginner
684 Views

Dear Steve,

I checked the VTune analysis again, and it tells me that the barriers are in the globalSum_serial function.

I also checked how long each image needs for the summing; they all need more or less the same time.

Thank you,

Jan

0 Kudos
reinhold-bader
New Contributor II
684 Views

What you are doing in your globalsum procedure is a cross-image reduction; while formally correct, it is quite inefficient. The particularly inefficient part is the last communication statement

if (me /= 1) this(:) = vec(:)[1]

which oversubscribes the network link to image 1. The only "good" solution is a collective call, which is not yet defined in coarray Fortran (but hopefully soon will be). For now, I think using MPI_Allreduce in its place should work (some MPI boilerplate may be needed). The alternative would be to implement the reduction manually, using all images (e.g. with a butterfly communication pattern) to reduce the amount of synchronization and avoid oversubscribing image 1.
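As a sketch only, the MPI_Allreduce variant could look like the following (assuming the dp kind and interface from the posted code, and that the Intel coarray runtime has already initialized MPI underneath and maps images one-to-one onto ranks; both are assumptions about this particular setup, not verified):

```fortran
! Hypothetical replacement for globalSum_serial using an MPI collective.
function globalSum_mpi(vec, n) result(this)
    use mpi
    complex(dp), dimension(:), intent(in) :: vec
    integer, intent(in)                   :: n
    complex(dp), dimension(n)             :: this
    integer :: ierr

    this(:) = vec(1:n)
    ! One collective reduction instead of numi-1 serialized remote reads;
    ! MPI_IN_PLACE reuses the result buffer as the send buffer.
    call MPI_Allreduce(MPI_IN_PLACE, this, n, MPI_DOUBLE_COMPLEX, &
                       MPI_SUM, MPI_COMM_WORLD, ierr)
end function
```

Note that this also removes the explicit sync all barriers from the reduction, since MPI_Allreduce is itself a synchronization point.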

     

Cheers

Reinhold

0 Kudos
Reply