Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Coarray sync problems

Jan_W_2
Beginner
753 Views

Hi,

 

Is there a way to resolve coarray sync issues with VTune? Is there maybe a tutorial for this?

 

 

Thank you

Jan

 

0 Kudos
7 Replies
Steven_L_Intel1
Employee
684 Views

Please describe the problem in more detail. VTune Amplifier XE won't help with coarray issues, since coarrays are implemented on top of MPI. Intel Trace Analyzer and Collector can help with this.

0 Kudos
Jan_W_2
Beginner
684 Views

Dear Steve,

It is a transpose-free QMR solver. I want to use coarrays to parallelize the matrix-vector product.

Unfortunately, if I use more than one image the calculation slows down.

This is the matrix-vector product routine:

subroutine matvec(a,x,y)
      type(coo_matrix), intent(in)                   :: a
      complex(dp), dimension(:), intent(in)          :: x
      complex(dp), dimension(size(x,1)), intent(out) :: y
      complex(dp), allocatable, dimension(:)         :: tmp[:]
      integer :: i, me, numi

      me   = this_image()
      numi = num_images()
!allocate
      allocate(tmp(a%n)[*])
      y   = cmplx(0.0_dp, 0.0_dp, dp)
      tmp = cmplx(0.0_dp, 0.0_dp, dp)
!sync
      sync all
!Add locally all
      do i = 1, a%loc_nnz
         tmp(a%ir(i)) = tmp(a%ir(i)) + (a%val(i) * x(a%jc(i)))
      end do
!sync
      sync all
!Sum all coarrays together
      y = globalSum_serial(tmp, a%n)
!deallocate
      deallocate(tmp)
end subroutine
And the sum function is:

function globalSum_serial(vec,n) result(this)
      complex(dp), dimension(:), intent(inout) :: vec
      integer, intent(in)       :: n
      complex(dp), dimension(n) :: this
      integer :: i, me, numi

      me   = this_image()
      numi = num_images()
      sync all
      if (me == 1) then
         do i = 2, numi
            vec(:)[1] = vec(:)[1] + vec(:)
         end do
         this(:) = vec(:)
      end if
      sync all
      if (me /= 1) this(:) = vec(:)[1]
      sync all
end function
I compile this with ifort 15.0.1, using only -coarray.

With a big matrix, the matvec routine slows down the more images I use.

I did a basic hotspot analysis with VTune, and it says that ICAF_BARRIER and ICAF_UNLOC are the code segments that need the most time.

Thank you,

Jan

     

0 Kudos
Steven_L_Intel1
Employee
684 Views

What happens if you replace:

vec(:)[1] = vec(:)[1] + vec(:)

with:

vec(:) = vec(:) + vec(:)

?

0 Kudos
Jan_W_2
Beginner
684 Views

I tried it, but it doesn't help ... the calculation speed is the same as before.

Thank you

Jan

0 Kudos
Steven_L_Intel1
Employee
684 Views

Is VTune telling you where those synchronization calls are coming from in your code? Trace Analyzer and Collector's timeline display can be helpful in understanding what is happening.

0 Kudos
Jan_W_2
Beginner
684 Views

Dear Steve,

I checked the VTune analysis again, and it tells me that the barriers are in the globalSum_serial function.

I also checked how long each image needs for the summing; they all need more or less the same time.

Thank you,

Jan

0 Kudos
reinhold-bader
New Contributor II
684 Views

What you are doing in your globalsum procedure is a cross-image reduction; while formally correct, it is quite inefficient. The particularly inefficient part is the last communication statement

if (me /= 1) this(:) = vec(:)[1]

which oversubscribes the network link to image 1. The only "good" solution is a collective call, which is not yet defined in coarray Fortran (but hopefully soon will be). For now, I think using MPI_Allreduce in its place should work (some MPI boilerplate may be needed). The alternative would be to implement the reduction manually, using all images (e.g. with a butterfly communication pattern) to reduce the amount of synchronization and avoid oversubscribing image 1.
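As a sketch only, the MPI_Allreduce variant could look like the following (assuming the dp kind and interface from the posted code, and that the Intel coarray runtime has already initialized MPI underneath and maps images one-to-one onto ranks; both are assumptions about this particular setup, not verified):

```fortran
! Hypothetical replacement for globalSum_serial using an MPI collective.
function globalSum_mpi(vec, n) result(this)
    use mpi
    complex(dp), dimension(:), intent(in) :: vec
    integer, intent(in)                   :: n
    complex(dp), dimension(n)             :: this
    integer :: ierr

    this(:) = vec(1:n)
    ! One collective reduction instead of numi-1 serialized remote reads;
    ! MPI_IN_PLACE reuses the result buffer as the send buffer.
    call MPI_Allreduce(MPI_IN_PLACE, this, n, MPI_DOUBLE_COMPLEX, &
                       MPI_SUM, MPI_COMM_WORLD, ierr)
end function
```

Note that this also removes the explicit sync all barriers from the reduction, since MPI_Allreduce is itself a synchronization point.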

     

Cheers

Reinhold

0 Kudos
Reply