Intel® Fortran Compiler

Coarray sync problems

Jan_W_2 — Wed, 13 May 2015 08:35:17 GMT

Hi,

Is there a way to resolve coarray sync issues with VTune? Is there maybe a tutorial for this?

Thank you

Jan

Steven_L_Intel1 — Wed, 13 May 2015 13:37:47 GMT

Please describe the problem in more detail. VTune Amplifier XE won't help with coarray issues because the coarray implementation uses MPI underneath. Intel Trace Analyzer and Collector can help with this.

Jan_W_2 — Wed, 13 May 2015 15:36:22 GMT

Dear Steve,

It is a transpose-free QMR solver. I want to use coarrays to parallelize the matrix-vector product.

Unfortunately, the calculation slows down as soon as I use more than one image.

This is the matrix-vector product routine:

subroutine matvec(a,x,y)
      type(coo_matrix), intent(in)       :: a
      complex(dp), dimension(:), intent(in) :: x
      complex(dp), dimension(size(x,1)), intent(out) :: y
      complex(dp), allocatable,dimension(:)  :: tmp[:]
      integer :: i, me, numi

      me   = this_image()
      numi = num_images()
!allocate 
      allocate(tmp(a%n)[*])
      y   = cmplx(0.0_dp, 0.0_dp,dp)
      tmp = cmplx(0.0_dp, 0.0_dp,dp)
!sync
      sync all
!Add locally all
      do i = 1, a%loc_nnz
        tmp(a%ir(i)) = tmp(a%ir(i)) + (a%val(i) * x(a%jc(i)))
      end do
!sync
      sync all
!Sum all coarrays together
      y = globalSum_serial(tmp, a%n)
!deallocate
      deallocate(tmp)
end subroutine

And the sum function is:

function globalSum_serial(vec,n) result(this)
    complex(dp), dimension(:), intent(inout) :: vec[*]
    integer, intent(in) :: n
    complex(dp), dimension(n) :: this
    integer :: i, me, numi

    me   = this_image()
    numi = num_images()

    sync all

    if(me ==  1) then
      do i = 2, numi
        vec(:)[1] = vec(:)[1] + vec(:)[i]
      end do
      this(:) = vec(:)
    end if

    sync all
    if(me /= 1) this(:) = vec(:)[1]
    sync all
end function

I compile this with ifort 15.0.1 using only -coarray.

With a big matrix, the matvec routine gets slower the more images I use.

I did a basic hotspot analysis with VTune, and it says that ICAF_BARRIER and ICAF_UNLOC are the code segments that take the most time.

Thank you,

Jan

Steven_L_Intel1 — Wed, 13 May 2015 17:47:00 GMT

What happens if you replace:

vec(:)[1] = vec(:)[1] + vec(:)[i]

with:

vec(:) = vec(:) + vec(:)[i]

Jan_W_2 — Thu, 14 May 2015 06:23:00 GMT

I tried it, but it doesn't help; the calculation speed is the same as before.

Thank you

Jan

Steven_L_Intel1 — Thu, 14 May 2015 12:17:38 GMT

Is VTune telling you where those synchronization calls are coming from in your code? Trace Analyzer and Collector's timeline display can be helpful in understanding what is happening.

Jan_W_2 — Mon, 25 May 2015 09:14:01 GMT

Dear Steve,

I checked the VTune analysis again, and it tells me that the barriers are in the globalSum_serial function.

I also measured how long each image needs for the summation. They all need more or less the same time.
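
One way such per-image timings could be collected is sketched below. This is not the code used for the measurement above; the program name, the dummy workload and all variable names are placeholders. Each image stops its clock before it reaches the barrier, so time spent waiting for other images is not folded into the number.

```fortran
program time_per_image
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
  integer(int64) :: t0, t1, rate
  real(real64)   :: elapsed[*]   ! coarray scalar, so image 1 can read every image's time
  real(real64)   :: acc
  integer        :: i, img

  call system_clock(count_rate=rate)

  sync all                       ! start all images from a common point
  call system_clock(t0)

  ! Placeholder workload; replace with the section to be measured.
  acc = 0.0_real64
  do i = 1, 10000000
    acc = acc + sqrt(real(i, real64))
  end do

  call system_clock(t1)          ! stop the clock before the barrier
  elapsed = real(t1 - t0, real64) / real(rate, real64)

  sync all                       ! every image has stored its time
  if (this_image() == 1) then
    do img = 1, num_images()
      print '(a,i0,a,es10.3,a)', 'image ', img, ': ', elapsed[img], ' s'
    end do
    print '(a,es10.3)', 'checksum (keeps the loop from being optimized away): ', acc
  end if
end program time_per_image
```

If the local times are similar on all images but the routine still slows down, the cost is more likely in the communication and barriers than in the arithmetic.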

Thank you,

Jan

reinhold-bader — Wed, 27 May 2015 09:26:51 GMT

What you are doing in your globalsum procedure is a cross-image reduction; while this is formally correct, it is quite inefficient. The statement that is particularly inefficient is the last communication statement

if(me /= 1) this(:) = vec(:)[1]

which oversubscribes the network link to image 1. The only "good" solution to this is using a collective call, which presently is not yet defined for coarray Fortran (but hopefully soon will be). For now, I think using MPI_Allreduce in its place should work (some MPI boilerplate may be needed). The alternative would be to implement the reduction manually, using all images (e.g. with a butterfly communication pattern) to reduce the amount of synchronization and avoid oversubscription.
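
Collective subroutines such as co_sum were later standardized in Fortran 2018. A minimal sketch of the matrix-vector product using one, assuming a compiler that supports co_sum: the module name, the routine name matvec_collective, the kind parameter dp and the layout of coo_matrix are assumptions inferred from the fields the posted routine uses, not the original definitions.

```fortran
module coo_matvec_collective
  implicit none
  integer, parameter :: dp = kind(1.0d0)   ! assumed kind parameter

  ! Assumed layout: only the components referenced in the posted matvec.
  type :: coo_matrix
    integer :: n = 0                        ! matrix dimension
    integer :: loc_nnz = 0                  ! nonzeros held by this image
    integer, allocatable     :: ir(:)       ! row indices
    integer, allocatable     :: jc(:)       ! column indices
    complex(dp), allocatable :: val(:)      ! nonzero values
  end type coo_matrix

contains

  subroutine matvec_collective(a, x, y)
    type(coo_matrix), intent(in) :: a
    complex(dp), intent(in)      :: x(:)
    complex(dp), intent(out)     :: y(size(x))
    integer :: i

    ! Accumulate this image's partial product directly in y.
    ! No coarray temporary is needed, so there is no implicit
    ! synchronization from allocate/deallocate either.
    y = cmplx(0.0_dp, 0.0_dp, dp)
    do i = 1, a%loc_nnz
      y(a%ir(i)) = y(a%ir(i)) + a%val(i) * x(a%jc(i))
    end do

    ! Sum the partial results over all images; without a
    ! result_image argument every image receives the total.
    call co_sum(y)
  end subroutine matvec_collective

end module coo_matvec_collective
```

The single co_sum call takes the place of the three sync all statements and the coindexed copies in globalSum_serial, and leaves the choice of communication pattern to the runtime.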

Cheers

Reinhold
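
The manual alternative mentioned in the post above, a butterfly (recursive-doubling) exchange, might look like the following. It is a sketch under stated assumptions rather than a tuned implementation: the number of images must be a power of two, and the module name, the routine name global_sum_butterfly and the kind parameter dp are hypothetical.

```fortran
module butterfly_reduce
  implicit none
  integer, parameter :: dp = kind(1.0d0)   ! assumed kind parameter
contains

  ! In-place all-reduce (sum) of vec over all images.
  subroutine global_sum_butterfly(vec)
    complex(dp), intent(inout) :: vec(:)[*]
    complex(dp) :: buf(size(vec))           ! local copy of the partner's data
    integer :: me, numi, partner, step

    me   = this_image()
    numi = num_images()
    ! Assumption of this sketch: 2**k images.
    if (iand(numi, numi - 1) /= 0) error stop 'butterfly sketch needs 2**k images'

    step = 1
    do while (step < numi)
      ! Partner differs from this image in exactly one bit of the 0-based index.
      partner = ieor(me - 1, step) + 1

      sync images (partner)        ! partner has finished its previous update
      buf(:) = vec(:)[partner]     ! pull the partner's partial sum
      sync images (partner)        ! both have read before either overwrites vec
      vec(:) = vec(:) + buf(:)

      step = 2 * step
    end do
  end subroutine global_sum_butterfly

end module butterfly_reduce

program demo_butterfly
  use butterfly_reduce
  implicit none
  complex(dp), allocatable :: v(:)[:]

  allocate(v(4)[*])
  v = cmplx(real(this_image(), dp), 0.0_dp, dp)   ! image k contributes k
  call global_sum_butterfly(v)
  if (this_image() == 1) print *, 'sum on image 1:', v(1)
end program demo_butterfly
```

Each image talks to a single partner per stage and only synchronizes with that partner, so no image becomes a hotspot the way image 1 does in globalSum_serial, and the number of stages grows with the logarithm of the image count.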