Hi,
Is there a way to resolve coarray sync issues with VTune? Is there perhaps
a tutorial for this?
Thank you
Jan
Please describe the problem in more detail. VTune Amplifier XE won't help with coarray issues as MPI is used. Intel Trace Analyzer and Collector can help with this.
Dear Steve,
It is a transpose-free QMR solver. I want to use coarrays to parallelize the matrix-vector product.
Unfortunately, with more than one image the calculation slows down.
This is the matrix-vector product routine:
subroutine matvec(a,x,y)
  type(coo_matrix), intent(in) :: a
  complex(dp), dimension(:), intent(in) :: x
  complex(dp), dimension(size(x,1)), intent(out) :: y
  complex(dp), allocatable, dimension(:) :: tmp[:]
  integer :: i, me, numi
  me = this_image()
  numi = num_images()
  !allocate
  allocate(tmp(a%n)[*])
And the sum function is:
function globalSum_serial(vec,n) result(this)
  complex(dp), dimension(:), intent(inout) :: vec
I compile this with ifort 15.0.1 using only -coarray.
With a big matrix, the more images I use, the slower the matvec routine becomes.
I did a basic hotspot analysis with VTune, and it reports that
ICAF_BARRIER and ICAF_UNLOC are the code segments that take the most time.
Thank you
Jan
What happens if you replace:
vec(:)[1] = vec(:)[1] + vec(:)
with:
vec(:) = vec(:) + vec(:)
?
I tried it but it doesn't help ... the calculation speed is the same as before.
Thank you
Jan
Is VTune telling you where those synchronization calls are coming from in your code? Trace Analyzer and Collector's timeline display can be helpful in understanding what is happening.
What you are doing in your globalsum procedure is a cross-image reduction; while this is formally correct, it is quite inefficient. The statement that is particularly inefficient is the last communication statement
if (me /= 1) this(:) = vec(:)[1]
which oversubscribes the network link to image 1. The only "good" solution to this is using a collective call, which presently is not yet defined for coarray Fortran (but hopefully soon will be). For now, I think using MPI_Allreduce in its place should work (some MPI boilerplate may be needed). The alternative would be to implement the reduction manually, using all images (e.g. with a butterfly communication pattern) to reduce the amount of synchronization and avoid oversubscription.
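To illustrate the butterfly (recursive-doubling) pattern mentioned above: with P images, it needs only log2(P) pairwise-exchange rounds instead of P-1 serialized writes to image 1, and no single image's link is oversubscribed. The communication details aside, the pattern itself can be sketched in plain Python (a hypothetical simulation, not the original globalSum code; it assumes P is a power of two):

```python
# Simulation of a recursive-doubling (butterfly) all-reduce.
# Each "image" holds a local vector; after log2(P) rounds in which
# image 'me' exchanges with partner 'me XOR step', every image holds
# the global sum.
def butterfly_allreduce(local_vectors):
    p = len(local_vectors)                      # number of images
    assert p & (p - 1) == 0, "P must be a power of two for this sketch"
    vecs = [list(v) for v in local_vectors]     # work on copies
    step = 1
    while step < p:
        new_vecs = []
        for me in range(p):
            partner = me ^ step                 # differs in one bit
            new_vecs.append([a + b for a, b in zip(vecs[me], vecs[partner])])
        vecs = new_vecs
        step *= 2
    return vecs

# Four images, each contributing [image_index, 1]:
result = butterfly_allreduce([[0, 1], [1, 1], [2, 1], [3, 1]])
print(result[0])   # [6, 4] -- and every other image holds the same sum
```

In coarray Fortran, each round would be one cross-image get plus a `sync images` with the partner, so the synchronization is pairwise rather than a global bottleneck on image 1.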
Cheers
Reinhold