Gather strategy for coarrays

OP1 · ‎04-15-2015

In the trivial code below, all images perform some task whose output needs to be gathered on the first image (for further processing, for instance). Two strategies are implemented: in the first, image 1 goes and fetches the results from the other images; in the second, all images write to the local coarray of image 1.

My questions are the following:

Is one of these two approaches more efficient than the other? Intuitively one might say (2) ... but on the other hand nothing is really intuitive or obvious with parallel codes...
Is the second one even standard compliant? In particular, I have in mind this paragraph in the Intel Compiler Fortran 14 documentation: "If a variable is defined on an image in a segment, it must not be referenced, defined, or become undefined in a segment on another image unless the segments are ordered.". Line 36 of the code (the assignment to A[1]%GLOBAL_RESULTS(THIS_IMAGE())) is executed by all images in segments that are not ordered with respect to one another, and yet these images all define the same array component of the same coarray variable... but not the same array element... In other words, if the programmer is careful to avoid overlap (that is, images writing by mistake to the same location) this *should* work... but maybe I am just lucky here (as it seems to work)?

Comments and suggestions would be very much appreciated.
Thanks!

PROGRAM P

! Declarations.
IMPLICIT NONE
TYPE T_DATA_CONTAINER
    INTEGER,ALLOCATABLE :: GLOBAL_RESULTS(:)
    INTEGER :: LOCAL_RESULTS
END TYPE T_DATA_CONTAINER
TYPE(T_DATA_CONTAINER) :: A[*]
INTEGER :: I

! Image 1 allocates A%GLOBAL_RESULTS for storage of results from all images.
IF (THIS_IMAGE()==1) ALLOCATE(A%GLOBAL_RESULTS(NUM_IMAGES()))
SYNC ALL

! All images do some work and fill the coarray used for communication.
A%LOCAL_RESULTS = THIS_IMAGE()
SYNC ALL

! 1st data collection strategy: image 1 gathers data from all other images.
IF (THIS_IMAGE()==1) THEN
    DO I=1,NUM_IMAGES()
        A%GLOBAL_RESULTS(I) = A[I]%LOCAL_RESULTS
    END DO
END IF
IF (THIS_IMAGE()==1) WRITE(*,*) 'RESULTS: ',A%GLOBAL_RESULTS
SYNC ALL

! All images do some work and fill the image-specific portion of their local coarray.
A%LOCAL_RESULTS = THIS_IMAGE()*2
SYNC ALL

! 2nd data collection strategy: images write simultaneously to the local coarray
! component on image 1. 
! Care must be taken to avoid distinct images writing to the same elements!
A[1]%GLOBAL_RESULTS(THIS_IMAGE()) = A%LOCAL_RESULTS
SYNC ALL
IF (THIS_IMAGE()==1) WRITE(*,*) 'RESULTS: ',A%GLOBAL_RESULTS

END PROGRAM P

Steven_L_Intel1 · ‎04-17-2015

Generally I see coarray programs take the path of having image 1 collect all the results. My feeling is that this is better than having other images update image 1. Note that it is best if you can do fewer cross-image transactions - read or write array sections rather than individual elements if possible.
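
The advice about array sections can be illustrated with a minimal sketch. It assumes each image produces a fixed number N of results (N, LOCAL_RESULTS and GATHERED are hypothetical names, not taken from the code above) and that image 1 collects them. Each remote image is referenced once, for a whole section, instead of once per element:

```fortran
PROGRAM GATHER_SECTIONS
    IMPLICIT NONE
    INTEGER, PARAMETER :: N = 4             ! hypothetical number of results per image
    INTEGER :: LOCAL_RESULTS(N)[*]          ! coarray: one block of N results on every image
    INTEGER, ALLOCATABLE :: GATHERED(:,:)   ! ordinary local array, used on image 1 only
    INTEGER :: I, K

    ! Every image fills its own block (stand-in for real work).
    LOCAL_RESULTS = [(THIS_IMAGE()*100 + K, K = 1, N)]
    SYNC ALL   ! results must be defined before image 1 references them

    IF (THIS_IMAGE() == 1) THEN
        ALLOCATE(GATHERED(N, NUM_IMAGES()))
        DO I = 1, NUM_IMAGES()
            ! One cross-image transaction per image: the whole section at once,
            ! rather than N separate element-by-element remote reads.
            GATHERED(:, I) = LOCAL_RESULTS(:)[I]
        END DO
        WRITE(*,*) 'RESULTS: ', GATHERED
    END IF
END PROGRAM GATHER_SECTIONS
```

Whether the section transfer is actually faster than element-wise access depends on the compiler and runtime, so it is worth measuring.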

OP1 · ‎04-17-2015

Thanks for sharing your insights on this Steve. I will do timing tests to assess the best option out of the two; I was more concerned about the possibility that (2) was illegal - from your answer I understand it's ok but maybe not the best in terms of efficiency.
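
A timing test along these lines could be sketched as follows. This is only an assumption about how one might set it up: it uses plain integer coarrays instead of the derived type from the first post, the repetition count NREP is arbitrary, and each measurement includes the cost of the SYNC ALL statements.

```fortran
PROGRAM TIME_GATHER
    USE, INTRINSIC :: ISO_FORTRAN_ENV, ONLY: INT64
    IMPLICIT NONE
    INTEGER, PARAMETER :: NREP = 1000              ! hypothetical repetition count
    INTEGER, ALLOCATABLE :: GLOBAL_RESULTS(:)[:]   ! gather buffer, used on image 1
    INTEGER :: LOCAL_RESULTS[*]
    INTEGER :: I, R
    INTEGER(INT64) :: T0, T1, T2, RATE

    ALLOCATE(GLOBAL_RESULTS(NUM_IMAGES())[*])      ! collective allocation
    LOCAL_RESULTS = THIS_IMAGE()
    SYNC ALL

    ! Strategy 1: image 1 reads from every image.
    CALL SYSTEM_CLOCK(T0, RATE)
    DO R = 1, NREP
        IF (THIS_IMAGE() == 1) THEN
            DO I = 1, NUM_IMAGES()
                GLOBAL_RESULTS(I) = LOCAL_RESULTS[I]
            END DO
        END IF
        SYNC ALL
    END DO
    CALL SYSTEM_CLOCK(T1)

    ! Strategy 2: every image writes its own element on image 1.
    DO R = 1, NREP
        GLOBAL_RESULTS(THIS_IMAGE())[1] = LOCAL_RESULTS
        SYNC ALL
    END DO
    CALL SYSTEM_CLOCK(T2)

    IF (THIS_IMAGE() == 1) THEN
        WRITE(*,*) 'STRATEGY 1 (S): ', REAL(T1 - T0) / REAL(RATE)
        WRITE(*,*) 'STRATEGY 2 (S): ', REAL(T2 - T1) / REAL(RATE)
    END IF
END PROGRAM TIME_GATHER
```

The numbers will vary with the number of images, the message size, and whether the images run in shared or distributed memory.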

Although I haven't tried to kick the tires with the latest beta yet (16), the Intel coarray implementation is slowly but surely gathering momentum (there are still bugs and performance issues of course) and this is great. Now we just need teams, events, and parallel I/O asap, ha ha :-) , as this would simplify considerably (considerably!) code architecture.

David_DiLaura1 · ‎04-17-2015

I have worked with and tested coarray programs that produce and need to gather up large amounts of computed results. I have found direct communication between coarray program images to be significantly slower than the (clumsy-at-first-glance) method of having each image write its data to a throw-away temporary binary file and then having image 1 open and read each temp file and process the accumulated data as required. Image 1 is sync'd with the others at the point just after they've written and closed their temp files.

The differences in execution time I have observed are significant -- though they probably depend on local machine particulars and the amount of inter-image data that needs to be shared. Nevertheless, you may want to try this to see if you get the same performance that I have observed.

David
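
A minimal sketch of the temporary-file approach described above might look like this. The file name stem 'gatherQQ', the result count N, and the use of unformatted stream access are assumptions made for illustration. The sketch also assumes that image 1 can see the files written by the other images, which is not guaranteed on every system.

```fortran
PROGRAM GATHER_VIA_FILES
    IMPLICIT NONE
    INTEGER, PARAMETER :: N = 4       ! hypothetical number of results per image
    INTEGER :: RESULTS(N), GATHERED(N)
    INTEGER :: I, U
    CHARACTER(LEN=32) :: FNAME

    RESULTS = THIS_IMAGE()            ! stand-in for real work

    ! Each image writes its own uniquely named temporary file.
    WRITE(FNAME, '(A,I0)') 'gatherQQ', THIS_IMAGE()
    OPEN(NEWUNIT=U, FILE=TRIM(FNAME), FORM='UNFORMATTED', ACCESS='STREAM', &
         STATUS='REPLACE', ACTION='WRITE')
    WRITE(U) RESULTS
    CLOSE(U)

    SYNC ALL   ! every file has been written and closed past this point

    IF (THIS_IMAGE() == 1) THEN
        DO I = 1, NUM_IMAGES()
            WRITE(FNAME, '(A,I0)') 'gatherQQ', I
            OPEN(NEWUNIT=U, FILE=TRIM(FNAME), FORM='UNFORMATTED', &
                 ACCESS='STREAM', STATUS='OLD')
            READ(U) GATHERED
            CLOSE(U, STATUS='DELETE') ! throw the temporary file away
            WRITE(*,*) 'IMAGE ', I, ': ', GATHERED
        END DO
    END IF
END PROGRAM GATHER_VIA_FILES
```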

Steven_L_Intel1 · ‎04-17-2015

I think your second approach is legal, just not optimal.

IanH · ‎04-18-2015

David DiLaura wrote:
...I have found direct communication between coarray program images to be significantly slower than the (clumsy-at-first-glance) method of having each image write its data to a throw-away temporary binary file and then having image 1 open and read each temp file and process the accumulated data as required...

Note that "whether a named file on one image is the same as a file with the same name on another image is processor dependent". Consequently I don't think the approach described above is portable.

David_DiLaura1 · ‎04-18-2015

Ian,

Each file must be uniquely named. In my work I generate a name peculiar to the project and end it with 'QQ' (as is typically done to keep file names from coinciding with common words/phrases), and then the image number is converted to character(s) and appended to the end of the file name. Each image thus has its own file, which also helps efficiency (somewhat).

David

IanH · ‎04-18-2015

That fragment of standard text means that there is no guarantee that a file operated on by one image is accessible from another.

In terms of implementation, there is no requirement that the images all be executing on machines that all see the same file system.

Equally there is no requirement that all images be executing on machines with completely isolated file systems. Uniqueness of the names of files being written to is hence required for portability, but it is not sufficient for connecting to files across images.

Michael_S_17 · ‎05-08-2015

Hi,

I believe the writing strategy (your second one) could be advantageous in other situations, when you would like to buffer the transferred values in the PGAS memory (coarrays) of a remote image (in your example above, image 1 would be that remote image). As far as my current understanding goes, if you fetch a value from a remote image, your current image has to wait until it has received that value before it can continue processing. On the other hand, if you write to a remote image, no image necessarily has to wait until the transmission has completed. I actually use this, but I don't use the values in PGAS memory (coarrays) directly in my program logic. Rather, I copy the coarray values to completely local memory (non-coarray variables) before using them in my program logic. Thus, the remote image can do some further processing while the transmission takes place.
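
The pattern described above can be sketched roughly as follows. INBOX and LOCAL_COPY are hypothetical names, and whether a remote write really proceeds without blocking the writer depends on the compiler and runtime, so the comments only mark where overlap could happen.

```fortran
PROGRAM PUT_THEN_COPY
    IMPLICIT NONE
    INTEGER, ALLOCATABLE :: INBOX(:)[:]     ! coarray buffer written to by other images
    INTEGER, ALLOCATABLE :: LOCAL_COPY(:)   ! ordinary (non-coarray) local memory
    INTEGER :: MY_RESULT

    ALLOCATE(INBOX(NUM_IMAGES())[*])        ! collective allocation, synchronizes all images
    MY_RESULT = THIS_IMAGE() * 2            ! stand-in for real work

    ! Each image writes its result into its own slot of the buffer on image 1.
    INBOX(THIS_IMAGE())[1] = MY_RESULT

    ! ... an image could continue with unrelated work here ...

    SYNC ALL   ! after this, all writes to image 1 are complete and visible

    IF (THIS_IMAGE() == 1) THEN
        ! Copy out of the coarray into purely local memory before using the data.
        ALLOCATE(LOCAL_COPY(NUM_IMAGES()))
        LOCAL_COPY = INBOX
        WRITE(*,*) 'RESULTS: ', LOCAL_COPY
    END IF
END PROGRAM PUT_THEN_COPY
```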

BTW, using derived type coarrays with static array components (instead of the allocatable one in your example above) makes image-to-image writing much easier, because you don't have to care about the allocations on remote images. Further, this should also improve performance because your coarray would become symmetric (see Aleksandar Donev's 'Rationale for Co-Arrays in Fortran 2008' for a good explanation).
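
As a sketch of that suggestion, the derived type from the first post could be given a fixed-size component. MAX_IMAGES is a hypothetical compile-time upper bound on the number of images, introduced here only for illustration:

```fortran
PROGRAM STATIC_COMPONENT
    IMPLICIT NONE
    INTEGER, PARAMETER :: MAX_IMAGES = 64   ! hypothetical upper bound on NUM_IMAGES()
    TYPE T_DATA_CONTAINER
        INTEGER :: GLOBAL_RESULTS(MAX_IMAGES) = 0   ! fixed size: same layout on every image
        INTEGER :: LOCAL_RESULTS = 0
    END TYPE T_DATA_CONTAINER
    TYPE(T_DATA_CONTAINER) :: A[*]

    IF (NUM_IMAGES() > MAX_IMAGES) ERROR STOP 'MAX_IMAGES IS TOO SMALL'

    A%LOCAL_RESULTS = THIS_IMAGE() * 2      ! stand-in for real work

    ! No allocation step is needed before writing to image 1.
    A[1]%GLOBAL_RESULTS(THIS_IMAGE()) = A%LOCAL_RESULTS
    SYNC ALL

    IF (THIS_IMAGE() == 1) WRITE(*,*) 'RESULTS: ', A%GLOBAL_RESULTS(1:NUM_IMAGES())
END PROGRAM STATIC_COMPONENT
```

The trade-off is that every image reserves storage for GLOBAL_RESULTS even though only image 1 uses it.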

best regards

michael