Intel® Fortran Compiler

Coarrays

OP1
New Contributor III

I am wondering if coarrays could be used for the following scenario:

I have a subroutine that calculates Y=F(X), X and Y being vectors of dimension, say, 10000. F is called hundreds or thousands of times in an iterative process.

F has properties such that the calculation of Y(i) is completely independent of the other calculations; in other words, we can write Y(i) = Fi(X(i)), where Fi is a specific function (a distinct subroutine) which may not be related at all to Fj, i /= j. Each individual calculation Y(i) = Fi(X(i)) is already fairly substantial, so we can for now neglect communication overheads between images. In practice, there may only be a few dozen such individual "basic" functions Fi, which would be applied to all terms of X.
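
Schematically, the per-element dispatch would look something like the sketch below (F1, F2 and apply_F are illustrative stand-ins, not the real routines):

module basic_funcs
   implicit none
contains
   subroutine F1(x, y)              ! stand-in for one "basic" function
      double precision, intent(in)  :: x
      double precision, intent(out) :: y
      y = x*x
   end subroutine F1
   subroutine F2(x, y)              ! stand-in for another
      double precision, intent(in)  :: x
      double precision, intent(out) :: y
      y = exp(x)
   end subroutine F2
   subroutine apply_F(id, x, y)     ! y = Fid(x), selected by id
      integer, intent(in) :: id
      double precision, intent(in)  :: x
      double precision, intent(out) :: y
      select case (id)
      case (1)
         call F1(x, y)
      case (2)
         call F2(x, y)
      end select
   end subroutine apply_F
end module basic_funcs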

The basic idea is the following:

1. The master thread provides X to F.
2. A bunch of slave images are assigned the calculation of the Y(i); a task stack is used so that the images that complete faster than the others can grab another calculation for another element of X.
3. The master thread collects the components of Y as they are calculated.

It's nice on paper, but the synchronization mechanisms required seem a bit tedious to implement. I am wondering if anybody has started looking into similar applications of coarrays. Ideally, this would be used on a cluster with a large number of nodes.

Thanks!

jimdempseyatthecove
Honored Contributor III

I would suggest starting with OpenMP

!$OMP PARALLEL DO
DO I=1,NY
  Y(I) = F(X(I))
END DO
!$OMP END PARALLEL DO

Then, if the SMP box shows insufficient performance, consider redefining Y and X in ranked subsections:

NYranked = (NY + NumberOfRanks - 1) / NumberOfRanks

Each image then has an Xranked[...] and Yranked[...] of dimension 1:NYranked, where the last image may be only partially filled. When the last rank is not fully filled, the load can be balanced by taking the last entry from some of the first ranks. Example: 4 ranks, NY = 13:

4,4,4,1 (no borrowing)
4,3,3,3 (borrowing)
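
In code, the "borrowing" distribution can be computed per image along these lines (a minimal sketch; NYbase, NYextra and myCount are illustrative names):

program ranked_counts
   implicit none
   integer, parameter :: NY = 13
   integer :: NumberOfRanks, NYbase, NYextra, myCount
   NumberOfRanks = num_images()
   NYbase  = NY / NumberOfRanks        ! every image gets at least this many
   NYextra = mod(NY, NumberOfRanks)    ! leftovers borrowed by the first images
   if (this_image() <= NYextra) then
      myCount = NYbase + 1             ! the first NYextra images take one extra
   else
      myCount = NYbase
   end if
   print *, 'image', this_image(), 'handles', myCount, 'elements'
end program ranked_counts

Run with 4 images and NY = 13, this yields the 4,3,3,3 split.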

You should be able to write both programs relatively quickly and see which performs better.

Jim Dempsey

OP1
New Contributor III

Thanks Jim,

The implementation has to be either coarray or MPI - not OpenMP. Right now the strategy envisioned is to:

1. Define a series of modules defining all 'task types' that can be performed at any given time by any image, with task-specific coarrays (derived types) used to communicate inputs and results back and forth between slave images and the master image (a sketch of such a type follows this list).
2. Use the master image to initialize the program data, then request initialization of the coarrays on all images (allocation of task-specific coarrays, etc.).
3. Create a task stack where images can retrieve tasks and the associated data; notify the master image that a task is done and that the data can be retrieved; put additional tasks on the stack.
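
For item 1, a task-specific coarray could be a derived type along these lines (a rough sketch; all names are made up):

module task_comm
   implicit none
   type task_io
      integer :: task_id = 0          ! which Fi to apply
      integer :: elem    = 0          ! which element of X/Y
      double precision :: x, y        ! input and result
      logical :: done = .false.       ! set by the worker, polled by the master
   end type task_io
   type(task_io) :: slot[*]           ! one communication slot per image
end module task_comm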

The synchronization mechanism between the master image and the slave images is not trivial at all. This is somewhat akin to implementing an event-driven programming approach on top of the very linear, traditional Fortran model, and I am wondering if this is stretching a bit what coarrays can do. I think it can be done, but it does not seem straightforward.

Steven_L_Intel1
Employee

The structure you describe is one common way to use coarrays. As with MPI, you want to have the individual workers do a lot of work for a given amount of data, as moving the data around can be slow. You'd probably want to use LOCK/UNLOCK or a CRITICAL section to protect the work list while it is being manipulated.
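
For example, a shared task counter guarded by LOCK/UNLOCK might look roughly like this (next_task and work_lock are illustrative names, not a prescribed API):

program work_list
   use iso_fortran_env, only: lock_type
   implicit none
   integer, parameter :: NTASKS = 10000
   type(lock_type) :: work_lock[*]    ! the lock on image 1 guards the counter
   integer :: next_task[*], mytask
   if (this_image() == 1) next_task = 1
   sync all
   do
      lock (work_lock[1])             ! serialize access to the shared counter
      mytask = next_task[1]
      next_task[1] = mytask + 1
      unlock (work_lock[1])
      if (mytask > NTASKS) exit
      ! ... perform task mytask, e.g. compute Y(mytask) = F(X(mytask)) ...
   end do
   sync all
end program work_list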

jimdempseyatthecove
Honored Contributor III

A task stack is one way; a task ring buffer is another. If you structure the ring buffers as single-producer/single-consumer, you won't need LOCK/UNLOCK or CRITICAL. You will need, though, to structure the ring buffer carefully so that a consumer image does not begin processing a task message until all of it has been transferred. This can be done as simply as having the producer NOT bump the fill index until after a sync that follows the message copy to the image. The consumer need not sync on every task message, only on the last one or when the input ring buffer appears empty. The sync responsibilities for the output ring buffer are flipped around, since the consumer of the do-work tasks is the producer of the work done.
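
A rough sketch of such a single-producer/single-consumer ring follows. It uses Fortran 2008 atomics (ATOMIC_DEFINE/ATOMIC_REF) for the fill and take indices rather than explicit image syncs, and for simplicity it orders every message rather than only the last one; every name in it (spsc_ring, task_msg, produce, consume, RINGSZ) is illustrative, and details will vary by compiler.

module spsc_ring
   use iso_fortran_env, only: atomic_int_kind
   implicit none
   integer, parameter :: RINGSZ = 64
   type task_msg
      integer :: id = 0
      double precision :: x = 0.d0
   end type task_msg
   type(task_msg) :: buf(RINGSZ)[*]             ! one ring per consumer image
   integer(atomic_int_kind) :: fill_idx[*] = 0  ! bumped only by the producer
   integer(atomic_int_kind) :: take_idx[*] = 0  ! bumped only by the consumer
contains
   subroutine produce(img, msg)      ! runs on the producer image
      integer, intent(in) :: img
      type(task_msg), intent(in) :: msg
      integer(atomic_int_kind) :: nfill, ntake
      integer :: slot
      call atomic_ref(nfill, fill_idx[img])
      do                                         ! wait while the ring is full
         call atomic_ref(ntake, take_idx[img])
         if (nfill - ntake < RINGSZ) exit
      end do
      slot = mod(int(nfill), RINGSZ) + 1
      buf(slot)[img] = msg                       ! copy the message first...
      sync memory                                ! ...complete the copy...
      call atomic_define(fill_idx[img], nfill + 1)  ! ...then publish it
   end subroutine produce
   subroutine consume(msg)           ! runs on the consumer image
      type(task_msg), intent(out) :: msg
      integer(atomic_int_kind) :: nfill, ntake
      integer :: slot
      call atomic_ref(ntake, take_idx)
      do                                         ! spin until a message appears
         call atomic_ref(nfill, fill_idx)
         if (nfill > ntake) exit
      end do
      sync memory                                ! order the buf read after the index read
      slot = mod(int(ntake), RINGSZ) + 1
      msg = buf(slot)
      call atomic_define(take_idx, ntake + 1)
   end subroutine consume
end module spsc_ring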

While you can pass MPI messages in lock-step, I believe performance will be better as outlined above. Run tests to assure reliability and performance.

Jim Dempsey
