Haphazard blocking with a simple program using coarrays

Arjen_Markus · 01-31-2020

I wrote a small program exploiting coarrays, to show that one image can provide the data for a bunch of others - a simple demonstration. But I noticed that running the program is not entirely reliable: quite often it works, but at times - completely unpredictably to me - it gets stuck. Killing the process and then rerunning the very same program usually solves the problem. This happens both on Windows and on Linux. When it runs, it produces the output I expect. When it does not run, it appears to be stuck right at the first SYNC ALL statement. From this I conclude that the problem may not be the program itself.

Any clues as to what might be going wrong? Or better: how to prevent this?

FYI, I am using Intel Fortran 18.0.5.274 on Windows and a similar version on Linux.

I have attached the program. To run: dwaq_dlfow 1 2 3 - the arguments are used to select the number of active images and a value to be passed.

Steve_Lionel · 01-31-2020

Based on some initial experimentation, my guess is that you have an imbalance in SYNC ALL executions, made worse by simply skipping over images that are not asked to do anything and thus don't do their own SYNC ALLs. I tried a compile with /Qcoarray-num-images:3 to match the number of workers, and it got a lot further, until one of the hydro workers declared "done", then it hung.
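To illustrate the imbalance described above, here is a minimal sketch (not the attached program) in which every image executes the same SYNC ALL statements whether or not it has work to do. The names value and n_active are hypothetical, and n_active is hard-coded here purely for illustration.

```fortran
program balanced_sync
    implicit none
    integer :: value[*]      ! one integer per image, readable by all images
    integer :: n_active      ! hypothetical: number of images that do real work
    integer :: me

    me = this_image()
    n_active = min(2, num_images())
    value = 0

    if (me == 1) value = 42   ! image 1 provides the data

    sync all                  ! executed by EVERY image, active or idle

    ! Only the active worker images use the data; idle images skip the work
    ! but not the synchronization.
    if (me > 1 .and. me <= n_active) then
        write(*, '(a,i0,a,i0)') 'image ', me, ' received ', value[1]
    end if

    sync all                  ! idle images take part here as well
end program balanced_sync
```

The point of the sketch is that the SYNC ALL statements sit outside the IF block that selects the active images. If they were inside it, the idle images would never reach them and the active ones could wait forever.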

I haven't studied your code in detail, but you might be better served by using SYNC MEMORY rather than SYNC ALL in some places if you just want to make sure that updates from other images are recognized.

I'd also suggest not relying on COMMAND_ARGUMENT_COUNT and GET_COMMAND_ARGUMENT working in images other than 1. They do here, but I don't think the standard requires it. Typically, command processing is done in image 1 which then sets up other images and releases them to execute.
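A possible way to keep command-line processing on image 1 is sketched below, assuming a single integer argument. The variable n_active is a hypothetical stand-in for whatever the real program reads from its arguments.

```fortran
program args_on_image_one
    implicit none
    integer :: n_active[*]          ! hypothetical setting, filled in by image 1
    character(len=32) :: arg
    integer :: ierr

    if (this_image() == 1) then
        n_active = 1                ! default when no usable argument is given
        if (command_argument_count() >= 1) then
            call get_command_argument(1, arg)
            read(arg, *, iostat=ierr) n_active
            if (ierr /= 0) n_active = 1
        end if
    end if

    sync all                        ! image 1's value is now defined for everyone

    ! Every other image copies the value from image 1 into its own variable.
    if (this_image() /= 1) n_active = n_active[1]

    write(*, '(a,i0,a,i0)') 'image ', this_image(), ': n_active = ', n_active
end program args_on_image_one
```

Only image 1 touches the command line. The other images wait at the SYNC ALL and then fetch the value with a coindexed reference, so nothing depends on how the runtime passes arguments to images other than 1.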

Arjen_Markus · 01-31-2020

Ah, I let myself be fooled by the fact that it worked without a problem in the beginning ;). Thanks for the tips. I will try to improve it along the suggested lines.

Arjen_Markus · 02-10-2020

Finally got around to implementing Steve's suggestion (at least as far as the idle images are concerned): that did the trick - I have run the adjusted program at least a dozen times on both platforms and there was not a single failure.

Steve_Lionel · 02-10-2020

Glad to hear it.

SYNC ALL and SYNC IMAGES statements do not progress until each participating image has executed the matching statement the same number of times. There are also some implied synchronization points. The standard lists all of the "image control statements":

Execution of an image control statement divides the execution sequence on an image into segments. Each of the following is an image control statement:
• SYNC ALL statement;
• SYNC IMAGES statement;
• SYNC MEMORY statement;
• SYNC TEAM statement;
• ALLOCATE or DEALLOCATE statement that has a coarray allocate-object;
• CHANGE TEAM or END TEAM statement (11.1.5);
• CRITICAL or END CRITICAL statement (11.1.6);
• EVENT POST or EVENT WAIT statement;
• FORM TEAM statement;
• LOCK or UNLOCK statement;
• any statement that completes execution of a block or procedure and which results in the implicit deallocation of a coarray;
• a CALL statement that references the intrinsic subroutine MOVE_ALLOC with coarray arguments;
• STOP statement;
• END statement of a main program.
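As an example of the matching rule for SYNC IMAGES, the sketch below pairs image 1 with each of the other images, without requiring those other images to synchronize with each other. The variable payload and its value are made up for illustration.

```fortran
program sync_images_pairing
    implicit none
    real :: payload[*]

    if (this_image() == 1) then
        payload = 3.14
        sync images(*)          ! image 1 pairs with every other image
    else
        sync images(1)          ! each other image pairs with image 1 only
        write(*, '(a,i0,a,f6.2)') 'image ', this_image(), ' sees ', payload[1]
    end if
end program sync_images_pairing
```

Each SYNC IMAGES(1) on a worker image is matched by the single SYNC IMAGES(*) on image 1, so the execution counts stay balanced per pair. If image 1 skipped its statement, or a worker executed its own twice, the unmatched side would block, which is the same kind of hang as discussed earlier in this thread.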