Intel® Fortran Compiler

No progress on coarray access until synchronization

Nathan_Weeks
Beginner
509 Views

With ifort 15.0.0 for Linux, I've encountered an issue where images >= 2 are
unable to access (read) a coarray with cosubscript 1 until image 1 encounters
a subsequent image control statement. Since I'm quite new to coarrays, I'd
like to know whether I'm misunderstanding a subtlety of permitted coarray
behavior, or whether this behavior is unexpected.

The following program demonstrates the issue:

program test_coarray
   implicit none
   integer :: i, X(2,2)[*]
   double precision :: time, val = 1.0
   if (THIS_IMAGE() == 1) then
      X = 1
      SYNC IMAGES(*)
      call CPU_TIME(time)
      write (*,*) THIS_IMAGE(), ': after first sync', time
      do i = 1,2**30
         val = val + COS(DBLE(i))
      end do
      call CPU_TIME(time)
      write (*,*) THIS_IMAGE(), ': before second sync', time
      SYNC IMAGES(*)
      call CPU_TIME(time)
      write (*,*) THIS_IMAGE(), ': after second sync', time
   else
      SYNC IMAGES(1)
      call CPU_TIME(time)
      write (*,*) THIS_IMAGE(), ': after first sync', time
      X = X[1]
      call CPU_TIME(time)
      write (*,*) THIS_IMAGE(), ': before second sync', time
      SYNC IMAGES(1)
      call CPU_TIME(time)
      write (*,*) THIS_IMAGE(), ': after second sync', time
   end if
   call cpu_time(time)
   write (*,*) THIS_IMAGE(), 'finished:', time, 'result:', SUM(X)+val
end program test_coarray

Images 2-4 apparently don't complete the read of X[1] until after image 1 has
reached the second SYNC IMAGES:

    $ ifort -coarray=shared -coarray-num-images=4 test_coarray.f90
    $ ./a.out
               1 : after first sync  9.998000000000000E-003
               2 : after first sync  1.099700000000000E-002
               3 : after first sync  1.099700000000000E-002
               4 : after first sync  1.099700000000000E-002
               1 : before second sync   14.7887510000000
               1 : after second sync   14.7887510000000
               2 : before second sync   14.7897500000000
               2 : after second sync   14.7897500000000
               2 finished:   14.7897500000000      result:   5.00000000000000
               3 : before second sync   14.7907500000000
               3 : after second sync   14.7907500000000
               4 : before second sync   14.7897500000000
               4 : after second sync   14.7897500000000
               4 finished:   14.7897500000000      result:   5.00000000000000
               1 finished:   14.7887510000000      result:   4.32834934995579
               3 finished:   14.7907500000000      result:   5.00000000000000

     

    6 Replies
    pbkenned1
    Employee

    I don't see any control synchronization issues with ifort 15.0.1.  Perhaps what you observed was by chance?  The order of syncing/finishing is indeterminate, but occasionally images 2-4 will make progress before image 1 reaches the second sync.

     

    [U538012]$ ifort -V
    Intel(R) Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 15.0.1.133 Build 20141023
    Copyright (C) 1985-2014 Intel Corporation.  All rights reserved.

    [pbkenned@dpdmic09 U538012]$ ifort -coarray=shared -coarray-num-images=4 U538012.f90 -o U538012.x
    [pbkenned@dpdmic09 U538012]$ ./U538012.x
               3 : after first sync  8.997000000000000E-003
               4 : after first sync  8.998000000000001E-003
               1 : after first sync  1.199700000000000E-002
               2 : after first sync  1.099800000000000E-002
               1 : before second sync   13.1350020000000
               2 : before second sync   13.1370010000000
               2 : after second sync   13.1370010000000
               3 : before second sync   13.1350020000000
               3 : after second sync   13.1350020000000
               4 : before second sync   13.1340020000000
               4 : after second sync   13.1340020000000
               4 finished:   13.1340020000000      result:   5.00000000000000
               1 : after second sync   13.1350020000000
               1 finished:   13.1350020000000      result:   4.32834934995579
               2 finished:   13.1370010000000      result:   5.00000000000000
               3 finished:   13.1350020000000      result:   5.00000000000000
     

    Patrick

     

    Nathan_Weeks
    Beginner

    Hi Patrick,

    While your output demonstrates that image 4 may have made progress slightly before image 1 executed SYNC IMAGES(*), it still demonstrates the general problem in this code segment executed by images 2-4:

    else
        SYNC IMAGES(1)
        call CPU_TIME(time)
        write (*,*) THIS_IMAGE(), ': after first sync', time
        X = X[1]
        call CPU_TIME(time)
        write (*,*) THIS_IMAGE(), ': before second sync', time

    It takes images 2-4 about 13 seconds to execute the write statement and the assignment between the two calls to CPU_TIME(); this is apparent when the output is reordered thus:

               1 : after first sync  1.199700000000000E-002
               1 : before second sync   13.1350020000000
               2 : after first sync  1.099800000000000E-002
               2 : before second sync   13.1370010000000
               3 : after first sync  8.997000000000000E-003
               3 : before second sync   13.1350020000000
               4 : after first sync  8.998000000000001E-003
               4 : before second sync   13.1340020000000
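
For anyone who wants to repeat this regrouping on their own runs, a stable numeric sort on the first field is one way to do it. This is only a sketch: `group_by_image` is a name made up for illustration, and it assumes the image number is the first whitespace-separated field of every line, as in the list-directed output above. The `-s` (stable) option is available in GNU and BSD sort.

```shell
# Group coarray program output by image number while keeping each image's
# lines in the order they appeared in the input.
# group_by_image is a hypothetical helper name.
group_by_image() {
    # -n: compare field 1 numerically (leading blanks are ignored)
    # -s: stable sort, so lines from the same image keep their order
    sort -s -n -k1,1
}

# Example with a few lines copied from the run above.
group_by_image <<'EOF'
           3 : after first sync  8.997000000000000E-003
           1 : after first sync  1.199700000000000E-002
           1 : before second sync   13.1350020000000
           3 : before second sync   13.1350020000000
EOF
```

In practice the program output would be piped straight in, for example `./a.out | group_by_image`.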

    It is at least counterintuitive that such one-sided "gets" of relatively little data would take that long. I've observed similar behavior in a larger code where the "work" loop on image 1 took minutes, and images > 1 seemed to stall until the next synchronization with image 1 before completing the one-sided "gets" that retrieve the data they need to begin their work (if you wish, I could arrange for you to receive the code & data).

    Many thanks for looking into this!

    pbkenned1
    Employee

    Hi Nathan,

    Thanks for the feedback, I understand what you're saying.  I simplified the program by only doing one SYNC IMAGES() for each image:

    [U538012]$ cat U538012-one-sync-img.f90
    program test_coarray
       implicit none
       integer :: i, X(2,2)[*]
       double precision :: time, val = 1.0
       if (THIS_IMAGE() == 1) then
          X = 1
          SYNC IMAGES(*)
          call CPU_TIME(time)
          write (*,*) THIS_IMAGE(), ': before image 1 loop', time
          do i = 1,2**30
             val = val + COS(DBLE(i))
          end do
          call CPU_TIME(time)
          write (*,*) THIS_IMAGE(), ': after image 1 loop', time
       else
          SYNC IMAGES(1)
          call CPU_TIME(time)
          write (*,*) THIS_IMAGE(), ': before X = X[1]', time
          X = X(:,:)[1]
          call CPU_TIME(time)
          write (*,*) THIS_IMAGE(), ': after X = X[1]', time
       end if
       call cpu_time(time)

       write (*,*) THIS_IMAGE(), 'finished:', time, 'result:', SUM(X)+val

    end program test_coarray

     

    [U538012]$ ifort -coarray=shared -coarray-num-images=4 U538012-one-sync-img.f90 -o U538012-one-sync-img.x
    [U538012]$ ./U538012-one-sync-img.x
               1 : before image 1 loop  4.998000000000000E-003
               2 : before X = X[1]  4.998000000000000E-003
               3 : before X = X[1]  5.998000000000000E-003
               4 : before X = X[1]  6.998000000000000E-003
               1 : after image 1 loop   13.1639980000000
               1 finished:   13.1639980000000      result:   4.32834934995579
               2 : after X = X[1]   13.1629980000000
               2 finished:   13.1629980000000      result:   5.00000000000000
               3 : after X = X[1]   13.1639980000000
               3 finished:   13.1639980000000      result:   5.00000000000000
               4 : after X = X[1]   13.1649980000000
               4 finished:   13.1649980000000      result:   5.00000000000000
    [U538012]$
     

     

    It's clear that images 2-4 all wait for image 1 to complete the 'work' loop and finish the 'SUM(X)+val' calculation before progressing.  I'm not sure if this is expected or not, but certainly images 2-4 would not compute identical values for the 'result' if image 1 hadn't completed its 'work' loop.  I'll inquire with the developers.
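
To put a number on the stall for each image, the two timestamps can be subtracted with a small awk filter. This is a rough sketch that assumes the output format shown above; `get_elapsed` is a hypothetical helper name, and the values are whatever CPU_TIME reported, which is CPU time rather than wall-clock time.

```shell
# For each image, print the time reported between its
# "before X = X[1]" line and its "after X = X[1]" line.
# get_elapsed is a hypothetical helper name.
get_elapsed() {
    awk '
        # Remember the "before" timestamp, keyed by image number (field 1).
        / : before X = X\[1\]/ { start[$1] = $NF }
        # On the matching "after" line, print the difference.
        / : after X = X\[1\]/  { printf "image %d: %.3f s\n", $1, $NF - start[$1] }
    '
}

# Usage (the executable name is the one built above):
#   ./U538012-one-sync-img.x | get_elapsed
```

Image 1 never prints these two lines, so it simply produces no output from the filter.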

     

    Thanks,

    Patrick

    pbkenned1
    Employee

    Apparently there have been some issues with SYNC IMAGES(*), in particular when an image references itself, i.e., the image 1 code in the test case.  I was encouraged to report this, which I have done under tracking ID DPD200365175.  I'll keep this thread updated with any progress.

     

    Patrick

    John_D_6
    New Contributor I

    Hi Nathan,

    I noticed the same thing. The solution seems to be to set the asynchronous progress for the Intel MPI-library. You can do this by setting the environment variable MPICH_ASYNC_PROGRESS:

    export MPICH_ASYNC_PROGRESS=1

    and your example code runs as expected:

    $ MPICH_ASYNC_PROGRESS=1 ./forum
               3 : before X = X[1]  7.998000000000000E-003
               3 : after X = X[1]  7.998000000000000E-003
               3 finished:  7.998000000000000E-003 result:   5.00000000000000    
               2 : before X = X[1]  6.998000000000000E-003
               2 : after X = X[1]  7.998000000000000E-003
               2 finished:  7.998000000000000E-003 result:   5.00000000000000    
               1 : before image 1 loop  7.998000000000000E-003
               4 : before X = X[1]  5.998000000000000E-003
               4 : after X = X[1]  5.998000000000000E-003
               4 finished:  5.998000000000000E-003 result:   5.00000000000000    
               1 : after image 1 loop   28.5566580000000    
               1 finished:   28.5576570000000      result:   4.32834934995579    

    However, it does create another problem: the SYNC MEMORY statement (not in your code, but it is in mine) often hangs the application (apparently after references to coarray variables). I could circumvent the problem by using SYNC IMAGES(THIS_IMAGE()) as an alternative to SYNC MEMORY, which according to the standard should have at least the same effect as a SYNC MEMORY statement.

    And one more warning: asynchronous progress also changes the thread support level of the MPI library in use from MPI_THREAD_FUNNELED to MPI_THREAD_MULTIPLE. This could lower MPI performance if you're combining MPI and Coarray Fortran in one application.
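
Given those side effects, it may be preferable to enable asynchronous progress only for the runs that need it instead of exporting the variable for the whole shell session. A minimal sketch of that idea follows; `run_with_async_progress` is a hypothetical helper name, and `./forum` is just the executable name from my run above.

```shell
# Run a single command with MPICH_ASYNC_PROGRESS=1 in its environment,
# leaving the calling shell's environment untouched.
# run_with_async_progress is a hypothetical helper name.
run_with_async_progress() {
    env MPICH_ASYNC_PROGRESS=1 "$@"
}

# Usage:
#   run_with_async_progress ./forum
```

Other programs started from the same shell then keep the default progress behavior and thread support level.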

    Nathan_Weeks
    Beginner

    Hi John,

    Many thanks for sharing that info. Indeed, after setting the MPICH_ASYNC_PROGRESS environment variable, I was able to remove an extra set of SYNC ALL statements from our larger coarray program. While it's probably a separate issue, this larger program works only with I_MPI_FABRICS=shm:tcp, as "shm:dapl" or "shm:ofa" produce incorrect results or run-time segfaults.
