Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Intel Fortran Coarray problem

Phy_Soham
Novice
557 Views

The following example code divides a large work (NFC_tot) into chunks. The Images then pick up a piece and start working whenever available and also update a variable (which resides only in Image-1) to let other images know that piece is done. This update is done inside a critical section, as shown in the code. Image 1 does not work in this dummy example.

The problem with ifort coarray: only one image (randomly) picks up work and starts working. While other images never pick a work. As a result, only one image does all the pieces serially (As shown in the output).

gfortran (with OpenCoarrays) does not produce this bug.

I am using Ifort Classic (version-ifort (IFORT) 2021.7.0 20220726).
gfortran version( 9.4.0)

Code:

program IntelCoarrayBug

    implicit none

    integer                                                 :: NFC_tot

    ! -------------------------------------- Dynamic Schedule -------------------------------------- !

    integer, allocatable, dimension(:,:), codimension[:]    :: ChnkArr
    integer                                                 :: MyChnkStrt[*]

    integer                                                 :: Nchunk
    integer, dimension(2)                                   :: MyAssignedChunk

    integer                                                 :: ch, strt, stp, &
                                                             & img, Chnklen, ChnkStride, ChnkStrt, &
                                                             & ChnkStop, Indxloop, PickFlag
    ! -------------------------------------- Dynamic Schedule -------------------------------------- !

    NFC_tot = 1280
    Nchunk = 22

    Chnklen = NFC_tot / Nchunk

    if ( this_image() == 1 ) write(*, 10) NFC_tot, Nchunk, Chnklen, num_images()

    allocate( ChnkArr(3, Nchunk)[*] )
    SYNC ALL

    OnlyImg1: if ( this_image() == 1 ) then

        ! ---------========= Devide the FCs to be covered into chunks of size Nchunk =========--------- !

        chnkLoop: do ch = 1, Nchunk

            strt = ((ch - 1) * Chnklen) + 1
            stp = strt + Chnklen - 1

            if ( (ch == Nchunk) .and. (stp /= NFC_tot) ) stp = NFC_tot

            ChnkArr(:, ch) = (/0, strt, stp/) !First element 0 => this chunk(strt->stop) not done
                                              !              1 => this chunk(strt->stop) done

        end do chnkLoop

        ! -------------================ Starting Position for each Image ================------------- !

        ChnkStride = Nchunk / ( num_images() - 1 )

        do img = 1, (num_images() - 1)

            ChnkStrt = ((img - 1) * ChnkStride) + 1
            ChnkStop = ChnkStrt + ChnkStride - 1

            if ( (img == (num_images()-1) ) .and. (ChnkStop /= Nchunk) ) ChnkStop = Nchunk

            ! ** img+1 ** !
            MyChnkStrt[img+1] = ChnkStrt
            ! ** img+1 ** !

        end do

        ! -------------================ Starting Position for each Image ================------------- !

        ! ---------========== print the Chunk and Processor assignment information ==========---------- !
        write(*, *)
        write(*, 20) this_image()
        writeLoop: do ch = 1, Nchunk

            write(*, 30) ChnkArr(:, ch), (ChnkArr(3, ch)-ChnkArr(2, ch)+1)

        end do writeLoop
        write(*, *)

    end if OnlyImg1

    SYNC ALL

    ! * Broadcast to all images * !
    call co_broadcast(ChnkArr, 1)

    write(*, 40) this_image(), MyChnkStrt
    call execute_command_line(' ')

    SYNC ALL
    ! ---------========== print the Chunk and Processor assignment information ==========---------- !

    if ( this_image() == 1 ) write(*, '(/)')

    SYNC ALL

    Indxloop = MyChnkStrt

    ! Image 1 will not work in this dummy test example, but all the other Images read value from ChnkArr 
    ! which resides in image 1, and also update it when a particular chunk is done (with value 1) as 
    ! implemented below
    NotImage1: if ( this_image() /= 1 ) then

        DynSchdl: do

            AccessImg1: CRITICAL

                PickFlag = ChnkArr(1, Indxloop) [1]

                if ( PickFlag == 0 ) ChnkArr(1, Indxloop) [1] = 1

            END CRITICAL AccessImg1

            StartWork: if ( PickFlag == 0 ) then

                MyAssignedChunk = ChnkArr(2:3, Indxloop)
                write(*, 50) this_image(), ChnkArr(2, Indxloop), ChnkArr(3, Indxloop)

                ! Dummy work
                call sleep( this_image() + &
                          & int( 0.01*(MyAssignedChunk(2)-MyAssignedChunk(1)) ) ) 

                write(*, 60) this_image(), ChnkArr(2, Indxloop), ChnkArr(3, Indxloop)

            end if StartWork

            ! Go to next chunk of ChnkArr in a circular manner and pickup work if that is not done
            Indxloop = mod( Indxloop, Nchunk ) + 1

            if ( Indxloop == MyChnkStrt ) EXIT DynSchdl

        end do DynSchdl

    end if NotImage1

    SYNC ALL

    10 FORMAT('NFC_tot = ', I6, ' , Nchunk = ', I6, ' , Chnklen = ', I6, ' Number of Images = ', I6)
    20 FORMAT('Writing Chunks info from image ', I5)
    30 FORMAT('[ ', 2(I6, ' , '), I6, ' ], len = ', I6)
    40 FORMAT('Image No. = ', I5, ' Starting Point = ', I6)
    50 FORMAT('Image No. = ', I5, ' : Started working on  ', I6, ' => ', I6)
    60 FORMAT('Image No. = ', I5, ' : Finished working on ', I6, ' => ', I6)

end program IntelCoarrayBug

ifort compilation: ifort -coarray=distributed -coarray-num-images=6 IntelCoarrayBug.f90
ifort result: ./a.out :

.
.
.
Image No. =     6 : Started working on     929 =>    986
Image No. =     6 : Finished working on    929 =>    986
Image No. =     6 : Started working on     987 =>   1044
Image No. =     6 : Finished working on    987 =>   1044
Image No. =     6 : Started working on    1045 =>   1102
Image No. =     6 : Finished working on   1045 =>   1102
Image No. =     6 : Started working on    1103 =>   1160
Image No. =     6 : Finished working on   1103 =>   1160
Image No. =     6 : Started working on    1161 =>   1218
Image No. =     6 : Finished working on   1161 =>   1218
.
.
.

 

gfortran compilation: gfortran -fcoarray=lib -L/opt/opencoarrays/lib IntelCoarrayBug.f90 -lcaf_mpi
gfortran result: mpirun -np 6 ./a.out :

.
.
.
Image No. =     5 : Started working on     697 =>    754
Image No. =     4 : Started working on     465 =>    522
Image No. =     6 : Started working on     929 =>    986
Image No. =     3 : Started working on     233 =>    290
Image No. =     2 : Started working on       1 =>     58
Image No. =     2 : Finished working on      1 =>     58
Image No. =     2 : Started working on      59 =>    116
Image No. =     3 : Finished working on    233 =>    290
Image No. =     3 : Started working on     291 =>    348
Image No. =     4 : Finished working on    465 =>    522
.
.
.

 

0 Kudos
1 Solution
jimdempseyatthecove
Black Belt
472 Views

Try replacing:

            AccessImg1: CRITICAL

                PickFlag = ChnkArr(1, Indxloop) [1]

                if ( PickFlag == 0 ) ChnkArr(1, Indxloop) [1] = 1

            END CRITICAL AccessImg1

with:

 

call atomic_fetch_or(ChnkArr(1, indxloop)[1]), 1, PickFlag)

Jim Dempsey

 

View solution in original post

0 Kudos
9 Replies
Barbara_P_Intel
Moderator
515 Views

I just compiled and ran it successfully like this:

$ ifort --version
ifort (IFORT) 2021.7.0 20220726
Copyright (C) 1985-2022 Intel Corporation.  All rights reserved.
$ ifort -coarray=shared -coarray-num-images=6 IntelCoarrayBug.f90
$ a.out
<snip>
Image No. =     6 : Started working on     929 =>    986
Image No. =     3 : Started working on     233 =>    290
Image No. =     5 : Started working on     697 =>    754
Image No. =     3 : Finished working on    233 =>    290
Image No. =     3 : Started working on     291 =>    348
Image No. =     5 : Finished working on    697 =>    754
Image No. =     5 : Started working on     755 =>    812
Image No. =     6 : Finished working on    929 =>    986
Image No. =     6 : Started working on     987 =>   1044
Image No. =     2 : Started working on       1 =>     58
Image No. =     3 : Finished working on    291 =>    348
<snip>

Can you try "shared"?

Works with ifx using "shared", too!

0 Kudos
Phy_Soham
Novice
503 Views

Hi   ,

 
 

Thanks a lot for your response.
I will install the new version and try ifx also.


It also works for '-coarray=distributed' when the number of processors given is large enough.
For a small number of images, only one image does the work serially in my case.


Also, work distribution is not uniform for Intel Coarray case.
This output from my actual application:


Intel Coarray:

Image No.     3: No. of call :1
Image No.     7: No. of call :0
Image No.     9: No. of call :5
Image No.    11: No. of call :7
Image No.    15: No. of call :17
Image No.     2: No. of call :7
Image No.     4: No. of call :12
Image No.     6: No. of call :16
Image No.     8: No. of call :17
Image No.    10: No. of call :0
Image No.    12: No. of call :17
Image No.    14: No. of call :7
Image No.    16: No. of call :10
Image No.     5: No. of call :8
Image No.    13: No. of call :0

gfortran OpenCoarrays:

Image No.     2: No. of call :9
Image No.     3: No. of call :8
Image No.     4: No. of call :8
Image No.     5: No. of call :8
Image No.     6: No. of call :9
Image No.     7: No. of call :8
Image No.     8: No. of call :8
Image No.     9: No. of call :8
Image No.    10: No. of call :8
Image No.    11: No. of call :8
Image No.    12: No. of call :9
Image No.    13: No. of call :9
Image No.    14: No. of call :8
Image No.    15: No. of call :8
Image No.    16: No. of call :8

Image 7 and Image 13 never get any call in this particular run, for example. Only some Images get most of the call. Execution time is much higher as a result.

Can you give any insight about this?

0 Kudos
jimdempseyatthecove
Black Belt
473 Views

Try replacing:

            AccessImg1: CRITICAL

                PickFlag = ChnkArr(1, Indxloop) [1]

                if ( PickFlag == 0 ) ChnkArr(1, Indxloop) [1] = 1

            END CRITICAL AccessImg1

with:

 

call atomic_fetch_or(ChnkArr(1, indxloop)[1]), 1, PickFlag)

Jim Dempsey

 

0 Kudos
Barbara_P_Intel
Moderator
466 Views

The Intel Fortran compilers implement CAF (coarray Fortran) using Intel MPI. For "distributed" some setup is involved. This article describes the setup, defining what servers you want to run on and how to compile. 

hmmm... the article needs updating. I'll do that. In the meantime, skip Step 2.

 

0 Kudos
Barbara_P_Intel
Moderator
441 Views

I learned something new about the CAF implementation from the compiler developer that I thought I should share.

When you compile with "distributed" and with no configuration file like you are doing,

ifort -coarray=distributed -coarray-num-images=6 coarray.f90

at runtime the executable defaults to "shared" on the current machine.

I compiled with "distributed" (as above) and ran it. The output is fine and used all 6 images.

 

0 Kudos
Phy_Soham
Novice
463 Views

Hi all,
Thank you all for your comment.

I think the critical section is still needed as atomic_fetch_or can not avoid the race condition between images. ( See the output of the program below). I see no improvement with calling atomic_fetch_or inside the critical section.

I am trying to avoid atomic operations as other implementations do not have this feature yet, so portability may be an issue.

Thanks for the link to the article on intel caf.

 

program atomic_fetch_test

    implicit none

    integer                     :: atom [*]
    integer                     :: read_atom

    atom = 0

    SYNC ALL

    if ( this_image() /= 1 ) then

        call atomic_fetch_or( atom[1], 1, read_atom )
        write(*, *) 'Image number ', this_image(), ' reads atom value: ', read_atom
        call execute_command_line(' ')

    end if

    SYNC ALL


    if ( this_image() == 1) write(*, *) 'atom value on image 1 after operation of ', num_images(), &
                                      & ' images : ', atom[1] , ' correct result = ', (num_images() - 1)

    ! result is undefined between runs

end program atomic_fetch_test
0 Kudos
Phy_Soham
Novice
459 Views

Sorry,

It is bit wise or operation.

atomic_fetch_or serves the purpose correctly.

Thank you all.

0 Kudos
jimdempseyatthecove
Black Belt
459 Views

>>I think the critical section is still needed as atomic_fetch_or can not avoid the race condition between images.

Your test code is incorrect. There should be no race condition.

 

The atomic_fetch_or does not prevent a rank from or-ing a 1 with a 1. It is not intended to do this. The intention is for only one thread to receive the 0.

 

program atomic_fetch_test

    implicit none

    integer                     :: atom [*]
    integer                     :: count [*]
    integer                     :: read_atom

    atom = 0
    count = 0

    SYNC ALL

    if ( this_image() /= 1 ) then

        call atomic_fetch_or( atom[1], 1, read_atom )
        if(read_atom == 0) call atomic_add(count[1], 1)
        write(*, *) 'Image number ', this_image(), ' reads atom value: ', read_atom
        call execute_command_line(' ')

    end if

    SYNC ALL


    if ( this_image() == 1) write(*, *) 'count value on image 1 after operation of ', num_images(), &
                                      & ' images : ', count[1] , ' correct result = 1'

end program atomic_fetch_test

Expected results:

Only one rank reports 0 for read_atom, all others report 1

At end of program, count == 1

 

Jim Dempsey

 

jimdempseyatthecove
Black Belt
440 Views

Should this report a warning (or log to event file)?

 

Jim Dempsey

0 Kudos
Reply