Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

slow coarray fortran program

peter_tromso
Beginner
1,616 Views
Hi

We have just started trying Fortran coarrays and observed a surprising behaviour.
Running a basic hello.f90 code works fine. However, if I add a "sync all" at the beginning of the code, the program takes a very long time (64 seconds) to execute IF more than 15 images are used on a distributed architecture. If 15 or fewer images are used, the program executes normally (3 seconds or so). I can't think of any process that is so different for 15 versus 16 images.
Of course, coarrays are useless to us until we know how to fix this.

Thanks for any clue of what is going on here!

Peter Wind
System administrator
Tromsø, Norway

Details:

Cluster (704 nodes, 8 cores/node), http://www.notur.no/hardware/stallo/
We run on 2 nodes over an InfiniBand network.

$ uname -a
Linux c18-3.local 2.6.18-194.11.3.el5 #1 SMP Fri Sep 17 09:50:20 PDT 2010 x86_64 x86_64 x86_64 GNU/Linux
$ ifort --version
ifort (IFORT) 12.0.4 20110427
$ cat hello.f90
program Hello_World
integer :: i      ! Local variable
integer :: num[*] ! Scalar coarray
sync all ! Barrier to make sure the data has arrived
if (this_image() == 1) then
  ! Distribute information to other images
  do i = 2, num_images()
    num[i] = i*2
  end do
end if
sync all ! Barrier to make sure the data has arrived
! I/O from all nodes
write(*,'(a,i0,a,i0)') 'Hello ', num, ' from image ', this_image()
end program Hello_World

$ ifort -coarray=distributed -coarray-config-file=conffile hello.f90
$ cat conffile
-envall -n 15 ./a.out

$ time mpirun a.out
Hello 20 from image 10
Hello 22 from image 11
Hello 0 from image 1
Hello 24 from image 12
Hello 26 from image 13
Hello 28 from image 14
Hello 30 from image 15
Hello 4 from image 2
Hello 8 from image 4
Hello 18 from image 9
Hello 6 from image 3
Hello 14 from image 7
Hello 12 from image 6
Hello 10 from image 5
Hello 16 from image 8

real 0m3.880s
user 0m0.623s
sys 0m0.755s

$ cat conffile
-envall -n 16 ./a.out

$ time mpirun a.out
Hello 4 from image 2
Hello 10 from image 5
Hello 6 from image 3
Hello 16 from image 8
Hello 8 from image 4
Hello 12 from image 6
Hello 22 from image 11
Hello 14 from image 7
Hello 26 from image 13
Hello 18 from image 9
Hello 28 from image 14
Hello 20 from image 10
Hello 24 from image 12
Hello 30 from image 15
Hello 32 from image 16
Hello 0 from image 1

real 1m4.368s
user 0m0.614s
sys 0m0.723s
5 Replies

jimdempseyatthecove
Honored Contributor III
In your problem configuration I notice image 1 displays last. Is there a pause in the printout between all the other images' reports and image 1's? (This might be helpful information for the development team.)

This may not be pertinent to this test, but you are not initializing num for image 1; try initializing it and see if the symptom changes.

    What happens when the first sync all is omitted?
    (in this example the first one is not required)
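A minimal sketch of the posted example with both changes applied (assuming num is meant to be a scalar coarray, num[*], as the comment in the original listing indicates; the initialization of image 1's copy is the only addition):

```fortran
program Hello_World
  integer :: i
  integer :: num[*]        ! scalar coarray
  ! No leading "sync all": nothing has been communicated yet.
  if (this_image() == 1) then
    num = 0                ! initialize image 1's own copy
    do i = 2, num_images()
      num[i] = i*2         ! push a value to each other image
    end do
  end if
  sync all                 ! barrier before the other images read num
  write(*,'(a,i0,a,i0)') 'Hello ', num, ' from image ', this_image()
end program Hello_World
```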

    Jim Dempsey
Ron_Green
Moderator
    We'll take a look at this, of course.

    But I do need to set your expectations: our Coarray implementation to date has focused on core functionality. We have done ZERO work on performance. If you are expecting this to perform anywhere near an MPI program, you are in for a disappointment.

We will begin working on performance in the coming years. It will take a while to get performance where it should be with coarrays. All new technology is like this; remember the first implementations of array syntax in Fortran 90: initially performance was not good and some said to "keep writing F77 loops". Of course, the technology matured and is now comparable to (or better than) hand-written loops. It is the same with coarrays; it could take several years to get performance where we'd like it.

    Do not expect good performance at this early stage of development.

    ron
Ron_Green
Moderator
    Peter,

    Also, you should not be using mpirun or mpiexec to launch the program. Here are some notes for distributed memory CAF:

    http://software.intel.com/en-us/articles/distributed-memory-coarray-fortran-with-the-intel-fortran-compiler-for-linux-essential-guide/
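As a concrete example of what that means (following the article above; the point is that there is no mpirun/mpiexec in front of the binary, since with -coarray=distributed the CAF runtime reads the config file and starts the images itself):

```shell
$ ifort -coarray=distributed -coarray-config-file=conffile hello.f90
$ ./a.out
```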

Also, are you timing the job launch (mpiexec) or the program itself? I would put timers in the main program and have image 1 print out the actual application runtime. 'time', as you show it, also captures the job launch, which may or may not be of interest.
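A sketch of that kind of in-program timing, using the standard system_clock intrinsic (the timed region is the barrier-plus-distribution part of the posted example; num declared as a scalar coarray, num[*], is an assumption carried over from the original listing):

```fortran
program timed_hello
  integer :: i, t0, t1, rate
  integer :: num[*]
  call system_clock(t0, rate)   ! start timer after the images are up
  sync all
  if (this_image() == 1) then
    do i = 2, num_images()
      num[i] = i*2
    end do
  end if
  sync all
  call system_clock(t1)
  if (this_image() == 1) then   ! image 1 reports the application runtime
    write(*,'(a,f0.3,a)') 'application time: ', real(t1-t0)/rate, ' s'
  end if
end program timed_hello
```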

    ron
peter_tromso
Beginner


Thank you for your attention, Ron.

I didn't know that the development was at such an early stage. We will then not recommend that scientists use CAF yet, if performance is an issue now.


I will try to answer your remarks:

mpiexec gives the same results (it is where I started, actually).
I have also read the link you refer to in detail.

When the first "sync all" is omitted there is no problem. However, test codes show the same slowdown if a similar loop is present after any "sync all".

Initializing num[1] gives the same result in terms of runtime. (Image 1 appears at a random place in the output order, third place for example.)

In the example above, there is no pause in the "Hello" prints: they all print out roughly simultaneously at the end of the run.
    I have also tested to put a write statement in the loop:
do i = 1, num_images()
  write(*,*) i
  num = i*2
end do

    Then there is no delay between the writes from i=1 to 10, then a long delay between 10 and 11, then varying delays between the others (!)

Would you have the possibility to run this test in a similar configuration? (2 nodes * 8 cores; InfiniBand is not necessary, I have noticed.) It would be interesting to know if some special feature of our configuration triggers this behaviour.


    Peter