Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

slow coarray fortran program

peter_tromso
Beginner
1,616 Views
Hi

We have just started trying Fortran coarrays and observed a surprising behaviour.
Running a basic hello.f90 code works fine. However, if I add a "sync all" at the beginning of the code, the program takes a very long time (64 seconds) to execute IF more than 15 images are used on a distributed architecture. If 15 or fewer images are used, the program executes normally (3 seconds or so). I can't think of any process that is so different for 15 versus 16 images.
Of course, coarrays are useless to us until we know how to fix this.

Thanks for any clue of what is going on here!

Peter Wind
System administrator
Tromsø, Norway

Details:

Cluster (704 nodes, 8 cores/node), http://www.notur.no/hardware/stallo/
We run on 2 nodes over an InfiniBand network.

$ uname -a
Linux c18-3.local 2.6.18-194.11.3.el5 #1 SMP Fri Sep 17 09:50:20 PDT 2010 x86_64 x86_64 x86_64 GNU/Linux
$ ifort --version
ifort (IFORT) 12.0.4 20110427
$ cat hello.f90
program Hello_World
integer :: i      ! Local variable
integer :: num[*] ! Scalar coarray
sync all ! Barrier to make sure the data has arrived
if (this_image() == 1) then
  ! Distribute information to other images
  do i = 2, num_images()
    num[i] = i*2
  end do
end if
sync all ! Barrier to make sure the data has arrived
! I/O from all nodes
write(*,'(a,i0,a,i0)') 'Hello ', num, ' from image ', this_image()
end program Hello_World

$ ifort -coarray=distributed -coarray-config-file=conffile hello.f90
$ cat conffile
-envall -n 15 ./a.out

$ time mpirun a.out
Hello 20 from image 10
Hello 22 from image 11
Hello 0 from image 1
Hello 24 from image 12
Hello 26 from image 13
Hello 28 from image 14
Hello 30 from image 15
Hello 4 from image 2
Hello 8 from image 4
Hello 18 from image 9
Hello 6 from image 3
Hello 14 from image 7
Hello 12 from image 6
Hello 10 from image 5
Hello 16 from image 8

real 0m3.880s
user 0m0.623s
sys 0m0.755s

$ cat conffile
-envall -n 16 ./a.out

$ time mpirun a.out
Hello 4 from image 2
Hello 10 from image 5
Hello 6 from image 3
Hello 16 from image 8
Hello 8 from image 4
Hello 12 from image 6
Hello 22 from image 11
Hello 14 from image 7
Hello 26 from image 13
Hello 18 from image 9
Hello 28 from image 14
Hello 20 from image 10
Hello 24 from image 12
Hello 30 from image 15
Hello 32 from image 16
Hello 0 from image 1

real 1m4.368s
user 0m0.614s
sys 0m0.723s
5 Replies

jimdempseyatthecove
Honored Contributor III
In your problem configuration I notice image 1 displays last. Is there a pause in the printout between all the other images' reports and image 1's? (This might be helpful information for the development team.)

This may not be pertinent to this test, but you are not initializing num for image 1; try initializing it and see if the symptom changes.

    What happens when the first sync all is omitted?
    (in this example the first one is not required)
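A minimal sketch of the posted example with both changes applied (assuming num is meant to be a scalar coarray, num[*], as the comment in the original listing indicates; the initialization of image 1's copy is the only addition):

```fortran
program Hello_World
  integer :: i
  integer :: num[*]        ! scalar coarray
  ! No leading "sync all": nothing has been communicated yet.
  if (this_image() == 1) then
    num = 0                ! initialize image 1's own copy
    do i = 2, num_images()
      num[i] = i*2         ! push a value to each other image
    end do
  end if
  sync all                 ! barrier before the other images read num
  write(*,'(a,i0,a,i0)') 'Hello ', num, ' from image ', this_image()
end program Hello_World
```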

    Jim Dempsey
Ron_Green
Moderator
    We'll take a look at this, of course.

    But I do need to set your expectations: our Coarray implementation to date has focused on core functionality. We have done ZERO work on performance. If you are expecting this to perform anywhere near an MPI program, you are in for a disappointment.

We will begin working on performance in the coming years. It will take a while to get performance where it should be with coarrays. All new technology is like this; remember the first implementations of array syntax in Fortran 90: initially performance was not good and some said to "keep writing F77 loops". Of course, the technology matured and is now comparable to (or better than) hand-written loops. It is the same with coarrays; it could take several years to get performance where we'd like it.

    Do not expect good performance at this early stage of development.

    ron
Ron_Green
Moderator
    Peter,

    Also, you should not be using mpirun or mpiexec to launch the program. Here are some notes for distributed memory CAF:

    http://software.intel.com/en-us/articles/distributed-memory-coarray-fortran-with-the-intel-fortran-compiler-for-linux-essential-guide/
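As a concrete example of what that means (following the article above; the point is that there is no mpirun/mpiexec in front of the binary, since with -coarray=distributed the CAF runtime reads the config file and starts the images itself):

```shell
$ ifort -coarray=distributed -coarray-config-file=conffile hello.f90
$ ./a.out
```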

Also, are you timing the job launch (mpiexec) or the program itself? I would put timers in the main program and have image 1 print out the actual application runtime. 'time', as you show it, also captures the job launch, which may or may not be of interest.
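A sketch of that kind of in-program timing, using the standard system_clock intrinsic (the timed region is the barrier-plus-distribution part of the posted example; num declared as a scalar coarray, num[*], is an assumption carried over from the original listing):

```fortran
program timed_hello
  integer :: i, t0, t1, rate
  integer :: num[*]
  call system_clock(t0, rate)   ! start timer after the images are up
  sync all
  if (this_image() == 1) then
    do i = 2, num_images()
      num[i] = i*2
    end do
  end if
  sync all
  call system_clock(t1)
  if (this_image() == 1) then   ! image 1 reports the application runtime
    write(*,'(a,f0.3,a)') 'application time: ', real(t1-t0)/rate, ' s'
  end if
end program timed_hello
```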

    ron
peter_tromso
Beginner


Thank you for your attention, Ron.

I didn't know that the development was at such an early stage. We will then not recommend that scientists use CAF yet, if performance is an issue now.


I will try to answer your remarks:

mpiexec gives the same results (it is where I started, actually).
I have also read the link you refer to in detail.

When the first "sync all" is omitted there is no problem. However, test codes show the same slowdown if a similar loop is present after any "sync all".

Initializing num[1] gives the same result in terms of runtime. (Image 1 appears at a random place in the output order, third place for example.)

In the example above, there is no pause in the "Hello" prints: they all print out roughly simultaneously at the end of the run.
    I have also tested to put a write statement in the loop:
do i = 1, num_images()
  write(*,*) i
  num = i*2
end do

    Then there is no delay between the writes from i=1 to 10, then a long delay between 10 and 11, then varying delays between the others (!)

Would you have the possibility to run this test in a similar configuration? (2 nodes * 8 cores; InfiniBand is not necessary, I have noticed.) It would be interesting to know if some special feature of our configuration triggers this behaviour.


    Peter