Intel® Fortran Compiler

Distributed Coarray Fortran: misunderstanding/bug?

Jeremie_V_

Dear all,

Below is a Coarray Fortran program that is giving me some trouble.

A large vector (x) is updated in two different ways. For large sizes of x, the update of x is wrong when each image runs on a different node. When all images are on the same node, the results are always correct, whatever the size of x.

When size(x)=10^6, exchanging the full array across images on different nodes gives wrong results, whereas exchanging small subsets of x gives correct results.

When size(x)>2*10^7, exchanging the full array across images on different nodes gives wrong results, and exchanging subsets of x (size(subset) > 6*10^6) gives wrong results too.

The problem seems to be linked to the size of the array that is exchanged between images on different nodes. Am I doing something wrong, or could it be a bug?
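The observation that small subsets transfer correctly suggests one experiment: fetching the remote array in fixed-size pieces instead of in a single transfer. The following is a minimal sketch of that idea, not code from this thread. The program name chunked_get, the variable names, and the chunk size of 100000 elements are all hypothetical, and nothing here confirms that chunking avoids the problem with ifort 17.0.0.

```fortran
program chunked_get
  implicit none
  integer, parameter :: neq = 1000000     ! array size taken from the post
  integer, parameter :: chunk = 100000    ! hypothetical chunk size, not a tuned value
  real(kind=8), allocatable :: x(:)[:]
  integer :: img, lo, hi

  allocate(x(neq)[*])
  x = real(this_image(), 8)               ! every image fills x with its own index
  sync all                                ! all images must finish writing before any read

  if (this_image() == 1) then
    do img = 2, num_images()
      ! Read the remote array piece by piece instead of in one large get.
      do lo = 1, neq, chunk
        hi = min(lo + chunk - 1, neq)
        x(lo:hi) = x(lo:hi) + x(lo:hi)[img]
      end do
    end do
    write(*,*) ' Chunked sum: ', sum(x)
  end if
  sync all                                ! keep remote data alive until image 1 is done
end program chunked_get
```

With 4 images, image 1 should print neq*(1+2+3+4) = 10000000 if every chunk arrives intact, so a different value would point at the transfer rather than at the program logic.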

I use ifort 17.0.0 with -coarray=distributed.

Here is a program that reproduces the problem (it may look naive, with more sync all statements than necessary, but its only purpose is to replicate my issue):

program testcoarray
 implicit none
 integer(kind=4)::i,j,k,neq
 integer(kind=4)::startrow[*],endrow[*]
 real(kind=8)::val[*]
 real(kind=8),allocatable::x(:)[:]

 neq=1000000
 neq=25806732   !overrides the line above; comment it out to run the small case

 if(this_image().eq.1)then
  write(*,'(/a,i0)')' Size of the array: ',neq
  write(*,'(a,i0/)')' Number of images : ',num_images()
 endif

 !INITIALISATION
 i=neq/num_images()
 startrow=(this_image()-1)*i+1
 endrow=this_image()*i
 if(this_image().eq.num_images())endrow=neq

 allocate(x(neq)[*])
 sync all

 !FIRST UPDATE: image 1 gets the full array from every other image
 x=0.d0
 x(startrow:endrow)=real(this_image(),8)
 sync all
 if(this_image().eq.1)then
  do i=2,num_images()
   x=x+x(:)[i]
  enddo
  write(*,*)' First update : ',sum(x)
 endif
 sync all

 !SECOND UPDATE: image 1 gets only the block owned by each other image
 x=0.d0
 x(startrow:endrow)=real(this_image(),8)
 sync all
 if(this_image().eq.1)then
  do i=2,num_images()
   j=startrow[i]
   k=endrow[i]
   x(j:k)=x(j:k)+x(j:k)[i]
  enddo
  write(*,*)' Second update: ',sum(x)
 endif
 sync all

 !CORRECT ANSWER: each image sums locally, image 1 gets the scalars
 x=0.d0
 x(startrow:endrow)=real(this_image(),8)
 val=sum(x)
 sync all
 if(this_image().eq.1)then
  do i=2,num_images()
   val=val+val[i]
  enddo
  write(*,*)' Correct value: ',val
 endif
 sync all

end program

    Here is the output for neq=1000000:

    *With all images on the same node:

     Size of the array: 1000000
     Number of images : 4

      First update :    2500000.00000000     
      Second update:    2500000.00000000     
      Correct value:    2500000.00000000

    *With each image on a different node:

     Size of the array: 1000000
     Number of images : 4

      First update :    750000.000000000     
      Second update:    2500000.00000000     
      Correct value:    2500000.00000000 

    And here is the output for neq=25806732:

    *With all images on the same node:

     Size of the array: 25806732
     Number of images : 4

      First update :    64516830.0000000     
      Second update:    64516830.0000000     
      Correct value:    64516830.0000000 

    *With each image on a different node:

     Size of the array: 25806732
     Number of images : 4

      First update :    19355049.0000000     
      Second update:    6451727.00000000     
      Correct value:    64516830.0000000     
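The "Correct value" lines can be cross-checked by hand: each image i fills its block of neq/num_images() elements with the value i, so the total is the block length times 1+2+...+num_images(). The following serial sketch does that arithmetic. Its program and variable names are illustrative only, and it assumes neq divides evenly by the number of images, which holds for both sizes used here.

```fortran
program expected_value
  implicit none
  integer, parameter :: nimg = 4                           ! images used in the runs above
  integer, parameter :: sizes(2) = [1000000, 25806732]     ! the two values of neq
  integer :: n, img, nrows
  real(kind=8) :: total

  do n = 1, size(sizes)
    nrows = sizes(n) / nimg          ! elements owned by each image
    total = 0.d0
    do img = 1, nimg
      ! Image img contributes nrows elements, each equal to img.
      total = total + real(nrows, 8) * real(img, 8)
    end do
    write(*,'(a,i0,a,f0.1)') ' neq=', sizes(n), '  expected sum=', total
  end do
end program expected_value
```

For 4 images this gives 250000*10 = 2500000 and 6451683*10 = 64516830, which matches the "Correct value" lines in every run.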

     

    Thank you in advance for your help.

    Jeremie

     

     

     

     

     

    Michael_S_17

    Hi,
    I tested your program successfully with gfortran 8.0.1 (experimental version) and OpenCoarrays 2.0.0 on a shared-memory laptop. The results are:

     Size of the array: 1000000
     Number of images : 4
    
      First update :    2500000.0000000000     
      Second update:    2500000.0000000000     
      Correct value:    2500000.0000000000    

    and

     Size of the array: 25806732
     Number of images : 4
    
      First update :    64516830.000000000     
      Second update:    64516830.000000000     
      Correct value:    64516830.000000000     
    

    From this, I would say your program seems to be correct, so it could be a compiler bug. I would also ask, though: what values do the this_image() and num_images() intrinsics return when each image of your program runs on a different compute node?

    Jeremie_V_

    Thank you Michael S. for your tests.

    Regarding this_image() and num_images() on different compute nodes, both intrinsics give the expected values: num_images() returns 4 on all nodes, and this_image() returns the index of the image (from 1 to 4). I have checked this.
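A check of this kind needs only a few lines. The sketch below is an assumption about what such a test could look like, not the program actually used here. Reading the HOSTNAME environment variable is also an assumption, since it is not set on every system.

```fortran
program image_check
  implicit none
  character(len=64) :: host
  integer :: stat

  ! HOSTNAME is an assumption: many shells set it, but it may not be exported.
  call get_environment_variable('HOSTNAME', host, status=stat)
  if (stat > 0) host = '(unknown)'

  ! Every image reports its own index, the image count, and its node.
  write(*,'(a,i0,a,i0,a,a)') ' image ', this_image(), ' of ', num_images(), &
                             ' on ', trim(host)
end program image_check
```

Run with one image per node, each line should show a distinct image index from 1 to num_images(), and the node names should differ if HOSTNAME is available.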

    I will install OpenCoarrays and run my test program on our HPC cluster before reporting a potential bug.

    Michael_S_17

    If you want to install OpenCoarrays on a cluster (installing it is not strictly required), you may need a simple 'trick', which is described here:
    https://groups.google.com/forum/#!topic/opencoarrays/sdUECeRNJo8
    In case you need help with the installation, feel free to ask at the OpenCoarrays forum: https://groups.google.com/forum/#!forum/opencoarrays/join

    cheers

    Jeremie_V_

    I installed OpenCoarrays using gcc 7.1.0 with the trick from your link (Thank you Michael S. for the trick!), compiled my program, and tested it on the HPC.

    I assigned one image per node, and got the correct result!

     Size of the array: 25806732
     Number of images : 4

      First update :    64516830.000000000     
      Second update:    64516830.000000000     
      Correct value:    64516830.000000000    

    So it really does seem to be a bug in the Intel compiler 17.0.0!

    Thank you Michael S. for your help!

    Jeremie
