Intel® Fortran Compiler

Distributed Coarray Fortran: misunderstanding/bug?

Jeremie_V_
Beginner
188 Views

Dear all,

Below is a Coarray Fortran program that gives me some trouble:

A large vector (x) is updated in two different ways. For large sizes of x, the update of x is wrong when each image is on a different node. When all images are on the same node, the results are always correct, whatever the size of x.

When size(x)=10^6, exchanging the full array across images on different nodes gives wrong results. However, exchanging small subsets of x gives correct results.

When size(x)>2*10^7, exchanging the full array across images on different nodes gives wrong results, and exchanging subsets of x (size(subset) > 6*10^6) gives wrong results too.

My troubles seem to be linked to the size of the array that is exchanged across images on different nodes. So, am I doing something wrong? Could it be a bug?
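Since small subsets transfer correctly, one possible workaround is to pull the remote data in fixed-size chunks rather than in a single assignment. The following is only a minimal sketch of that idea: the program name, the variable names, and the chunk size of 100000 elements are assumptions chosen for illustration (not a documented limit), and it has not been verified against the failing configuration.

```fortran
program chunked_get
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  ! Hypothetical chunk size, chosen for illustration only.
  integer, parameter :: chunk = 100000
  integer :: neq, blk, startrow, endrow, img, lo, hi
  real(real64), allocatable :: x(:)[:]

  neq = 1000000

  ! Same block partition as in the test program: one block of rows per image.
  blk = neq / num_images()
  startrow = (this_image() - 1) * blk + 1
  endrow = this_image() * blk
  if (this_image() == num_images()) endrow = neq

  allocate(x(neq)[*])
  x = 0.0_real64
  x(startrow:endrow) = real(this_image(), real64)
  sync all

  if (this_image() == 1) then
    do img = 2, num_images()
      ! Fetch the remote array piece by piece instead of all at once.
      do lo = 1, neq, chunk
        hi = min(lo + chunk - 1, neq)
        x(lo:hi) = x(lo:hi) + x(lo:hi)[img]
      end do
    end do
    write(*,*) ' Chunked update: ', sum(x)
  end if
  sync all
end program chunked_get
```

If chunking restores the expected sum on separate nodes, that would further suggest the problem depends on the size of a single transfer.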

I use ifort 17.0.0 with -coarray=distributed.

Here is a program that reproduces the problem (it may look silly, with too many sync all statements, but it is only meant to replicate my issue):

program testcoarray
 implicit none
 integer(kind=4)::i,j,k,neq
 integer(kind=4)::startrow[*],endrow[*]
 real(kind=8)::val[*]
 real(kind=8),allocatable::x(:)[:]

 !TEST SIZES (the second assignment overrides the first; comment one out)
 neq=1000000
 neq=25806732

 if(this_image().eq.1)then
  write(*,'(/a,i0)')' Size of the array: ',neq
  write(*,'(a,i0/)')' Number of images : ',num_images()
 endif

 !INITIALISATION
 i=neq/num_images()
 startrow=(this_image()-1)*i+1
 endrow=this_image()*i
 if(this_image().eq.num_images())endrow=neq

 allocate(x(neq)[*])
 sync all

 !FIRST UPDATE (image 1 fetches the full array from every other image)
 x=0.d0
 x(startrow:endrow)=real(this_image(),8)
 sync all
 if(this_image().eq.1)then
  do i=2,num_images()
   x=x+x(:)[i]
  enddo
  write(*,*)' First update : ',sum(x)
 endif
 sync all

 !SECOND UPDATE (image 1 fetches only the rows owned by each other image)
 x=0.d0
 x(startrow:endrow)=real(this_image(),8)
 sync all
 if(this_image().eq.1)then
  do i=2,num_images()
   j=startrow[i]
   k=endrow[i]
   x(j:k)=x(j:k)+x(j:k)[i]
  enddo
  write(*,*)' Second update: ',sum(x)
 endif
 sync all

 !CORRECT ANSWER (only the scalar partial sums are exchanged)
 x=0.d0
 x(startrow:endrow)=real(this_image(),8)
 val=sum(x)
 sync all
 if(this_image().eq.1)then
  do i=2,num_images()
   val=val+val[i]
  enddo
  write(*,*)' Correct value: ',val
 endif
 sync all

end program

    And here is the output for neq=1000000:

    *With all images on the same node:

     Size of the array: 1000000
     Number of images : 4

      First update :    2500000.00000000     
      Second update:    2500000.00000000     
      Correct value:    2500000.00000000

    *With each image on a different node:

     Size of the array: 1000000
     Number of images : 4

      First update :    750000.000000000     
      Second update:    2500000.00000000     
      Correct value:    2500000.00000000 

    And here is the output for neq=25806732:

    *With all images on the same node:

     Size of the array: 25806732
     Number of images : 4

      First update :    64516830.0000000     
      Second update:    64516830.0000000     
      Correct value:    64516830.0000000 

    *With each image on a different node:

     Size of the array: 25806732
     Number of images : 4

      First update :    19355049.0000000     
      Second update:    6451727.00000000     
      Correct value:    64516830.0000000     
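The "Correct value" lines can be cross-checked without any coarrays: each image fills its own block of rows with its image number, so the expected total is the sum over images of (image number × block length). Below is a small serial sketch of that arithmetic, using the same block partition as the test program; the module and function names are hypothetical.

```fortran
module expected_mod
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
contains
  ! Expected sum(x) after all images have contributed their block.
  pure function expected_total(neq, nimg) result(total)
    integer(int64), intent(in) :: neq, nimg
    integer(int64) :: total, img, blk, startrow, endrow
    blk = neq / nimg
    total = 0
    do img = 1, nimg
      startrow = (img - 1) * blk + 1
      endrow = img * blk
      if (img == nimg) endrow = neq   ! last image takes any remainder
      total = total + img * (endrow - startrow + 1)
    end do
  end function expected_total
end module expected_mod

program check_expected
  use, intrinsic :: iso_fortran_env, only: int64
  use expected_mod, only: expected_total
  implicit none
  write(*,'(i0)') expected_total(1000000_int64, 4_int64)    ! prints 2500000
  write(*,'(i0)') expected_total(25806732_int64, 4_int64)   ! prints 64516830
end program check_expected
```

Both values agree with the same-node results above, so the totals reported with one image per node are the ones that are off.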

     

    In advance thank you for your help.
     
    Jeremie

     

     

     

     

     

    4 Replies
    Michael_S_17
    Novice

    Hi,
    I tested your program successfully with gfortran 8.0.1 (experimental version) and OpenCoarrays 2.0.0 on a shared-memory laptop. The results are:

     Size of the array: 1000000
     Number of images : 4
    
      First update :    2500000.0000000000     
      Second update:    2500000.0000000000     
      Correct value:    2500000.0000000000    

    and

     Size of the array: 25806732
     Number of images : 4
    
      First update :    64516830.000000000     
      Second update:    64516830.000000000     
      Correct value:    64516830.000000000     
    

    From this, I would say your program seems to be correct, so this could be a compiler bug. But I would also ask: what values do the this_image() and num_images() intrinsics return when each image of your program runs on a different compute node?

    Jeremie_V_
    Beginner

    Thank you Michael S. for your tests.

    Regarding this_image() and num_images() on different compute nodes, both intrinsics give the expected values (i.e., num_images() returns 4 on all nodes, and this_image() returns the ID of the image, from 1 to 4). I tested this successfully.
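A check of this kind can be done with a few lines. The sketch below is an illustration rather than the exact test that was run: each image prints its index, the image count, and a host name. Reading the host from the HOSTNAME environment variable is an assumption, since not every shell or job launcher exports it.

```fortran
program report_images
  implicit none
  character(len=64) :: host
  integer :: img, stat

  ! Assumption: HOSTNAME is exported in the environment of every image.
  call get_environment_variable('HOSTNAME', host, status=stat)
  if (stat > 0) host = '(HOSTNAME not set)'

  ! Take turns printing; output order may still vary with stdout buffering.
  do img = 1, num_images()
    if (this_image() == img) then
      write(*,'(a,i0,a,i0,2a)') ' image ', this_image(), ' of ', &
                                num_images(), ' on ', trim(host)
    end if
    sync all
  end do
end program report_images
```

With one image per node, each line should show a different host name alongside image indices 1 to num_images().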

    I will install OpenCoarrays and run my test program on our HPC before reporting a potential bug.

    Michael_S_17
    Novice

    If you want to use OpenCoarrays on a cluster (installing it is not strictly required), you may need a simple 'trick', as described here:
    https://groups.google.com/forum/#!topic/opencoarrays/sdUECeRNJo8
    In case you need help with the installation, feel free to ask at the OpenCoarrays forum: https://groups.google.com/forum/#!forum/opencoarrays/join

    cheers

    Jeremie_V_
    Beginner

    I installed OpenCoarrays using gcc 7.1.0 with the trick from your link (Thank you Michael S. for the trick!), compiled my program, and tested it on the HPC.

    I assigned one image per node, and got the correct result!

     Size of the array: 25806732
     Number of images : 4

      First update :    64516830.000000000     
      Second update:    64516830.000000000     
      Correct value:    64516830.000000000    

    So, it really seems to be a bug in the Intel compiler 17.0.0!

    Thank you Michael S. for your help!

    Jeremie
