Another update: the code

Christopher_H_4 · ‎09-23-2018

I have noticed severe slow-downs using compiled coarray Fortran code with Parallel Studio XE 2018 versus Parallel Studio XE 2015.

I wrote a minimal working example to demonstrate the issue, caf_test.f90, attached. The program generates a vector of pseudo-random numbers of length K.

program caf_test
  use omp_lib

  IMPLICIT NONE
  integer :: nsim
  INTEGER, PARAMETER :: K=29  ! size of array to be populated
  INTEGER :: nthreads
  INTEGER, DIMENSION(K) :: image_ind
  INTEGER :: mrem, mchunk
  INTEGER :: tid, offset
  INTEGER :: i_start, i_end
  INTEGER, ALLOCATABLE, DIMENSION(:,:) :: seq
  INTEGER :: ii, incr, n, jj, isim
  REAL(kind=8), DIMENSION(K), CODIMENSION :: fill_in  ! only shared object
  INTEGER, ALLOCATABLE, DIMENSION(:) :: seed
  real(kind=8) :: tstart, tfinish
  logical :: display=.true.

  nsim = 10000

  if (display) then
    write(*,*) "Hello from image ", this_image(), "out of ", num_images()," total images"
  end if
  sync all



  nthreads = num_images()
  tid      = this_image()

  if (display.and.tid.eq.1) print *, ''

  call random_seed(size=n)
  IF (ALLOCATED(seed)) DEALLOCATE(seed)
  ALLOCATE(seed(n))
  seed = 0

  seed = 19840317*this_image() ! each image gets own seed
  call random_seed(put=seed)


  IF (ALLOCATED(seq)) DEALLOCATE(seq)
  ALLOCATE(seq(nthreads,2))

  mrem   = mod(K,nthreads)
  mchunk = (K-mrem)/nthreads

  ! create local record of image-specific indices in 'seq'
  create_seq: do ii=1,nthreads
    IF (ii-1.LT.mrem) THEN
      incr      = mchunk + 1
      offset    = 1
      seq(ii,1) = (ii-1)*incr + offset
      seq(ii,2) = seq(ii,1) + incr -1
    ELSE  ! last image does less work
      incr      = mchunk
      offset    = mrem + 1
      seq(ii,1) = (ii-1)*incr + offset
      seq(ii,2) = seq(ii,1) + incr -1
    END IF 

    if (tid.eq.1.and.display) write (*,*) 'Image ', ii, 'handles range', seq(ii,:)
  end do create_seq

  ! indices specific to each image
  if (tid-1.LT.mrem) then
    incr   = mchunk + 1
    offset = 1
  else
    incr   = mchunk
    offset = mrem + 1
  end if
  i_start  = (tid-1)*incr + offset
  i_end    = i_start + incr - 1

  sync all

  tstart = omp_get_wtime()
  busy_work: do isim=1,nsim
    fill_in = 0.0  ! 0.0 on all images
    call random_number(fill_in(i_start:i_end)) ! non-overlapping areas across images where array is populated
  
    ! synchronize coarray across images
    sync all  ! necessary
    talk: do ii=1,nthreads
      fill_in(seq(ii,1):seq(ii,2)) = fill_in(seq(ii,1):seq(ii,2))[ii] ! this is the only data transfer statement
    end do talk
    sync all

  end do busy_work

  tfinish = omp_get_wtime()

  if (display.and.tid.eq.1) then
    write (*,*), ''
    write (*,*), '----------------------------------'
    write (*,*), 'Coarray synchronized across images in', tfinish-tstart, 'seconds'
    write (*,*), '----------------------------------'
  end if

end program caf_test

Lines 49 to 60, an array is created that is common across images. The array assigns images a slice of the vector of random numbers
Lines 65 to 74, each image is given image-specific variables that specifies its slice of the vector of random numbers.
Lines 79 to 90, each image generates random numbers from a separate seed, and then populates its section of the array with random numbers.
- Lines 79 to 90 are repeated "nsim" times
The total time is reported at line 97.

I wrote a bash script to compile (and declare the appropriate environmental variables) using the 2015 and 2018 compilers:

#!/bin/bash

echo "Using 2015 Intel Compiler..."
tmpMKLROOT=/opt/intel/composer_xe_2015.3.187/mkl
source /opt/intel/composer_xe_2015.3.187/mkl/bin/intel64/mklvars_intel64.sh
source /opt/intel/parallel_studio_xe_2015/psxevars.sh intel64 &> /dev/null


/opt/intel/composer_xe_2015.3.187/bin/intel64/ifort -I$tmpMKLROOT/include -coarray -coarray-num-images=5 \
	caf_test.f90 -L$tmpMKLROOT/lib/intel64 -lmkl_rt -lpthread -lm -liomp5 -o caf_test.exe

./caf_test.exe


echo "Using 2018 Intel Compiler..."
tmpMKLROOT=/opt/intel/compilers_and_libraries_2018.3.222/linux/mkl
source /opt/intel/parallel_studio_xe_2018/psxevars.sh intel64 &> /dev/null
source /opt/intel/compilers_and_libraries/linux/bin/compilervars.sh intel64

/opt/intel/compilers_and_libraries_2018.3.222/linux/bin/intel64/ifort -I$tmpMKLROOT/include -coarray -coarray-num-images=5 \
	caf_test.f90 -L$tmpMKLROOT/lib/intel64 -lmkl_rt -lpthread -lm -liomp5 -o caf_test.exe

./caf_test.exe

With five images, the code takes about 4.5 times as long to run using the 2018 compiler.

What am I doing wrong here? I am curious why there would be such performance differences across compilers.

Thanks!

Chris

Christopher_H_4 · ‎09-23-2018

Quick comment: you can play around with "K" and "nsim". I have previously set "nsim" to 1 and "K" to 29000000, and I have similar issues with image communication.

Christopher_H_4 · ‎09-24-2018

Another update: the code compiled with the 2019 package performs similarly to the 2018 package. That is, the 2015 compiler still produces faster coarray code.

Michael_S_17 · ‎09-24-2018

Hi,
I did test your program using OpenCoarrays/gfortran on a laptop computer, using 5 coarray images. (I did replace the references to omp_get_wtime() by calls to the cpu_time intrinsic).
The program causes a parallel slowdown with the last two SYNC ALL statements, each getting executed a thousand times in the outer do loop. On my computer each statement requires approximately 150 seconds to complete this. The data transfers through the fill_in coarray in the inner loop (getting executed 5000 times), on the other hand, does not seem to produce any parallel slow down at all.
Therefore, with your example program I would rather suspect the SYNC ALL statement as a possible cause for performance differences across compilers (assuming that everything else is exactly the same).

Best Regards

Christopher_H_4 · ‎09-26-2018

Thanks Michael, I agree that the "sync all" statements are a definite source of the slowdown!

Although just to be sure, I edited the program to remove the two sync statements. To show that the "sync all" statements are necessary, I wrote a loop from lines 117 to 124 where I check that the coarray is the same across images (for a subset of coarray elements).

The 2019 compiler is slower even without the "sync all" statements, and the "sync all" statements are indeed necessary.

So, is there something wrong with my implementation of coarrays, or has this feature of the Intel Compiler gotten slower over time? I would be surprised if the latter were true.

Chris

Edited f90 code, bash shell script, and output below (and attached in .tar.gz)

Coarray fortran program:

program caf_test

  IMPLICIT NONE
  integer :: nsim
  INTEGER, PARAMETER :: K=2900  ! size of array to be populated
  INTEGER :: nthreads
  INTEGER, DIMENSION(K) :: image_ind
  INTEGER :: mrem, mchunk
  INTEGER :: tid, offset
  INTEGER :: i_start, i_end
  INTEGER, ALLOCATABLE, DIMENSION(:,:) :: seq
  INTEGER :: ii, incr, n, jj, isim, kk
  REAL(kind=8), DIMENSION(K), CODIMENSION :: fill_in  ! only shared object
  INTEGER, ALLOCATABLE, DIMENSION(:) :: seed
  real(kind=8) :: tstart, tfinish
  integer :: ca_cnt
  logical :: ca_diff
  logical :: display=.true.
  logical :: keep_sync
  character(LEN=32) :: cmdline

  nthreads = num_images()
  tid      = this_image()

  call get_command_argument(1,cmdline)

  if (trim(cmdline).EQ.'sync') then
    keep_sync = .true.
    if (tid.eq.1) write (*,'(A20)') ''
    if (tid.eq.1) write (*,'(A32)') '("sync all" statements in place)'
    if (tid.eq.1) write (*,'(A20)') ''
  else
    keep_sync = .false.
    if (tid.eq.1) write (*,'(A20)') ''
    if (tid.eq.1) write (*,'(A31)') '("sync all" statements removed)'
    if (tid.eq.1) write (*,'(A20)') ''
  end if


  nsim = 100

  if (display) then
    write(*,*) "Hello from image ", this_image(), "out of ", num_images()," total images"
  end if
  sync all



  if (display.and.tid.eq.1) print *, ''

  call random_seed(size=n)
  IF (ALLOCATED(seed)) DEALLOCATE(seed)
  ALLOCATE(seed(n))
  seed = 0

  seed = 19840317*this_image() ! each image gets own seed
  call random_seed(put=seed)


  IF (ALLOCATED(seq)) DEALLOCATE(seq)
  ALLOCATE(seq(nthreads,2))

  mrem   = mod(K,nthreads)
  mchunk = (K-mrem)/nthreads

  ! create local record of image-specific indices in 'seq'
  create_seq: do ii=1,nthreads
    IF (ii-1.LT.mrem) THEN
      incr      = mchunk + 1
      offset    = 1
      seq(ii,1) = (ii-1)*incr + offset
      seq(ii,2) = seq(ii,1) + incr -1
    ELSE  ! last image does less work
      incr      = mchunk
      offset    = mrem + 1
      seq(ii,1) = (ii-1)*incr + offset
      seq(ii,2) = seq(ii,1) + incr -1
    END IF 

    if (tid.eq.1.and.display) write (*,*) 'Image ', ii, 'handles range', seq(ii,:)
  end do create_seq

  ! indices specific to each image
  if (tid-1.LT.mrem) then
    incr   = mchunk + 1
    offset = 1
  else
    incr   = mchunk
    offset = mrem + 1
  end if

  i_start  = (tid-1)*incr + offset
  i_end    = i_start + incr - 1

  sync all

  call cpu_time(tstart)
  busy_work: do isim=1,nsim
    fill_in = 0.0  ! 0.0 on all images
    call random_number(fill_in(i_start:i_end)) ! non-overlapping areas across images where array is populated
  
    if (keep_sync)  sync all  ! necessary

    ! synchronize coarray across images
    talk: do ii=1,nthreads
      fill_in(seq(ii,1):seq(ii,2)) = fill_in(seq(ii,1):seq(ii,2))[ii] ! this is the only data transfer statement
    end do talk

  end do busy_work

  if (keep_sync)  sync all  ! necessary


  call cpu_time(tfinish)
  ! detect differences across coarray copies
  ca_cnt = 0
  do ii=1,nthreads
    do jj=1,nthreads
      do kk=1,nthreads
        ca_diff = (abs(fill_in(seq(kk,1))[ii]-fill_in(seq(kk,1))[jj].gt.1d-8))
        if (ca_diff) ca_cnt = ca_cnt+1
      end do
    end do
  end do
  ca_diff = (ca_cnt.eq.0)

  if (display.and.tid.eq.1) then
    write (*,*), ''
    write (*,*), '----------------------------------'
    write (*,*), 'Coarray synchronized across images in', tfinish-tstart, 'seconds'
    write (*,*), '----------------------------------'
    write (*,*), ''
    write (*,'(A37,L)'), 'Coarrays are the same across images:', ca_diff
    write (*,*), ''
    write (*,*) ''
    write (*,*) '============================================'
    write (*,*) ''
    write (*,*) ''
  end if

end program caf_test

Bash script to compile:

#!/bin/bash

tmpMKLROOT=/opt/intel/composer_xe_2015.3.187/mkl
source /opt/intel/composer_xe_2015.3.187/mkl/bin/intel64/mklvars_intel64.sh
source /opt/intel/parallel_studio_xe_2015/psxevars.sh intel64 &> /dev/null


/opt/intel/composer_xe_2015.3.187/bin/intel64/ifort -I$tmpMKLROOT/include -coarray -coarray-num-images=5 \
	caf_test.f90 -L$tmpMKLROOT/lib/intel64 -lmkl_rt -lpthread -lm -liomp5 -o caf_test.exe

echo "Using 2015 Intel Compiler, sync statements in place..."
./caf_test.exe sync

echo "Using 2015 Intel Compiler, sync statements removed..."
./caf_test.exe


tmpMKLROOT=/opt/intel/compilers_and_libraries_2018.3.222/linux/mkl
source /opt/intel/parallel_studio_xe_2018/psxevars.sh intel64 &> /dev/null
source /opt/intel/compilers_and_libraries/linux/bin/compilervars.sh intel64

/opt/intel/compilers_and_libraries_2018.3.222/linux/bin/intel64/ifort -I$tmpMKLROOT/include -coarray -coarray-num-images=5 \
	caf_test.f90 -L$tmpMKLROOT/lib/intel64 -lmkl_rt -lpthread -lm -liomp5 -o caf_test.exe

echo "Using 2019 Intel Compiler, sync statements in place..."
./caf_test.exe sync

echo "Using 2019 Intel Compiler, sync statements removed..."
./caf_test.exe

Output:

Using 2015 Intel Compiler, sync statements in place...
                    
("sync all" statements in place)
                    
 Hello from image            1 out of            5  total images
 Hello from image            2 out of            5  total images
 Hello from image            5 out of            5  total images
 Hello from image            4 out of            5  total images
 Hello from image            3 out of            5  total images
 
 Image            1 handles range           1         580
 Image            2 handles range         581        1160
 Image            3 handles range        1161        1740
 Image            4 handles range        1741        2320
 Image            5 handles range        2321        2900
 
 ----------------------------------
 Coarray synchronized across images in  5.635999999999999E-003 seconds
 ----------------------------------
 
 Coarrays are the same across images: T
 
 
 ============================================
 
 
Using 2015 Intel Compiler, sync statements removed...
                    
("sync all" statements removed)
                    
 Hello from image            1 out of            5  total images
 Hello from image            4 out of            5  total images
 Hello from image            5 out of            5  total images
 Hello from image            2 out of            5  total images
 Hello from image            3 out of            5  total images
 
 Image            1 handles range           1         580
 Image            2 handles range         581        1160
 Image            3 handles range        1161        1740
 Image            4 handles range        1741        2320
 Image            5 handles range        2321        2900
 
 ----------------------------------
 Coarray synchronized across images in  4.562000000000000E-003 seconds
 ----------------------------------
 
 Coarrays are the same across images: F
 
 
 ============================================
 
 
Using 2019 Intel Compiler, sync statements in place...
                    
("sync all" statements in place)
                    
 Hello from image            1 out of            5  total images
 Hello from image            5 out of            5  total images
 Hello from image            4 out of            5  total images
 Hello from image            3 out of            5  total images
 Hello from image            2 out of            5  total images
 
 Image            1 handles range           1         580
 Image            2 handles range         581        1160
 Image            3 handles range        1161        1740
 Image            4 handles range        1741        2320
 Image            5 handles range        2321        2900
 
 ----------------------------------
 Coarray synchronized across images in   13.4575210000000      seconds
 ----------------------------------
 
 Coarrays are the same across images: T
 
 
 ============================================
 
 
Using 2019 Intel Compiler, sync statements removed...
                    
("sync all" statements removed)
                    
 Hello from image            1 out of            5  total images
 Hello from image            4 out of            5  total images
 Hello from image            3 out of            5  total images
 Hello from image            5 out of            5  total images
 Hello from image            2 out of            5  total images
 
 Image            1 handles range           1         580
 Image            2 handles range         581        1160
 Image            3 handles range        1161        1740
 Image            4 handles range        1741        2320
 Image            5 handles range        2321        2900
 
 ----------------------------------
 Coarray synchronized across images in   2.42606500000000      seconds
 ----------------------------------
 
 Coarrays are the same across images: F
 
 
 ============================================

Corrays in 2015 vs. 2018 Compiler