Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Corrays in 2015 vs. 2018 Compiler

Christopher_H_4
Beginner
372 Views

I have noticed severe slow-downs using compiled coarray Fortran code with Parallel Studio XE 2018 versus Parallel Studio XE 2015.

I wrote a minimal working example to demonstrate the issue, caf_test.f90, attached. The program generates a vector of pseudo-random numbers of length K.

program caf_test
  use omp_lib

  IMPLICIT NONE
  integer :: nsim
  INTEGER, PARAMETER :: K=29  ! size of array to be populated
  INTEGER :: nthreads
  INTEGER, DIMENSION(K) :: image_ind
  INTEGER :: mrem, mchunk
  INTEGER :: tid, offset
  INTEGER :: i_start, i_end
  INTEGER, ALLOCATABLE, DIMENSION(:,:) :: seq
  INTEGER :: ii, incr, n, jj, isim
  REAL(kind=8), DIMENSION(K), CODIMENSION
  • :: fill_in ! only shared object INTEGER, ALLOCATABLE, DIMENSION(:) :: seed real(kind=8) :: tstart, tfinish logical :: display=.true. nsim = 10000 if (display) then write(*,*) "Hello from image ", this_image(), "out of ", num_images()," total images" end if sync all nthreads = num_images() tid = this_image() if (display.and.tid.eq.1) print *, '' call random_seed(size=n) IF (ALLOCATED(seed)) DEALLOCATE(seed) ALLOCATE(seed(n)) seed = 0 seed = 19840317*this_image() ! each image gets own seed call random_seed(put=seed) IF (ALLOCATED(seq)) DEALLOCATE(seq) ALLOCATE(seq(nthreads,2)) mrem = mod(K,nthreads) mchunk = (K-mrem)/nthreads ! create local record of image-specific indices in 'seq' create_seq: do ii=1,nthreads IF (ii-1.LT.mrem) THEN incr = mchunk + 1 offset = 1 seq(ii,1) = (ii-1)*incr + offset seq(ii,2) = seq(ii,1) + incr -1 ELSE ! last image does less work incr = mchunk offset = mrem + 1 seq(ii,1) = (ii-1)*incr + offset seq(ii,2) = seq(ii,1) + incr -1 END IF if (tid.eq.1.and.display) write (*,*) 'Image ', ii, 'handles range', seq(ii,:) end do create_seq ! indices specific to each image if (tid-1.LT.mrem) then incr = mchunk + 1 offset = 1 else incr = mchunk offset = mrem + 1 end if i_start = (tid-1)*incr + offset i_end = i_start + incr - 1 sync all tstart = omp_get_wtime() busy_work: do isim=1,nsim fill_in = 0.0 ! 0.0 on all images call random_number(fill_in(i_start:i_end)) ! non-overlapping areas across images where array is populated ! synchronize coarray across images sync all ! necessary talk: do ii=1,nthreads fill_in(seq(ii,1):seq(ii,2)) = fill_in(seq(ii,1):seq(ii,2))[ii] ! this is the only data transfer statement end do talk sync all end do busy_work tfinish = omp_get_wtime() if (display.and.tid.eq.1) then write (*,*), '' write (*,*), '----------------------------------' write (*,*), 'Coarray synchronized across images in', tfinish-tstart, 'seconds' write (*,*), '----------------------------------' end if end program caf_test
    • Lines 49 to 60, an array is created that is common across images. The array assigns images a slice of the vector of random numbers
    • Lines 65 to 74, each image is given image-specific variables that specifies its slice of the vector of random numbers.
    • Lines 79 to 90, each image generates random numbers from a separate seed, and then populates its section of the array with random numbers.
      • Lines 79 to 90 are repeated "nsim" times
    • The total time is reported at line 97.

    I wrote a bash script to compile (and declare the appropriate environmental variables) using the 2015 and 2018 compilers:

    #!/bin/bash
    
    echo "Using 2015 Intel Compiler..."
    tmpMKLROOT=/opt/intel/composer_xe_2015.3.187/mkl
    source /opt/intel/composer_xe_2015.3.187/mkl/bin/intel64/mklvars_intel64.sh
    source /opt/intel/parallel_studio_xe_2015/psxevars.sh intel64 &> /dev/null
    
    
    /opt/intel/composer_xe_2015.3.187/bin/intel64/ifort -I$tmpMKLROOT/include -coarray -coarray-num-images=5 \
    	caf_test.f90 -L$tmpMKLROOT/lib/intel64 -lmkl_rt -lpthread -lm -liomp5 -o caf_test.exe
    
    ./caf_test.exe
    
    
    echo "Using 2018 Intel Compiler..."
    tmpMKLROOT=/opt/intel/compilers_and_libraries_2018.3.222/linux/mkl
    source /opt/intel/parallel_studio_xe_2018/psxevars.sh intel64 &> /dev/null
    source /opt/intel/compilers_and_libraries/linux/bin/compilervars.sh intel64
    
    /opt/intel/compilers_and_libraries_2018.3.222/linux/bin/intel64/ifort -I$tmpMKLROOT/include -coarray -coarray-num-images=5 \
    	caf_test.f90 -L$tmpMKLROOT/lib/intel64 -lmkl_rt -lpthread -lm -liomp5 -o caf_test.exe
    
    ./caf_test.exe
    

    With five images, the code takes about 4.5 times as long to run using the 2018 compiler.

    What am I doing wrong here? I am curious why there would be such performance differences across compilers.

    Thanks!

    Chris

    0 Kudos
    4 Replies
    Christopher_H_4
    Beginner
    372 Views

    Quick comment: you can play around with "K" and "nsim". I have previously set "nsim" to 1 and "K" to 29000000, and I have similar issues with image communication.

    0 Kudos
    Christopher_H_4
    Beginner
    372 Views

    Another update: the code compiled with the 2019 package performs similarly to the 2018 package. That is, the 2015 compiler still produces faster coarray code.

    0 Kudos
    Michael_S_17
    New Contributor I
    372 Views

    Hi,
    I did test your program using OpenCoarrays/gfortran on a laptop computer, using 5 coarray images. (I did replace the references to omp_get_wtime() by calls to the cpu_time intrinsic).
    The program causes a parallel slowdown with the last two SYNC ALL statements, each getting executed a thousand times in the outer do loop. On my computer each statement requires approximately 150 seconds to complete this. The data transfers through the fill_in coarray in the inner loop (getting executed 5000 times), on the other hand, does not seem to produce any parallel slow down at all.
    Therefore, with your example program I would rather suspect the SYNC ALL statement as a possible cause for performance differences across compilers (assuming that everything else is exactly the same).

    Best Regards

    0 Kudos
    Christopher_H_4
    Beginner
    372 Views

    Thanks Michael, I agree that the "sync all" statements are a definite source of the slowdown!

    Although just to be sure, I edited the program to remove the two sync statements. To show that the "sync all" statements are necessary, I wrote a loop from lines 117 to 124 where I check that the coarray is the same across images (for a subset of coarray elements).

    The 2019 compiler is slower even without the "sync all" statements, and the "sync all" statements are indeed necessary.

    So, is there something wrong with my implementation of coarrays, or has this feature of the Intel Compiler gotten slower over time? I would be surprised if the latter were true.

    Chris

     

    Edited f90 code, bash shell script, and output below (and attached in .tar.gz)

    Coarray fortran program:

    program caf_test
    
      IMPLICIT NONE
      integer :: nsim
      INTEGER, PARAMETER :: K=2900  ! size of array to be populated
      INTEGER :: nthreads
      INTEGER, DIMENSION(K) :: image_ind
      INTEGER :: mrem, mchunk
      INTEGER :: tid, offset
      INTEGER :: i_start, i_end
      INTEGER, ALLOCATABLE, DIMENSION(:,:) :: seq
      INTEGER :: ii, incr, n, jj, isim, kk
      REAL(kind=8), DIMENSION(K), CODIMENSION
  • :: fill_in ! only shared object INTEGER, ALLOCATABLE, DIMENSION(:) :: seed real(kind=8) :: tstart, tfinish integer :: ca_cnt logical :: ca_diff logical :: display=.true. logical :: keep_sync character(LEN=32) :: cmdline nthreads = num_images() tid = this_image() call get_command_argument(1,cmdline) if (trim(cmdline).EQ.'sync') then keep_sync = .true. if (tid.eq.1) write (*,'(A20)') '' if (tid.eq.1) write (*,'(A32)') '("sync all" statements in place)' if (tid.eq.1) write (*,'(A20)') '' else keep_sync = .false. if (tid.eq.1) write (*,'(A20)') '' if (tid.eq.1) write (*,'(A31)') '("sync all" statements removed)' if (tid.eq.1) write (*,'(A20)') '' end if nsim = 100 if (display) then write(*,*) "Hello from image ", this_image(), "out of ", num_images()," total images" end if sync all if (display.and.tid.eq.1) print *, '' call random_seed(size=n) IF (ALLOCATED(seed)) DEALLOCATE(seed) ALLOCATE(seed(n)) seed = 0 seed = 19840317*this_image() ! each image gets own seed call random_seed(put=seed) IF (ALLOCATED(seq)) DEALLOCATE(seq) ALLOCATE(seq(nthreads,2)) mrem = mod(K,nthreads) mchunk = (K-mrem)/nthreads ! create local record of image-specific indices in 'seq' create_seq: do ii=1,nthreads IF (ii-1.LT.mrem) THEN incr = mchunk + 1 offset = 1 seq(ii,1) = (ii-1)*incr + offset seq(ii,2) = seq(ii,1) + incr -1 ELSE ! last image does less work incr = mchunk offset = mrem + 1 seq(ii,1) = (ii-1)*incr + offset seq(ii,2) = seq(ii,1) + incr -1 END IF if (tid.eq.1.and.display) write (*,*) 'Image ', ii, 'handles range', seq(ii,:) end do create_seq ! indices specific to each image if (tid-1.LT.mrem) then incr = mchunk + 1 offset = 1 else incr = mchunk offset = mrem + 1 end if i_start = (tid-1)*incr + offset i_end = i_start + incr - 1 sync all call cpu_time(tstart) busy_work: do isim=1,nsim fill_in = 0.0 ! 0.0 on all images call random_number(fill_in(i_start:i_end)) ! non-overlapping areas across images where array is populated if (keep_sync) sync all ! necessary ! synchronize coarray across images talk: do ii=1,nthreads fill_in(seq(ii,1):seq(ii,2)) = fill_in(seq(ii,1):seq(ii,2))[ii] ! this is the only data transfer statement end do talk end do busy_work if (keep_sync) sync all ! necessary call cpu_time(tfinish) ! detect differences across coarray copies ca_cnt = 0 do ii=1,nthreads do jj=1,nthreads do kk=1,nthreads ca_diff = (abs(fill_in(seq(kk,1))[ii]-fill_in(seq(kk,1))[jj].gt.1d-8)) if (ca_diff) ca_cnt = ca_cnt+1 end do end do end do ca_diff = (ca_cnt.eq.0) if (display.and.tid.eq.1) then write (*,*), '' write (*,*), '----------------------------------' write (*,*), 'Coarray synchronized across images in', tfinish-tstart, 'seconds' write (*,*), '----------------------------------' write (*,*), '' write (*,'(A37,L)'), 'Coarrays are the same across images:', ca_diff write (*,*), '' write (*,*) '' write (*,*) '============================================' write (*,*) '' write (*,*) '' end if end program caf_test
  • Bash script to compile:
    
    #!/bin/bash
    
    tmpMKLROOT=/opt/intel/composer_xe_2015.3.187/mkl
    source /opt/intel/composer_xe_2015.3.187/mkl/bin/intel64/mklvars_intel64.sh
    source /opt/intel/parallel_studio_xe_2015/psxevars.sh intel64 &> /dev/null
    
    
    /opt/intel/composer_xe_2015.3.187/bin/intel64/ifort -I$tmpMKLROOT/include -coarray -coarray-num-images=5 \
    	caf_test.f90 -L$tmpMKLROOT/lib/intel64 -lmkl_rt -lpthread -lm -liomp5 -o caf_test.exe
    
    echo "Using 2015 Intel Compiler, sync statements in place..."
    ./caf_test.exe sync
    
    echo "Using 2015 Intel Compiler, sync statements removed..."
    ./caf_test.exe
    
    
    tmpMKLROOT=/opt/intel/compilers_and_libraries_2018.3.222/linux/mkl
    source /opt/intel/parallel_studio_xe_2018/psxevars.sh intel64 &> /dev/null
    source /opt/intel/compilers_and_libraries/linux/bin/compilervars.sh intel64
    
    /opt/intel/compilers_and_libraries_2018.3.222/linux/bin/intel64/ifort -I$tmpMKLROOT/include -coarray -coarray-num-images=5 \
    	caf_test.f90 -L$tmpMKLROOT/lib/intel64 -lmkl_rt -lpthread -lm -liomp5 -o caf_test.exe
    
    echo "Using 2019 Intel Compiler, sync statements in place..."
    ./caf_test.exe sync
    
    echo "Using 2019 Intel Compiler, sync statements removed..."
    ./caf_test.exe
    

    Output:

    Using 2015 Intel Compiler, sync statements in place...
                        
    ("sync all" statements in place)
                        
     Hello from image            1 out of            5  total images
     Hello from image            2 out of            5  total images
     Hello from image            5 out of            5  total images
     Hello from image            4 out of            5  total images
     Hello from image            3 out of            5  total images
     
     Image            1 handles range           1         580
     Image            2 handles range         581        1160
     Image            3 handles range        1161        1740
     Image            4 handles range        1741        2320
     Image            5 handles range        2321        2900
     
     ----------------------------------
     Coarray synchronized across images in  5.635999999999999E-003 seconds
     ----------------------------------
     
     Coarrays are the same across images: T
     
     
     ============================================
     
     
    Using 2015 Intel Compiler, sync statements removed...
                        
    ("sync all" statements removed)
                        
     Hello from image            1 out of            5  total images
     Hello from image            4 out of            5  total images
     Hello from image            5 out of            5  total images
     Hello from image            2 out of            5  total images
     Hello from image            3 out of            5  total images
     
     Image            1 handles range           1         580
     Image            2 handles range         581        1160
     Image            3 handles range        1161        1740
     Image            4 handles range        1741        2320
     Image            5 handles range        2321        2900
     
     ----------------------------------
     Coarray synchronized across images in  4.562000000000000E-003 seconds
     ----------------------------------
     
     Coarrays are the same across images: F
     
     
     ============================================
     
     
    Using 2019 Intel Compiler, sync statements in place...
                        
    ("sync all" statements in place)
                        
     Hello from image            1 out of            5  total images
     Hello from image            5 out of            5  total images
     Hello from image            4 out of            5  total images
     Hello from image            3 out of            5  total images
     Hello from image            2 out of            5  total images
     
     Image            1 handles range           1         580
     Image            2 handles range         581        1160
     Image            3 handles range        1161        1740
     Image            4 handles range        1741        2320
     Image            5 handles range        2321        2900
     
     ----------------------------------
     Coarray synchronized across images in   13.4575210000000      seconds
     ----------------------------------
     
     Coarrays are the same across images: T
     
     
     ============================================
     
     
    Using 2019 Intel Compiler, sync statements removed...
                        
    ("sync all" statements removed)
                        
     Hello from image            1 out of            5  total images
     Hello from image            4 out of            5  total images
     Hello from image            3 out of            5  total images
     Hello from image            5 out of            5  total images
     Hello from image            2 out of            5  total images
     
     Image            1 handles range           1         580
     Image            2 handles range         581        1160
     Image            3 handles range        1161        1740
     Image            4 handles range        1741        2320
     Image            5 handles range        2321        2900
     
     ----------------------------------
     Coarray synchronized across images in   2.42606500000000      seconds
     ----------------------------------
     
     Coarrays are the same across images: F
     
     
     ============================================

     

    0 Kudos
    Reply