I have noticed severe slowdowns in coarray Fortran code compiled with Parallel Studio XE 2018 compared with Parallel Studio XE 2015.
I wrote a minimal working example to demonstrate the issue, caf_test.f90, attached. The program generates a vector of pseudo-random numbers of length K.
program caf_test
use omp_lib
IMPLICIT NONE
integer :: nsim
INTEGER, PARAMETER :: K=29 ! size of array to be populated
INTEGER :: nthreads
INTEGER, DIMENSION(K) :: image_ind
INTEGER :: mrem, mchunk
INTEGER :: tid, offset
INTEGER :: i_start, i_end
INTEGER, ALLOCATABLE, DIMENSION(:,:) :: seq
INTEGER :: ii, incr, n, jj, isim
REAL(kind=8), DIMENSION(K), CODIMENSION
- Lines 49 to 60: an array common to all images is created. It assigns each image a slice of the vector of random numbers.
- Lines 65 to 74: each image sets image-specific variables that specify its slice of the vector of random numbers.
- Lines 79 to 90: each image generates random numbers from its own seed and populates its section of the array with them.
- Lines 79 to 90 are repeated "nsim" times.
- The total time is reported at line 97.
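The slicing step described in the list above can be sketched as follows. This is a minimal serial illustration, not the code from caf_test.f90: it assumes that mchunk holds the base slice length and mrem the remainder, and it loops over a hard-coded image count (nimg, a hypothetical name) instead of calling this_image() so that it runs without coarray support.

```fortran
program slice_demo
  implicit none
  integer, parameter :: K = 2900   ! vector length (value used in the edited example later in the thread)
  integer, parameter :: nimg = 5   ! number of images, hard-coded here (assumption)
  integer :: mchunk, mrem, img, i_start, i_end

  mchunk = K / nimg       ! base slice length per image
  mrem   = mod(K, nimg)   ! leftover elements, one each for the first mrem images

  do img = 1, nimg
     ! Images before this one hold (img-1)*mchunk elements plus any extras.
     i_start = (img - 1) * mchunk + min(img - 1, mrem) + 1
     i_end   = i_start + mchunk - 1
     if (img <= mrem) i_end = i_end + 1
     print '(a,i0,a,i0,1x,i0)', 'Image ', img, ' handles range ', i_start, i_end
  end do
end program slice_demo
```

With K = 2900 and five images the slices are 1 to 580, 581 to 1160, and so on, which matches the ranges in the program output posted later in the thread. With K = 29 the first four images get six elements each and the fifth gets five.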
I wrote a bash script that sets the appropriate environment variables and compiles the program with the 2015 and 2018 compilers:
#!/bin/bash

echo "Using 2015 Intel Compiler..."
tmpMKLROOT=/opt/intel/composer_xe_2015.3.187/mkl
source /opt/intel/composer_xe_2015.3.187/mkl/bin/intel64/mklvars_intel64.sh
source /opt/intel/parallel_studio_xe_2015/psxevars.sh intel64 &> /dev/null
/opt/intel/composer_xe_2015.3.187/bin/intel64/ifort -I$tmpMKLROOT/include -coarray -coarray-num-images=5 \
    caf_test.f90 -L$tmpMKLROOT/lib/intel64 -lmkl_rt -lpthread -lm -liomp5 -o caf_test.exe
./caf_test.exe

echo "Using 2018 Intel Compiler..."
tmpMKLROOT=/opt/intel/compilers_and_libraries_2018.3.222/linux/mkl
source /opt/intel/parallel_studio_xe_2018/psxevars.sh intel64 &> /dev/null
source /opt/intel/compilers_and_libraries/linux/bin/compilervars.sh intel64
/opt/intel/compilers_and_libraries_2018.3.222/linux/bin/intel64/ifort -I$tmpMKLROOT/include -coarray -coarray-num-images=5 \
    caf_test.f90 -L$tmpMKLROOT/lib/intel64 -lmkl_rt -lpthread -lm -liomp5 -o caf_test.exe
./caf_test.exe
With five images, the code takes about 4.5 times as long to run using the 2018 compiler.
What am I doing wrong here? I am curious why there would be such performance differences across compilers.
Thanks!
Chris
Quick comment: you can play around with "K" and "nsim". I have previously set "nsim" to 1 and "K" to 29000000, and I have similar issues with image communication.
Another update: the code compiled with the 2019 package performs similarly to the 2018 package. That is, the 2015 compiler still produces faster coarray code.
Hi,
I tested your program with OpenCoarrays/gfortran on a laptop, using 5 coarray images. (I replaced the calls to omp_get_wtime() with calls to the cpu_time intrinsic.)
The parallel slowdown comes from the last two SYNC ALL statements, each of which is executed a thousand times in the outer do loop. On my computer, each of these statements needs approximately 150 seconds to complete its thousand executions. The data transfers through the fill_in coarray in the inner loop (executed 5000 times), on the other hand, do not seem to cause any parallel slowdown at all.
Therefore, with your example program I would rather suspect the SYNC ALL statement as a possible cause for performance differences across compilers (assuming that everything else is exactly the same).
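One way to check this suspicion is to time SYNC ALL on its own, with no data transfer at all, and compare the result across compiler versions. The program below is a minimal sketch and is not taken from caf_test.f90; the program name and nsync are hypothetical. It uses system_clock for wall-clock time, because cpu_time reports processor time and may not reflect time spent waiting at a barrier.

```fortran
program sync_cost
  implicit none
  integer, parameter :: nsync = 1000   ! number of barriers to time (hypothetical value)
  integer(kind=8) :: t0, t1, rate
  integer :: i

  sync all                       ! line up all images before starting the clock
  call system_clock(t0, rate)
  do i = 1, nsync
     sync all                    ! the only work inside the timed loop
  end do
  call system_clock(t1)

  ! Report wall-clock time as seen by image 1.
  if (this_image() == 1) then
     print '(a,i0,a,es12.4,a)', 'sync all x ', nsync, ' took ', &
          real(t1 - t0, kind=8) / real(rate, kind=8), ' seconds'
  end if
end program sync_cost
```

It can be built with the same flags as in the script from the first post (ifort -coarray -coarray-num-images=5). If the time per barrier differs strongly between compiler versions here, the difference is likely in the runtime's synchronization rather than in the coarray assignments.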
Best Regards
Thanks Michael, I agree that the "sync all" statements are a definite source of the slowdown!
Although just to be sure, I edited the program to remove the two sync statements. To show that the "sync all" statements are necessary, I wrote a loop from lines 117 to 124 where I check that the coarray is the same across images (for a subset of coarray elements).
The 2019 compiler is slower even without the "sync all" statements, and the "sync all" statements are indeed necessary.
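The kind of check described above can be sketched as follows. This is an illustration under assumptions, not the attached code: only the coarray name fill_in comes from the thread, K is kept small, the block split is simplified, and every element is compared rather than a subset.

```fortran
program coarray_check
  implicit none
  integer, parameter :: K = 20      ! small illustrative length (hypothetical)
  real(kind=8) :: fill_in(K)[*]     ! one copy of the vector on every image
  integer :: me, n, chunk, i_start, i_end, img, kk
  logical :: same

  me = this_image()
  n  = num_images()

  ! Simple block split; the last image also takes any remainder.
  chunk   = K / n
  i_start = (me - 1) * chunk + 1
  i_end   = merge(K, me * chunk, me == n)

  fill_in = 0.0d0
  sync all   ! every copy is zeroed before any remote write can land

  ! Each image writes its own slice into every image's copy.
  do img = 1, n
     fill_in(i_start:i_end)[img] = real(me, kind=8)
  end do
  sync all   ! all remote writes are complete before anyone reads

  ! Image 1 compares its copy with every other image's copy.
  if (me == 1) then
     same = .true.
     do img = 2, n
        do kk = 1, K
           if (fill_in(kk)[img] /= fill_in(kk)) same = .false.
        end do
     end do
     print '(a,l1)', 'Coarrays are the same across images: ', same
  end if
end program coarray_check
```

Without the second sync all, image 1 may read slices that other images have not yet written, which is consistent with the F result reported when the sync statements are removed.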
So, is there something wrong with my implementation of coarrays, or has this feature of the Intel Compiler gotten slower over time? I would be surprised if the latter were true.
Chris
Edited f90 code, bash shell script, and output below (and attached in .tar.gz)
Coarray Fortran program:
program caf_test
IMPLICIT NONE
integer :: nsim
INTEGER, PARAMETER :: K=2900 ! size of array to be populated
INTEGER :: nthreads
INTEGER, DIMENSION(K) :: image_ind
INTEGER :: mrem, mchunk
INTEGER :: tid, offset
INTEGER :: i_start, i_end
INTEGER, ALLOCATABLE, DIMENSION(:,:) :: seq
INTEGER :: ii, incr, n, jj, isim, kk
REAL(kind=8), DIMENSION(K), CODIMENSION
Bash script to compile:
#!/bin/bash

tmpMKLROOT=/opt/intel/composer_xe_2015.3.187/mkl
source /opt/intel/composer_xe_2015.3.187/mkl/bin/intel64/mklvars_intel64.sh
source /opt/intel/parallel_studio_xe_2015/psxevars.sh intel64 &> /dev/null
/opt/intel/composer_xe_2015.3.187/bin/intel64/ifort -I$tmpMKLROOT/include -coarray -coarray-num-images=5 \
    caf_test.f90 -L$tmpMKLROOT/lib/intel64 -lmkl_rt -lpthread -lm -liomp5 -o caf_test.exe
echo "Using 2015 Intel Compiler, sync statements in place..."
./caf_test.exe sync
echo "Using 2015 Intel Compiler, sync statements removed..."
./caf_test.exe

tmpMKLROOT=/opt/intel/compilers_and_libraries_2018.3.222/linux/mkl
source /opt/intel/parallel_studio_xe_2018/psxevars.sh intel64 &> /dev/null
source /opt/intel/compilers_and_libraries/linux/bin/compilervars.sh intel64
/opt/intel/compilers_and_libraries_2018.3.222/linux/bin/intel64/ifort -I$tmpMKLROOT/include -coarray -coarray-num-images=5 \
    caf_test.f90 -L$tmpMKLROOT/lib/intel64 -lmkl_rt -lpthread -lm -liomp5 -o caf_test.exe
echo "Using 2019 Intel Compiler, sync statements in place..."
./caf_test.exe sync
echo "Using 2019 Intel Compiler, sync statements removed..."
./caf_test.exe
Output:
Using 2015 Intel Compiler, sync statements in place...
("sync all" statements in place)
Hello from image 1 out of 5 total images
Hello from image 2 out of 5 total images
Hello from image 5 out of 5 total images
Hello from image 4 out of 5 total images
Hello from image 3 out of 5 total images
Image 1 handles range 1 580
Image 2 handles range 581 1160
Image 3 handles range 1161 1740
Image 4 handles range 1741 2320
Image 5 handles range 2321 2900
----------------------------------
Coarray synchronized across images in 5.635999999999999E-003 seconds
----------------------------------
Coarrays are the same across images: T
============================================
Using 2015 Intel Compiler, sync statements removed...
("sync all" statements removed)
Hello from image 1 out of 5 total images
Hello from image 4 out of 5 total images
Hello from image 5 out of 5 total images
Hello from image 2 out of 5 total images
Hello from image 3 out of 5 total images
Image 1 handles range 1 580
Image 2 handles range 581 1160
Image 3 handles range 1161 1740
Image 4 handles range 1741 2320
Image 5 handles range 2321 2900
----------------------------------
Coarray synchronized across images in 4.562000000000000E-003 seconds
----------------------------------
Coarrays are the same across images: F
============================================
Using 2019 Intel Compiler, sync statements in place...
("sync all" statements in place)
Hello from image 1 out of 5 total images
Hello from image 5 out of 5 total images
Hello from image 4 out of 5 total images
Hello from image 3 out of 5 total images
Hello from image 2 out of 5 total images
Image 1 handles range 1 580
Image 2 handles range 581 1160
Image 3 handles range 1161 1740
Image 4 handles range 1741 2320
Image 5 handles range 2321 2900
----------------------------------
Coarray synchronized across images in 13.4575210000000 seconds
----------------------------------
Coarrays are the same across images: T
============================================
Using 2019 Intel Compiler, sync statements removed...
("sync all" statements removed)
Hello from image 1 out of 5 total images
Hello from image 4 out of 5 total images
Hello from image 3 out of 5 total images
Hello from image 5 out of 5 total images
Hello from image 2 out of 5 total images
Image 1 handles range 1 580
Image 2 handles range 581 1160
Image 3 handles range 1161 1740
Image 4 handles range 1741 2320
Image 5 handles range 2321 2900
----------------------------------
Coarray synchronized across images in 2.42606500000000 seconds
----------------------------------
Coarrays are the same across images: F
============================================