I'm following the coarray tutorial, https://www.intel.com/content/www/us/en/docs/fortran-compiler/tutorial-coarray/18-0/overview.html
I'm seeing something strange: with 2 images the program takes 12.2 seconds, with 4 images it takes 17.2 seconds, and with 16 images it takes 17.3 seconds.
Why is the run time longer with 4 images than with 2? In the tutorial, the program runs on a 4-core, 8-thread processor and is faster with 8 images than with 4. My computer has a 4-threads and 2-cores processor (this one: https://www.intel.com/content/www/us/en/products/sku/52229/intel-core-i52520m-processor-3m-cache-up-to-3-20-ghz/specifications.html).
So, why is it slower when I run with 4 images instead of 2?
This is my code for computing pi using coarray:
program mcpi_using_coarray
    implicit none
    real*8 , parameter :: actual_pi = 3.141592653589793238d0
    integer*8 i , num_trials/600000000/ , clock_start,clock_end,clock_rate
    real*8 x , y , computed_pi
    integer*8 total[*]
    integer seed_size
    integer , allocatable :: seed_array(:)

    if ( this_image()==1 ) then
        if ( mod(num_trials,int(num_images(),8))/=0 ) &
            error stop 'Erro'
        print '(A,I0,A,I0,A)', "Computing pi using ",num_trials," trials across ",NUM_IMAGES()," images"
        call SYSTEM_CLOCK(clock_start)
    endif

    call random_seed()
    call random_seed(size=seed_size)
    allocate(seed_array(seed_size))
    call random_seed(get=seed_array)
    seed_array(1)=seed_array(1)+this_image()*13
    call random_seed(put=seed_array)

    total = 0
    do i = 1 , num_trials/num_images()
        call random_number(x)
        call random_number(y)
        if ( (x*x)+(y*y) <= 1.d0 ) total = total + 1
    enddo

    sync all

    if ( this_image()==1 ) then
        do i = 2 , num_images()
            total = total + total[i]
        enddo
        computed_pi = 4.*(real(total,8)/real(num_trials,8))
        print '(A,G0.8,A,G0.3)', "Computed value of pi is ", computed_pi, &
            ", Relative Error: ",ABS((computed_pi-actual_pi)/actual_pi)
        call SYSTEM_CLOCK(clock_end,clock_rate)
        print '(A,G0.3,A)', "Elapsed time is ", &
            REAL(clock_end-clock_start)/REAL(clock_rate)," seconds"
    endif
endprogram
>>So, why is it slower when I run with 4 images instead of 2?
>>My computer has a 4-threads and 2-cores processor
The likely cause is that each image (process) has approximately the same number of instructions to execute.
Your system has 2 cores. When run with 2 ranks (images), it is likely running in Turbo Boost mode, whereas with 4 ranks it appears to be running without Turbo Boost. In other words, more threads running means a higher temperature and a lower clock speed. In an ideal world where the CPU speed is fixed, you would expect the same runtime regardless of the number of ranks used (taking into account the application startup overhead). Because your CPU has variable clock speeds, this is not the case.
Jim Dempsey
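One way to check the explanations offered in this thread is to time the compute loop separately on every image, so that process startup is excluded and a slowdown per trial becomes visible. The following is only a sketch under assumptions: the program name and the variables `hits`, `loop_seconds`, and `img` are hypothetical, the trial count is copied from the question, and the seeding is deliberately simplified because only the timing matters here.

```fortran
program time_per_image
    use, intrinsic :: iso_fortran_env, only: int64, real64
    implicit none
    ! Same trial count as in the question; assumed divisible by num_images().
    integer(int64), parameter :: num_trials = 600000000_int64
    integer(int64) :: i, clock_start, clock_end, clock_rate
    real(real64)   :: x, y
    integer(int64) :: hits[*]          ! hits counted by this image
    real(real64)   :: loop_seconds[*]  ! wall time of this image's loop
    integer        :: img

    ! Seeding is processor dependent and not the point of this sketch.
    call random_seed()

    hits = 0
    ! Start the clock after startup so only the loop itself is measured.
    call system_clock(clock_start, clock_rate)
    do i = 1, num_trials / num_images()
        call random_number(x)
        call random_number(y)
        if (x*x + y*y <= 1.0_real64) hits = hits + 1
    end do
    call system_clock(clock_end)
    loop_seconds = real(clock_end - clock_start, real64) / real(clock_rate, real64)

    ! Make every image's result visible before image 1 reads it.
    sync all

    if (this_image() == 1) then
        do img = 1, num_images()
            print '(A,I0,A,F0.3,A,I0)', "Image ", img, " loop time: ", &
                loop_seconds[img], " s, hits: ", hits[img]
        end do
    end if
end program time_per_image
```

If the per-image loop times with 4 images are close to those with 2 images, even though each image runs half as many trials, the cores are oversubscribed or clocked lower. If the loop times drop but the total elapsed time does not, startup overhead is the more likely culprit.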
I wrote that tutorial. The increased time is partly the startup time for the processes, but I think your PC is overloaded by that many processes.
An interesting observation I made at the time was that increasing the sample count didn't noticeably improve the approximation - I never did figure out why, though I had some theories. Since this was intended to be a tutorial on coarrays and not a mathematical treatise, I didn't spend too much time on the investigation.
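One general statistical point may be relevant to that observation, offered as background rather than as the tutorial author's conclusion. Assuming ideal, independent uniform random numbers, the hit count $H$ out of $N$ trials is binomial with success probability $\pi/4$, so the error of the estimate shrinks only as $1/\sqrt{N}$:

```latex
\hat{\pi} = \frac{4H}{N}, \qquad H \sim \mathrm{Binomial}\!\left(N, \tfrac{\pi}{4}\right)

\sigma_{\hat{\pi}}
  = 4\sqrt{\frac{\tfrac{\pi}{4}\left(1 - \tfrac{\pi}{4}\right)}{N}}
  = \sqrt{\frac{\pi\,(4 - \pi)}{N}}
  \approx \frac{1.64}{\sqrt{N}}

N = 6 \times 10^{8}: \quad
\sigma_{\hat{\pi}} \approx 6.7 \times 10^{-5}, \qquad
\frac{\sigma_{\hat{\pi}}}{\pi} \approx 2.1 \times 10^{-5}
```

Under that assumption, gaining one more correct decimal digit takes roughly 100 times as many trials, so moderate increases in the sample count would change the result only slightly. Whether this fully accounts for what was seen in the tutorial is not established here; the quality of the random number generator could also play a role.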
Thank you, @Steve_Lionel, for that coarray tutorial. I just recently updated some of the text for it. I'm not sure when the update will be published.
This is the key message update:
The Intel® Fortran Compiler (ifx) and the Intel® Fortran Compiler Classic (ifort) support parallel programming using coarrays as defined in the Fortran 2008 Standard and extended by Fortran 2018.
I didn't look into the performance either. I just made sure the example works ok for both ifx and ifort.
Thanks, everyone
