Intel® Fortran Compiler

Number of images of a coarray program for computing pi

tim_theos
Beginner

I'm following the coarray tutorial, https://www.intel.com/content/www/us/en/docs/fortran-compiler/tutorial-coarray/18-0/overview.html

 

I'm seeing something strange: when I run the program with 2 images, it takes 12.2 seconds, but with 4 images it takes 17.2 seconds, and with 16 images it takes 17.3 seconds.

 

Why is the computation time longer with 4 images than with 2? In the tutorial, the program runs on a 4-core, 8-thread processor, and it runs faster with 8 images than with 4. My computer has a 4-threads and 2-cores processor (this one: https://www.intel.com/content/www/us/en/products/sku/52229/intel-core-i52520m-processor-3m-cache-up-to-3-20-ghz/specifications.html).

 

So, why is it slower when I run with 4 images instead of 2?

 

This is my code for computing pi using coarray:

program mcpi_using_coarray
    implicit none
    real*8 , parameter :: actual_pi = 3.141592653589793238d0
    integer*8 i , num_trials/600000000/ , clock_start,clock_end,clock_rate
    real*8 x , y , computed_pi
    integer*8 total[*]    ! per-image count of points inside the quarter circle
    integer seed_size
    integer , allocatable :: seed_array(:)

    ! Image 1 checks that the trials divide evenly and starts the timer
    if ( this_image()==1 ) then
        if ( mod(num_trials,int(num_images(),8))/=0 ) &
            error stop 'Error: number of trials not evenly divisible by number of images'
        print '(A,I0,A,I0,A)', "Computing pi using ",num_trials," trials across ",NUM_IMAGES()," images"
        call SYSTEM_CLOCK(clock_start)
    endif

    ! Give each image its own random sequence by perturbing the seed
    call random_seed()
    call random_seed(size=seed_size)
    allocate(seed_array(seed_size))
    call random_seed(get=seed_array)
    seed_array(1)=seed_array(1)+this_image()*13
    call random_seed(put=seed_array)

    total = 0

    ! Each image runs its share of the trials
    do i = 1 , num_trials/num_images()
        call random_number(x)
        call random_number(y)
        if ( (x*x)+(y*y) <= 1.d0 ) total = total + 1
    enddo

    ! Wait until every image has finished counting
    sync all

    ! Image 1 gathers the counts from the other images and reports
    if ( this_image()==1 ) then
        do i = 2 , num_images()
            total = total + total[i]
        enddo
        computed_pi = 4.*(real(total,8)/real(num_trials,8))
        print '(A,G0.8,A,G0.3)', "Computed value of pi is ", computed_pi, &
            ", Relative Error: ",ABS((computed_pi-actual_pi)/actual_pi)
        call SYSTEM_CLOCK(clock_end,clock_rate)
        print '(A,G0.3,A)', "Elapsed time is ", &
            REAL(clock_end-clock_start)/REAL(clock_rate)," seconds"
    endif

endprogram

 

1 Solution
jimdempseyatthecove
Honored Contributor III

>>So, why is it slower when I run with 4 images instead of 2?

>>My computer has a 4-threads and 2-cores processor

The likely cause is that each image (process) has approximately the same number of instructions to execute.

Your system has 2 cores. When run with 2 ranks (images), it is likely running in Turbo Boost mode, whereas with 4 ranks it appears to be running without Turbo Boost. In other words, more threads running means higher temperature and lower clock speed. In an ideal world where the CPU speed is fixed, you would expect the same runtime regardless of the number of ranks used (after accounting for the application startup overhead). Because your CPU has variable clock speeds, this is not the case.

 

Jim Dempsey
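
One way to check this on a given machine is to time each image's loop on its own, so that process startup is excluded from the measurement. The following is a minimal sketch, not code from the tutorial: the program name time_each_image and the variables elapsed and hits are made up for illustration, and it skips per-image seeding because it only measures time, not pi.

```fortran
program time_each_image
    use, intrinsic :: iso_fortran_env, only: int64, real64
    implicit none
    integer(int64), parameter :: num_trials = 600000000_int64
    integer(int64) :: i, trials_per_image, local_hits, t0, t1, rate
    real(real64)   :: x, y
    real(real64)   :: elapsed[*]   ! each image's own loop time, in seconds
    integer(int64) :: hits[*]      ! each image's hit count (keeps the loop observable)
    integer        :: img

    ! Integer division; any remainder is simply dropped in this sketch
    trials_per_image = num_trials / num_images()
    local_hits = 0

    sync all                       ! line the images up before timing
    call system_clock(t0, rate)
    do i = 1, trials_per_image
        call random_number(x)
        call random_number(y)
        if (x*x + y*y <= 1.0_real64) local_hits = local_hits + 1
    end do
    call system_clock(t1)

    elapsed = real(t1 - t0, real64) / real(rate, real64)
    hits    = local_hits

    sync all                       ! every image has stored its result
    if (this_image() == 1) then
        do img = 1, num_images()
            print '(A,I0,A,F8.3,A,I0,A)', "Image ", img, ": ", elapsed[img], &
                " s, ", hits[img], " hits"
        end do
    end if
end program time_each_image
```

With the Intel compilers this would be built with the coarray option (-coarray on Linux) and run with different image counts, for example by setting FOR_COARRAY_NUM_IMAGES. With 4 images each image does half as many trials as with 2, so if the per-image times barely drop, the two cores were probably already saturated with 2 images.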

 


4 Replies
Steve_Lionel
Honored Contributor III

I wrote that tutorial. The increased time is partly the startup time for the processes, but I think your PC is overloaded by that many processes.

An interesting observation I made at the time was that increasing the sample count didn't noticeably improve the approximation - I never did figure out why, though I had some theories. Since this was intended to be a tutorial on coarrays and not a mathematical treatise, I didn't spend too much time on the investigation.
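
One factor that applies to any Monte Carlo estimate of this kind, offered as a sketch and not as the confirmed explanation for the observation above: the statistical error shrinks only with the square root of the sample count. Assuming the N trials are independent and uniform, and writing H for the number of hits inside the quarter circle:

```latex
\hat{\pi} = \frac{4H}{N}, \qquad H \sim \mathrm{Binomial}\!\left(N, \tfrac{\pi}{4}\right)

\sigma_{\hat{\pi}} = 4\sqrt{\frac{\tfrac{\pi}{4}\left(1 - \tfrac{\pi}{4}\right)}{N}}
                   = \sqrt{\frac{\pi\,(4 - \pi)}{N}}
                   \approx \frac{1.64}{\sqrt{N}}

N = 6 \times 10^{8}: \quad \sigma_{\hat{\pi}} \approx 6.7 \times 10^{-5},
\qquad \frac{\sigma_{\hat{\pi}}}{\pi} \approx 2.1 \times 10^{-5}
```

Under these assumptions, halving the expected error takes four times as many trials, so a modest increase in the sample count changes the result very little. Any limitation of the random number generator would come on top of this.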

Barbara_P_Intel
Moderator

Thank you, @Steve_Lionel, for that coarray tutorial. I just recently updated some of the text for it. I'm not sure when the update will be published.

This is the key message update:

The Intel® Fortran Compiler (ifx) and the Intel® Fortran Compiler Classic (ifort) support parallel programming using coarrays as defined in the Fortran 2008 Standard and extended by Fortran 2018.

I didn't look into the performance either. I just made sure the example works ok for both ifx and ifort.

 

tim_theos
Beginner

Thanks, everyone

 
