OpenMP Program taking longer with two threads than one thread

Pisani__Will · ‎07-03-2018

Hi! I wrote a coin flipping program in Fortran with OpenMP, but it's not working properly. I'm using the Parallel Studio XE 2018 compiler with Visual Studio as the IDE. The number of flips is 10E9 which takes a single thread ~200 seconds to run on my i7-7700 and two threads takes ~800 seconds to run. I do know that the overhead from running two threads could be large enough that that is the reason for it taking longer, but doesn't that usually happen if your program doesn't have enough computational intensity? Curiously, gfortran -O3 -fopenmp with two threads takes ~530 seconds. This is my first real Fortran program, but I'm very familiar with programming. Also, I wrote a parallel coin flipping function in Julia with number of flips at 10E9 that takes ~2 seconds with one thread, ~1.15 seconds with two threads, and ~0.84 seconds with four threads. I'm wondering why my Julia function is faster than my Fortran program. Is there anything I can do to optimize my code? Thanks!

Below is the code:

!  coin_flip_omp.f90 
!
!  FUNCTIONS:
!  coin_flip - Entry point of console application.
!

!****************************************************************************
!
!  PROGRAM: coin_flip
!
!  PURPOSE: A simple parallel coin flipping program
!
!****************************************************************************

    program coin_flip_omp
    ! Import libraries
    use omp_lib
    
    implicit none
    
    ! Variables
    integer, parameter :: MyLongIntType = selected_int_kind (12)
    integer (kind=MyLongIntType) :: num_of_flips = 10E9
    integer (kind=MyLongIntType) :: count = 0, i = 1, j
    integer :: proc_num, thread_num
    real :: x
    real :: seconds
    
    ! Begin timing program
    seconds = omp_get_wtime()
    
    ! Program statement
    print *, 'This program will flip a coin', num_of_flips, 'times and report on the number of heads'
   
    ! How many processors are available?
    proc_num = omp_get_num_procs()
    thread_num = 2
    
    call omp_set_num_threads(thread_num) 
    
    print *, 'Number of processors is    ', proc_num
    print *, 'Number of threads requested is ', thread_num
    
    ! Start while loop
    !$OMP PARALLEL DO
    DO j = 1, num_of_flips
        ! Flip the coin
        ! RANDOM_NUMBER returns a pseudo-random number between 0 and 1
        call RANDOM_NUMBER(x)
        IF (x < 0.5) THEN
            count = count + 1
        END IF
        i = i + 1 ! Increment counter by 1
    END DO
    !$OMP END PARALLEL DO
    
    ! End timing 
    seconds = omp_get_wtime() - seconds
    
    ! Print the number of heads
    print *, 'The number of heads is ', count
    print *, 'Time taken to run:', seconds, 'seconds'

    end program coin_flip_omp

Steve_Lionel · ‎07-03-2018

One issue I see is that calling RANDOM_NUMBER inside a parallel region will introduce lock contention as there is a single seed per program. One way around this is to generate num_of_flips random numbers before the parallel loop, and then just reference the jth value. You can allocate an array of the correct size and call RANDOM_NUMBER once to fill it.

Pisani__Will · ‎07-03-2018

Thanks! My code is running much faster now. Another thing that probably wasn't helping was that my project was tagged as "Debug" and not "Release" and so optimizations weren't being applied. Now the code takes about 4 seconds with one thread and num_of_flips set to 10E8.

Here's the revised code:

!  coin_flip_omp.f90 
!
!  FUNCTIONS:
!  coin_flip - Entry point of console application.
!

!****************************************************************************
!
!  PROGRAM: coin_flip
!
!  PURPOSE: A simple parallel coin flipping program
!
!****************************************************************************

    program coin_flip_omp
    ! Import libraries
    use omp_lib
    
    implicit none

    ! Variables
    integer, parameter :: MyLongIntType = selected_int_kind (10)
    integer (kind=MyLongIntType) :: num_of_flips = 10E8
    integer (kind=MyLongIntType) :: count = 0, j, i
    integer :: proc_num, thread_num
    real, allocatable :: rand_num_array(:)
    real :: seconds
    
    ! Begin timing program
    seconds = omp_get_wtime()
    
    ! Allocate rand_num_array
    allocate(rand_num_array(num_of_flips))
    
    ! Program statement
    print *, 'This program will flip a coin', num_of_flips, 'times and report on the number of heads'
    
    ! Generate an array num_of_flips long of random numbers
    call RANDOM_NUMBER(rand_num_array)
    
    print *, 'Time to generate random array: ', omp_get_wtime() - seconds, 'seconds'
    
    ! How many processors are available?
    proc_num = omp_get_num_procs()
    thread_num = 4
    
    ! Set number of threads to use
    call omp_set_num_threads(thread_num) 
    
    print *, 'Number of processors is    ', proc_num
    print *, 'Number of threads requested is ', thread_num
    
    ! Start while loop
    !$OMP PARALLEL DO REDUCTION(+:count)
    DO j = 1, num_of_flips
        ! if the jth value is less than 0.5, then call it heads        
        IF (rand_num_array(j) < 0.5) THEN
            count = count + 1
        END IF
    END DO
    !$OMP END PARALLEL DO
    
    ! End timing 
    seconds = omp_get_wtime() - seconds
    
    ! Print the number of heads
    print *, 'The number of heads is ', count
    print *, 'The percentage of heads is ', dble(count)/dble(num_of_flips)*100
    print *, 'Time taken to run:', seconds, 'seconds'

    end program coin_flip_omp

LeonardB · ‎07-07-2018

One way of coming around the lock preventing RANDOM_NUMBER to run in parallell in OpenMP is to use coarrays instead. The code below seeds the differnt images with separate seeds by shifting around the seed of image no 1

    subroutine co_RANDOM_SEED()
    implicit none
    integer::i
    integer,save:: seed(2)   ! the seed of each image
    integer:: startseed(2)      ! the start seed as two 32 bit integers
    integer(8):: i8seed         ! the seed as one 64 bit integer 
    pointer(locI8seed,i8seed)	! instead of EQUIVALENCE use pointer

    if(this_image().eq.1)then
        CALL RANDOM_SEED ( )                ! create a seed from time
        call RANDOM_SEED(get=startseed)     ! get the start seed
        do i=1,num_images()                 ! shift around the start seed to get a different seed for each image
            locI8seed=loc(startseed)        
            i8seed=ishc(I8seed,64/num_images())
            seed(1)=startseed(1)
            seed(2)=startseed(2)
        enddo
    endif
    sync all
    call RANDOM_seed(put=seed)              ! distribute the seeds to the different images
  
    end