Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

parallel code running slower than serial code

Claudio_C_
Beginner

I tried to write a simple program to repeatedly compute greek pi by simulation and then compare the performance of the serial vs. parallelized versions of the code. To my great surprise, the parallel code was slower! Since I am a beginner, I suspect I am not grasping some key aspect of parallel programming. Below I report the whole code. I am working with a version of the Intel 6700 processor with 4 cores.

I don't know if this forum is for this kind of question, but thanks in advance for any help you can give me.

PROGRAM:

program pigreco

! This program computes the value of greek pi "n" times using simulation
! Each time the computation is performed using "m" draws
! The computation is carried out by the subroutine "montec"
! In the end the average of the n simulations is computed and printed on screen
    
implicit none

integer i,n,m
parameter(n=3200,m=250000)
double precision greekpi(n),outp,avpi,den
double precision start_time,end_time

integer chunk,nthreads,omp_get_num_threads
parameter (chunk=400)

call CPU_TIME(start_time)

!$omp parallel private(i)
nthreads = omp_get_num_threads()
print*, 'number of threads',nthreads

!$omp do schedule(dynamic,chunk)
do i = 1,n
   call montec(m,outp)
   greekpi(i) = outp
   outp = 0.0d0
   !print*, i,greekpi(i)
end do
!$omp end do

 

!$omp end parallel

call CPU_TIME(end_time)

print*, 'average value of greek pi'
den = n
avpi = sum(greekpi)/den
print*, avpi

print*, 'running time'
print*, end_time - start_time


end program
    
subroutine montec(ndr,sol)
implicit none
integer ndr
double precision sol

integer i
double precision xr1,xr2,yv(ndr),sumsq,totins,tot

totins = 0.0d0
do i = 1,ndr
   call RANDOM_NUMBER(xr1)
   call RANDOM_NUMBER(xr2)
   sumsq = xr1**2.0d0 + xr2**2.0d0
   if (sumsq.le.1.0d0) then
       totins = totins + 1.0d0
   end if
end do   

tot = ndr

sol = totins/tot
sol = 4.0d0*sol

return
end subroutine

5 Replies
Gregg_S_Intel
Employee

Variable "outp" should be private.

Not sure dynamic,400 is a good idea.  Start with default "static".
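For example, a minimal sketch of the changed region with both suggestions applied (outp made private, default static schedule; the rest of the program is unchanged):

!$omp parallel private(i, outp)
!$omp do schedule(static)
do i = 1,n
   call montec(m,outp)
   greekpi(i) = outp
end do
!$omp end do
!$omp end parallel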

Claudio_C_
Beginner

I have tried both making "outp" private and changing "dynamic" to "static", in the latter case both letting the runtime choose the size of each chunk passed to a thread and setting it myself. Neither worked: the code still runs much slower than the serial version.

McCalpinJohn
Honored Contributor III

Are you sure that the RANDOM_NUMBER function is thread-safe?

Most random number generators update some internal state after computing a new number.  If this is protected by a lock, then the threads will have to process this function one at a time, and the overhead of handling the lock may be larger than the savings in any other parallel work.

SergeyKostrov
Valued Contributor II
>> ...Most random number generators update some internal state after computing a new number. If this is protected by a lock,
>> then the threads will have to process this function one at a time, and the overhead of handling the lock may be larger than
>> the savings in any other parallel work.

This can easily be verified by modifying the code as follows:

...
totins = 0.0d0
do i = 1,ndr
   ! call RANDOM_NUMBER(xr1)
   ! call RANDOM_NUMBER(xr2)
   xr1 = 1.0
   xr2 = 2.0
   sumsq = xr1**2.0d0 + xr2**2.0d0
   if (sumsq.le.1.0d0) then
       totins = totins + 1.0d0
   end if
end do
...

Even if the value of PI is incorrect, it should be faster than the single-threaded processing.
jimdempseyatthecove
Honored Contributor III

RANDOM_NUMBER is thread-safe; however, in order to be thread-safe it uses a critical section (a serializing section). In cases like this, what you do is call RANDOM_NUMBER outside the parallel region with an argument that is an array (not a scalar). The size of the array would typically be the iteration count of the parallel loop that follows. Then, within the parallel loop, you get a random number by indexing the array with the loop index. Using the array (harvest) form of RANDOM_NUMBER, your program crosses the critical section once, as opposed to on each iteration.

Note, in your case you would include:

double precision harvest(ndr*2)
...
call RANDOM_NUMBER(harvest)
...
!$omp parallel
...
!$omp do
...
call montec(m,outp,harvest) ! add harvested array of random numbers
...


subroutine montec(ndr,sol,harvest)
...
double precision harvest(ndr*2)
...
do i = 1,ndr
   xr1 = harvest((i-1)*2+1)
   xr2 = harvest((i-1)*2+2)
...

Jim Dempsey
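Putting the suggestions in this thread together, a sketch of a revised program might look like the one below (an illustration, not a tested drop-in). It makes outp private, uses the default static schedule, and has each call to montec draw a whole block of random numbers with one array-valued RANDOM_NUMBER call, a variant of the harvest idea above in which every simulation gets fresh draws. It also times with omp_get_wtime instead of CPU_TIME, since CPU_TIME typically reports CPU time summed over all threads and so can make a parallel run look slower than it really is.

program pigreco
use omp_lib                       ! omp_get_wtime
implicit none

integer i,n,m
parameter(n=3200,m=250000)
double precision greekpi(n),outp,avpi
double precision start_time,end_time

start_time = omp_get_wtime()      ! wall-clock time, not summed CPU time

!$omp parallel do private(i,outp) schedule(static)
do i = 1,n
   call montec(m,outp)
   greekpi(i) = outp
end do
!$omp end parallel do

end_time = omp_get_wtime()

avpi = sum(greekpi)/n
print*, 'average value of greek pi',avpi
print*, 'running time',end_time - start_time

end program

subroutine montec(ndr,sol)
implicit none
integer ndr
double precision sol

integer i
double precision totins
double precision, allocatable :: harvest(:)

allocate(harvest(2*ndr))          ! allocatable keeps 2*ndr doubles off the thread stack

call RANDOM_NUMBER(harvest)       ! one (possibly locked) call per simulation
                                  ! instead of 2*ndr locked scalar calls

totins = 0.0d0
do i = 1,ndr
   if (harvest(2*i-1)**2 + harvest(2*i)**2 .le. 1.0d0) totins = totins + 1.0d0
end do

sol = 4.0d0*totins/ndr
deallocate(harvest)

end subroutine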