Intel® Fortran Compiler

Parallelization problem

kirab
Beginner
I have a simple program to test OpenMP parallelization.
========================================================
program test
  use omp_lib
  implicit none

  integer i, j, k, n, np, npm
  real(8) t1
  real, allocatable :: a(:,:,:), b(:,:,:), dt(:)

  n = 500
  allocate( a(n,n,n), b(n,n,n) )

  a = 1.0
  b = 2.0

  npm = omp_get_max_threads()
  allocate( dt(npm) )

  do np = 1, npm
    call omp_set_num_threads(np)
    t1 = omp_get_wtime()
    call sum_mat( n, a, b )
    dt(np) = omp_get_wtime() - t1
  enddo

  do np = 2, npm
    print *, np, dt(1)/dt(np)
  enddo
end

subroutine sum_mat( n, a, b )
  implicit none
  integer n
  real a(n,n,n), b(n,n,n)

  integer i, j, k

  !$omp parallel do
  do k = 1, n
    do j = 1, n
      do i = 1, n
        a(i,j,k) = a(i,j,k) + b(i,j,k)
      enddo
    enddo
  enddo
end
========================================================

I use IFC 10 and build the executable with the following command:
ifort /nologo /O2 /Qfpp2 /Qopenmp s5.f90

When I run this program several times on my Dual Core processor, I get the following results:
2 0.9937300
2 0.9861304
2 1.004650

Why can't I get a speedup of about 2? What is incorrect in the program?

7 Replies
Steve_Nuchia
New Contributor I
Your program is memory-bandwidth bound. You get two cores waiting for cache fills instead of one, so there is no gain.
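
One way to see this in numbers (a minimal sketch added here, not part of the original post, reusing n, np, npm and dt(:) from the program above): each call to sum_mat reads a, reads b and writes a back, roughly 3*n**3*4 bytes of traffic, so the measured times translate directly into an achieved bandwidth.

========================================================
! Sketch only: convert the timings into achieved memory bandwidth.
! Assumes the variables n, np, npm and dt(:) from the test program above.
do np = 1, npm
  ! 3 * n**3 * 4 bytes per call (read a, read b, write a), reported in GB/s
  print *, np, ' thread(s): ', 3.0d0*real(n,8)**3*4.0d0 / dt(np) / 1.0d9, ' GB/s'
enddo
========================================================

If the GB/s figure barely moves as threads are added, the cores are waiting on memory rather than on arithmetic.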
jimdempseyatthecove
Honored Contributor III

You may also be experiencing a cache-loading problem. Try repeating the same test several times inside the same application:

do i = 1, 5
  print *, 'Iteration ', i
  do np = 1, npm
    call omp_set_num_threads(np)
    t1 = omp_get_wtime()
    call sum_mat( n, a, b )
    dt(np) = omp_get_wtime() - t1
  enddo
  do np = 2, npm
    print *, np, dt(1)/dt(np)
  enddo
end do

The 1st iteration generally is longer as it populates the cache.

If one of the other test runs is longer than the others, this may indicate that the operating system is interfering with the program.

Jim Dempsey
kirab
Beginner
It sounds like there is no way to improve the performance of this program by using several cores. Is that true?
kirab
Beginner
JimDempseyAtTheCove:

You also may be experiencing a cache loading problem. Try running the same test inside the same application

The 1st iteration generally is longer as it populates the cache.

If one of the other test runs is longer than the others this may indicate the operating system interfering with the program.



Thank you for the advice, but I get the same results.
TimP
Honored Contributor III
In order to see a warm cache effect, where you get parallel speedup on the 2nd repetition, the big arrays would have to be made small enough to fit in cache.
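
As an illustration (a sketch added here, not from the thread; the program name warm_cache_test is hypothetical and it reuses the sum_mat subroutine from the posted source), shrinking n so that both arrays fit in a few megabytes of cache makes the repeated runs meaningful:

========================================================
! Sketch only: a problem size whose working set fits in cache.
! 2 arrays * 64**3 elements * 4 bytes is about 2 MB, within a typical L2/L3.
program warm_cache_test
  use omp_lib
  implicit none
  integer n
  real(8) t1
  real, allocatable :: a(:,:,:), b(:,:,:)

  n = 64
  allocate( a(n,n,n), b(n,n,n) )
  a = 1.0
  b = 2.0

  call sum_mat( n, a, b )              ! first call warms the cache
  t1 = omp_get_wtime()
  call sum_mat( n, a, b )              ! this call runs largely out of cache
  print *, 'warm-cache time: ', omp_get_wtime() - t1
end
========================================================

At that size the second and later calls are no longer limited by main-memory bandwidth, so adding threads can actually show a speedup.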
jimdempseyatthecove
Honored Contributor III

>>In order to see a warm cache effect, where you get parallel speedup on the 2nd repetition, the big arrays would have to be made small enough to fit in cache.

That is correct (array sizes were not listed in the user's original post).

On the flip side, if you know your arrays are larger than cache, break the problem up into chunks that fit within cache; you can then (potentially) work your way through the large array(s) and improve cache utilization. There are system calls to get the sizes of the L1, L2 (and L3, if present) caches. This technique goes by various names, one of which is striping.
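
A rough sketch of what such a chunked loop can look like (illustrative only; sum_mat_blocked and the block length nb are hypothetical additions, nb would be tuned to the reported cache sizes, and for a single-pass operation like this the payoff comes mainly when the same chunk is reused by later computation):

========================================================
subroutine sum_mat_blocked( n, nb, a, b )
  implicit none
  integer n, nb                        ! nb = block length chosen to fit in cache
  real a(n,n,n), b(n,n,n)
  integer i, j, k, jb, kb

  ! Walk the arrays in nb-by-nb panels so each panel stays cache-resident.
  !$omp parallel do private(jb,k,j,i)
  do kb = 1, n, nb
    do jb = 1, n, nb
      do k = kb, min(kb+nb-1, n)
        do j = jb, min(jb+nb-1, n)
          do i = 1, n
            a(i,j,k) = a(i,j,k) + b(i,j,k)
          enddo
        enddo
      enddo
    enddo
  enddo
end
========================================================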

Jim Dempsey

Steve_Nuchia
New Contributor I

Array sizes were listed in the original post: 500**3 = 125 million 4-byte reals, times 2 arrays, is a solid gigabyte. He's reading half a gig twice and writing it once in almost exactly 1 second, presumably on a fairly recent Intel single-socket desktop machine. Hence, he's within a factor of 2 of the memory controller's maximum sustained bandwidth.

But he ought to be able to get closer to theoretical than that for this operation, even with one core ...

Oh, yeah: he's also allocating and initializing the arrays in that second. If the OS is zero-filling the memory, that's two more GB transferred and we're at 3.5 GB/sec. You can't do better than that on his hardware no matter how many threads you use.
