Intel® Fortran Compiler

Parallelization problem

kirab
Beginner
I have a simple program to test OpenMP parallelization.
========================================================
program test
  use omp_lib
  implicit none

  integer i, j, k, n, np, npm
  real(8) t1
  real, allocatable :: a(:,:,:), b(:,:,:), dt(:)

  n = 500
  allocate( a(n,n,n), b(n,n,n) )

  a = 1.0
  b = 2.0

  npm = omp_get_max_threads()
  allocate( dt(npm) )

  do np = 1, npm
    call omp_set_num_threads(np)
    t1 = omp_get_wtime()
    call sum_mat( n, a, b )
    dt(np) = omp_get_wtime() - t1
  enddo

  do np = 2, npm
    print *, np, dt(1)/dt(np)
  enddo
end

subroutine sum_mat( n, a, b )
  implicit none
  integer n
  real a(n,n,n), b(n,n,n)

  integer i, j, k

  !$omp parallel do
  do k = 1, n
    do j = 1, n
      do i = 1, n
        a(i,j,k) = a(i,j,k) + b(i,j,k)
      enddo
    enddo
  enddo
end
========================================================

I use IFC 10 and build the executable with the following command:
ifort /nologo /O2 /Qfpp2 /Qopenmp s5.f90

When I run this program several times on my Dual Core processor, I get the following results:
2 0.9937300
2 0.9861304
2 1.004650

Why can't I get a speedup of about 2? What is incorrect in the program?

7 Replies
Steve_Nuchia
New Contributor I
Your program is memory-bandwidth bound. You get two cores waiting for cache fills instead of one, so there is no gain.
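
One way to see this in numbers (a minimal sketch added here, not part of the original post, reusing n, np, npm and dt(:) from the program above): each call to sum_mat reads a, reads b and writes a back, roughly 3*n**3*4 bytes of traffic, so the measured times translate directly into an achieved bandwidth.

========================================================
! Sketch only: convert the timings into achieved memory bandwidth.
! Assumes the variables n, np, npm and dt(:) from the test program above.
do np = 1, npm
  ! 3 * n**3 * 4 bytes per call (read a, read b, write a), reported in GB/s
  print *, np, ' thread(s): ', 3.0d0*real(n,8)**3*4.0d0 / dt(np) / 1.0d9, ' GB/s'
enddo
========================================================

If the GB/s figure barely moves as threads are added, the cores are waiting on memory rather than on arithmetic.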
jimdempseyatthecove
Honored Contributor III

You may also be experiencing a cache-loading problem. Try repeating the same test several times inside the same application:

do i = 1, 5
  print *, 'Iteration ', i
  do np = 1, npm
    call omp_set_num_threads(np)
    t1 = omp_get_wtime()
    call sum_mat( n, a, b )
    dt(np) = omp_get_wtime() - t1
  enddo
  do np = 2, npm
    print *, np, dt(1)/dt(np)
  enddo
end do

The 1st iteration generally is longer as it populates the cache.

If one of the other test runs is longer than the others, this may indicate that the operating system is interfering with the program.

Jim Dempsey
kirab
Beginner
It sounds like there is no way to improve the performance of this program by using several cores. Is that true?
kirab
Beginner
JimDempseyAtTheCove:

You also may be experiencing a cache loading problem. Try running the same test inside the same application

The 1st iteration generally is longer as it populates the cache.

If one of the other test runs is longer than the others this may indicate the operating system interfering with the program.



Thank you for the advice, but I get the same results.
TimP
Honored Contributor III
In order to see a warm cache effect, where you get parallel speedup on the 2nd repetition, the big arrays would have to be made small enough to fit in cache.
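
As an illustration (a sketch added here, not from the thread; the program name warm_cache_test is hypothetical and it reuses the sum_mat subroutine from the posted source), shrinking n so that both arrays fit in a few megabytes of cache makes the repeated runs meaningful:

========================================================
! Sketch only: a problem size whose working set fits in cache.
! 2 arrays * 64**3 elements * 4 bytes is about 2 MB, within a typical L2/L3.
program warm_cache_test
  use omp_lib
  implicit none
  integer n
  real(8) t1
  real, allocatable :: a(:,:,:), b(:,:,:)

  n = 64
  allocate( a(n,n,n), b(n,n,n) )
  a = 1.0
  b = 2.0

  call sum_mat( n, a, b )              ! first call warms the cache
  t1 = omp_get_wtime()
  call sum_mat( n, a, b )              ! this call runs largely out of cache
  print *, 'warm-cache time: ', omp_get_wtime() - t1
end
========================================================

At that size the second and later calls are no longer limited by main-memory bandwidth, so adding threads can actually show a speedup.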
jimdempseyatthecove
Honored Contributor III

>>In order to see a warm cache effect, where you get parallel speedup on the 2nd repetition, the big arrays would have to be made small enough to fit in cache.

That is correct (array sizes were not listed in the user's original post).

On the flip side, if you know your arrays are larger than cache, break the problem up into chunks that fit within cache; you can then (potentially) work your way through the large array(s) and improve cache utilization. There are system calls to get the sizes of the L1, L2 (and L3, if present) caches. This technique goes by various names, one of which is striping.
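
A rough sketch of what such a chunked loop can look like (illustrative only; sum_mat_blocked and the block length nb are hypothetical additions, nb would be tuned to the reported cache sizes, and for a single-pass operation like this the payoff comes mainly when the same chunk is reused by later computation):

========================================================
subroutine sum_mat_blocked( n, nb, a, b )
  implicit none
  integer n, nb                        ! nb = block length chosen to fit in cache
  real a(n,n,n), b(n,n,n)
  integer i, j, k, jb, kb

  ! Walk the arrays in nb-by-nb panels so each panel stays cache-resident.
  !$omp parallel do private(jb,k,j,i)
  do kb = 1, n, nb
    do jb = 1, n, nb
      do k = kb, min(kb+nb-1, n)
        do j = jb, min(jb+nb-1, n)
          do i = 1, n
            a(i,j,k) = a(i,j,k) + b(i,j,k)
          enddo
        enddo
      enddo
    enddo
  enddo
end
========================================================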

Jim Dempsey

Steve_Nuchia
New Contributor I

Array sizes were listed in the original post: 500**3 = 125 million 4-byte reals, times 2 arrays, is a solid gigabyte. He's reading half a gig twice and writing it once in almost exactly 1 second, presumably on a fairly recent Intel single-socket desktop machine. Hence, he's within a factor of 2 of the memory controller's maximum sustained bandwidth.

But he ought to be able to get closer to theoretical than that for this operation, even with one core ...

Oh, yeah: he's also allocating and initializing the arrays in that second. If the OS is zero-filling the memory, that's two more GB transferred and we're at 3.5 GB/sec. You can't do better than that on his hardware no matter how many threads you use.
