Hello,
I'm taking my first steps with OpenMP. I have written a test program that adds two 2D arrays, and I used OpenMP to parallelize the outermost loop like this:
double precision, allocatable, dimension(:,:) :: a, b, c
double precision t1, t2
integer nt, nr
integer i, j

nt = 2048*4
nr = 600
allocate(a(nt,nr), b(nt,nr), c(nt,nr))

do j = 1, nr
   do i = 1, nt
      a(i,j) = i + j
      b(i,j) = j - i
   end do
end do

t1 = omp_get_wtime()
!$omp parallel do shared(nr,nt,a,b,c) private(j,i)
do j = 1, nr
   do i = 1, nt
      c(i,j) = a(i,j) + b(i,j)
   end do
end do
!$omp end parallel do
t2 = omp_get_wtime()
write(*,*) 'time elapsed', t2 - t1
I have tested the program for various pairs (nt, nr): 512 <= nt <= 524288, 100 <= nr <= 6400, with nt*nr <= 52428800. Only in a few cases was the parallel execution faster than the sequential one, and the speed-up never exceeded 1.25. For many combinations of (nt, nr), using OpenMP actually made execution slower.
I compiled with the following options:
/nologo /fpp /I"C:\Program Files\Intel\Compiler\11.0\072\fortran\mkl\include" /real_size:64 /module:"Release\" /object:"Release\" /libs:static /threads /c
adding or omitting the /Qopenmp option to switch parallelization on or off.
The OMP_NUM_THREADS environment variable is set to 2.
What could I be doing wrong?
Thanks in advance,
Olga
7 Replies
Olga,
You are doing nothing wrong. With so little work being done inside the loop, you have most likely saturated your memory bus. What processor were you using for your test? (dual core, or single core with HT?)
Also, consider adding the SSE3 (or later) option, /QxSSE3 or /QaxSSE3.
Jim Dempsey
Jim,
Thank you for your answer. I'm using Intel Core2 Duo.
How can I check that the memory bus is a bottleneck?
Olga
Quoting - Olga
Jim,
Thank you for your answer. I'm using Intel Core2 Duo.
How can I check that the memory bus is a bottleneck?
Olga
Olga,
Compile in Debug mode with array bounds checking on. Your program will then contain extra code that is not functional for your application but is useful for this test: it makes the loop more computational, rather than bottlenecked by the memory bus.
On my system (a 2.4 GHz Q6600 quad core) your loop runs for only a short time:

Release build:
Threads   Time (s)    Speed-up
1         3.68E-2
2         2.63E-2     ~1.4x
3         2.79E-2
4         2.69E-2

Debug build:
Threads   Time (s)    Speed-up
1         0.1349
2         6.265E-2    ~2.15x
3         4.21E-2
4         3.238E-2
Of course you will not run your real application as a Debug build, but the test illustrates that memory is the bottleneck in this case.
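[Editor's note: Jim's debug-build trick can also be reproduced portably by raising the arithmetic intensity of the loop body directly. The sketch below assumes the arrays and loop bounds from Olga's original program, plus two extra locals (integer k, double precision x); the repeat count of 20 and the constants are arbitrary. If the second loop gets close to 2x speed-up on two threads while the first does not, the first loop is limited by memory bandwidth rather than by the CPU.]

```fortran
! Loop 1: one addition per element -- bandwidth-bound.
!$omp parallel do shared(nr,nt,a,b,c) private(j,i)
do j = 1, nr
   do i = 1, nt
      c(i,j) = a(i,j) + b(i,j)
   end do
end do
!$omp end parallel do

! Loop 2: ~20 multiply-adds per element on the same memory
! traffic -- compute-bound, so it should scale with threads.
! (Assumes additional locals: integer k, double precision x.)
!$omp parallel do shared(nr,nt,a,b,c) private(j,i,k,x)
do j = 1, nr
   do i = 1, nt
      x = a(i,j)
      do k = 1, 20
         x = 0.999d0*x + 1.0d-3*b(i,j)
      end do
      c(i,j) = x
   end do
end do
!$omp end parallel do
```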
Jim Dempsey
Hi Jim,
In fact, I originally tried to parallelize a large loop and saw no speed-up, so I ran the small test described at the beginning. Your reply gives me some idea of why there is no speed-up in the original case.
Thanks,
Olga
Olga,
The most common causes of no speed-up are:
1) Running on a single-core machine
2) Running with one thread on a multi-core machine
3) Saturated memory bandwidth, e.g. in loops of the form
ArrayA = ArrayB
ArrayA = ArrayB + ArrayC
ArrayA = ArrayA * Scalar
4) Calling a function/subroutine that contains a critical section,
e.g. a random number generator, allocate, deallocate, ...
There are others, but I think you have eliminated 1 and 2.
Jim Dempsey
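[Editor's note: cause 4 is worth a short illustration. In the sketch below, the loop body calls the random_number intrinsic; whether that intrinsic takes an internal lock is implementation-dependent, but any such hidden lock (or an explicit !$omp critical) forces the threads to take turns, and the loop will not scale no matter how many threads are used. The names a, n, and r are assumed locals, not from the original post.]

```fortran
! Assumes: integer n, double precision a(n), double precision r.
! If random_number serializes internally, threads queue up here.
!$omp parallel do private(j, r)
do j = 1, n
   call random_number(r)   ! potential hidden critical section
   a(j) = r
end do
!$omp end parallel do
```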
Quoting - jimdempseyatthecove
Olga,
The most common causes of no speed-up are:
1) Running on a single-core machine
2) Running with one thread on a multi-core machine
3) Saturated memory bandwidth, e.g. in loops of the form
ArrayA = ArrayB
ArrayA = ArrayB + ArrayC
ArrayA = ArrayA * Scalar
4) Calling a function/subroutine that contains a critical section,
e.g. a random number generator, allocate, deallocate, ...
There are others, but I think you have eliminated 1 and 2.
Jim Dempsey
Jim,
Thanks for summarizing the possible performance issues here. I analyzed the code with both PTU and VTune, and I can confirm the issue is 3): this code is indeed severely memory-bandwidth limited.
Running a quick memory bandwidth analysis with PTU (available here):
http://software.intel.com/en-us/articles/intel-performance-tuning-utility/
will immediately show you that memory bandwidth is the major performance-limiting factor.
Running VTune sampling and looking at just CPU_CLOCK_UNHALTED.CORE, BUS_TRANS_BURST.SELF (full-cacheline bus bursts), and MEM_LOAD_RETIRED.L2_LINE_MISS shows that this line of code...
c(i,j)=a(i,j)+b(i,j)
...accounts for more than 90% of all unhalted core clocks, and for nearly 100% of all full-cacheline bus bursts and L2 misses. Further, as you increase the number of OpenMP threads, the number of L2 miss events increases dramatically, whereas for a code that scales it should stay steady or even decrease as threads are added. This explains why the code does not speed up with more threads, and can even slow down as more threads are used.
It is also the case for this code that switching to static allocation of the arrays makes virtually no performance difference, which is another good indicator of a bandwidth-limited code.
Patrick Kennedy
Intel Developer Support
After adding the missing USE omp_lib and END statements: on a Core i7 with a slow RAM setup, going from 1 thread to 4 threads cuts 45% off the run time. The imbalance time reported by openmp_profile goes up sharply from 2 to 4 threads.
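[Editor's note: for reference, a complete, compilable version of Olga's test with the two fixes mentioned above applied might look like the sketch below. The program name is arbitrary; everything else is taken from the original post.]

```fortran
program addtest
   use omp_lib                     ! provides omp_get_wtime
   implicit none
   double precision, allocatable, dimension(:,:) :: a, b, c
   double precision t1, t2
   integer nt, nr, i, j

   nt = 2048*4
   nr = 600
   allocate(a(nt,nr), b(nt,nr), c(nt,nr))

   do j = 1, nr
      do i = 1, nt
         a(i,j) = i + j
         b(i,j) = j - i
      end do
   end do

   t1 = omp_get_wtime()
!$omp parallel do shared(nr,nt,a,b,c) private(j,i)
   do j = 1, nr
      do i = 1, nt
         c(i,j) = a(i,j) + b(i,j)
      end do
   end do
!$omp end parallel do
   t2 = omp_get_wtime()

   write(*,*) 'time elapsed', t2 - t1
end program addtest
```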
