OpenMP program runs slower on 4 CPUs

fjeske · ‎08-18-2008

Can someone explain why the following simple OpenMP program runs slower on 4 CPUs than 1? It nicely burns all four processors but doesn't seem to get any useful work out of them.

We've built it with the following command using IFV 10.1.021 on a Core 2 Quad:

ifort -openmp -o parallel parallel.f

- parallel.f ------------------------
      program parallel

      use omp_lib

      integer i,count
      parameter(count=200000)
      real x(count),y(count),z(count),t
      COMMON/x/x,y,z
      integer nt,tid

      nt=omp_get_max_threads()
      write(*,*) 'threads=',nt

C$OMP PARALLEL PRIVATE(tid)
      tid=omp_get_thread_num()
      write(*,*) 'loop1=',tid

C$OMP DO
      do i=1,count
         x(i)=i
         y(i)=i
      end do
C$OMP END PARALLEL

C$OMP PARALLEL
C$OMP DO
      do i=1,count
         do j=1,count
            z(j)=x(i)+y(j)
         end do
      end do
C$OMP END PARALLEL

      t=0.0
C$OMP PARALLEL REDUCTION(+:t)
C$OMP DO
      do i=1,count
         t=t+z(i)
      end do
C$OMP END PARALLEL

      write(*,*) t

      end
--------------------------------------------

jimdempseyatthecove · ‎08-18-2008

fjeske,

The do loops you are using are performing very little computation between reads and/or writes. This results in each thread being capable of saturating the memory bus.

Jim Dempsey

fjeske · ‎08-18-2008

We were thinking that this could be possible. This example code is an obvious reduction of a much larger program that has similar constructs but does a lot more work per iteration, so we wouldn't expect the same result from that.

How do we then prove to ourselves that we are saturating the bus and not uselessly spinning the CPUs? Thread profiling?

Steve_Nuchia · ‎08-18-2008

Figure out which level of the memory hierarchy you are running from and look up the theoretical bandwidth from that level to the FPU for your hardware.

Then calculate the effective memory bandwidth for your computation: (total bytes read + total bytes written) / seconds elapsed.

If the two numbers are comparable, you're saturating.
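A rough way to apply that formula is to time a simple sweep with omp_get_wtime and divide the bytes moved by the elapsed time. The sketch below is an illustration only: the program name bandwidth, the common block name work, and the repeat count nrep are made up for the example, and it assumes the default 4-byte real. It counts one read each of x and y and one write of z per element, and ignores any extra traffic the cache hierarchy adds.

```fortran
      program bandwidth
      use omp_lib
      implicit none
      integer n, i, rep, nrep
      parameter (n=200000, nrep=1000)
      real x(n), y(n), z(n)
      common /work/ x, y, z
      double precision t0, t1, bytes

      do i = 1, n
         x(i) = i
         y(i) = i
      end do

C     Time nrep sweeps; each sweep reads x and y and writes z.
C     Adding rep keeps the sweeps from being identical.
      t0 = omp_get_wtime()
      do rep = 1, nrep
C$OMP PARALLEL DO
         do i = 1, n
            z(i) = x(i) + y(i) + rep
         end do
C$OMP END PARALLEL DO
      end do
      t1 = omp_get_wtime()

C     3 arrays * 4 bytes (assumed default real) * n * nrep sweeps.
      bytes = 3.0d0 * 4.0d0 * dble(n) * dble(nrep)
      write(*,*) 'checksum =', z(1) + z(n)
      write(*,*) 'GB/s     =', bytes / (t1 - t0) / 1.0d9
      end
```

The printed figure can then be compared with the published bandwidth for whichever level of the hierarchy holds the working set, which is about 2.4 MB for three arrays of this size.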

For the small arrays you're using, though, you ought to be able to run out of the L2. Do you have multiple CPU chips (sockets) or just multiple cores? If all your processors are sharing one L2 that won't help much and it may not fit.

If the job will fit in 4x the L1 cache size but not in any one L1, you might want to force an explicit schedule choice and set processor affinity so each stripe through the arrays stays in the L1 of one or another processor.

Also, with such small arrays your thread creation overhead will be a significant fraction of the running time of the OMP job. Using one OMP PARALLEL directive would help there, or at least it should.
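As a sketch of the single-region idea, the three loops can sit inside one OMP PARALLEL region, each with its own OMP DO, so the thread team is set up once. This is an illustration under assumptions, not the original program: the name oneregion and the common block work are hypothetical, and the middle loop is simplified to an elementwise sum to keep the example short.

```fortran
      program oneregion
      implicit none
      integer n, i
      parameter (n=200000)
      real x(n), y(n), z(n), t
      common /work/ x, y, z

      t = 0.0
C     One parallel region; the implicit barrier at each END DO
C     keeps the three loops in order.
C$OMP PARALLEL
C$OMP DO
      do i = 1, n
         x(i) = i
         y(i) = i
      end do
C$OMP END DO
C$OMP DO
      do i = 1, n
         z(i) = x(i) + y(i)
      end do
C$OMP END DO
C     t is shared in the region and reduced across the team.
C$OMP DO REDUCTION(+:t)
      do i = 1, n
         t = t + z(i)
      end do
C$OMP END DO
C$OMP END PARALLEL
      write(*,*) t
      end
```

Whether this saves measurable time depends on how the runtime handles thread reuse between regions, so it is worth timing both forms.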

Another thing to do is to use VTune to see where the pipelines are stalling. Interpreting the results takes some experience but if it's all memory I/O that should be pretty obvious.

Steve_Nuchia · ‎08-18-2008

fjeske:

C$OMP PARALLEL
C$OMP DO
      do i=1,count
         do j=1,count
            z(j)=x(i)+y(j)
         end do
      end do
C$OMP END PARALLEL

After further review, this is the only loop that should be impacting the performance of the cut-down example. It has an architectural problem: the inner loop controls the index into the destination array, so every thread writes to every element of z. This could easily run slower than the single-threaded version because the threads keep sharing dirty cache lines.

Invert the loops and each thread will have an essentially private output buffer. Make sure the scheduling options in effect for the loop give each thread a nice big chunk to work on; if it interleaves the j values it will still be slow.
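A minimal sketch of the inverted loop might look like the following. It is one reading of the suggestion above, not the poster's actual fix: the program name inverted, the common block work, and the reduced size n=2000 are assumptions made so the example runs quickly. SCHEDULE(STATIC) with no chunk size gives each thread one contiguous block of j values.

```fortran
      program inverted
      implicit none
      integer n, i, j
      parameter (n=2000)
      real x(n), y(n), z(n)
      common /work/ x, y, z

      do i = 1, n
         x(i) = i
         y(i) = i
      end do

C     j is now the parallel loop, so each thread writes only its
C     own contiguous block of z. i is the inner loop index and is
C     made private explicitly for clarity.
C$OMP PARALLEL DO SCHEDULE(STATIC) PRIVATE(i)
      do j = 1, n
         do i = 1, n
            z(j) = x(i) + y(j)
         end do
      end do
C$OMP END PARALLEL DO

      write(*,*) z(1), z(n)
      end
```

With this layout threads can only collide on the cache lines at the boundaries between blocks, rather than on every line of z.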

Eliminate the OMP directives on the other loops entirely and I bet it speeds up slightly.

fjeske · ‎08-18-2008

Thanks, that was the problem here, though it was an unintentional typo in this simple example. It now scales by 3.95x on the 4 CPUs. The lesson learned is that memory contention can be a significant issue.

Now, unfortunately, I have to figure out what's wrong with the real program.

fjeske · ‎08-18-2008

So what are the best tools for finding such issues, both in the code and in memory contention? Can something like Thread Checker notice the same coding issue? What can tell me that I have a memory contention issue, and where to find and fix it?

Steven_L_Intel1 · ‎08-18-2008

Yes - this is just the sort of thing Intel Thread Checker excels at.