Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Trying to identify a performance issue in OpenMP code

crtierney42
New Contributor I
Below is a small OpenMP code that I am testing on a 2.8 GHz Nehalem node with the Intel 11.1 compilers (build 038, to be exact). Here are the performance results:


Threads   Case1   Case2   PGI
-----------------------------
Serial      4.1     4.0    4.8
1           6.1     4.3    4.0
2           3.5     3.6    2.4
4           3.5     3.0    2.2
8           3.9     2.6    2.1

Values are in seconds. Each code is compiled with -O3 -xSSE4.2 -openmp, except the serial case, which is compiled without -openmp. The difference between the two cases is that Case 1 initializes the two arrays with whole-array assignments (a=100, b=100), whereas Case 2 assigns them explicitly in a loop. The PGI results are from PGI 10.2.

I believe that the difference in performance between the two is caused by memory affinity. Note that even in the best case the code here does not scale well. On an IBM P6, the performance scales by a factor of 7 with OMP_NUM_THREADS=8.

Experiments with KMP_AFFINITY set to compact and scatter did not make a significant difference, and in some cases made things worse.

Questions:

1) Is it possible to make Case1 go as fast as Case2?
2) Why doesn't going from 1 thread to 8 threads show a better improvement (and is that a compiler issue or a hardware issue)?
3) What tools does Intel have that will help me investigate what is going on?

I haven't had time yet, but I do intend to convert this to MPI, and I am almost positive that I will get at least a 4x speedup, if not more, using all the cores.


--------------------------------------

program relax
   real, allocatable :: a(:,:),b(:,:)
   mbyte=262144
   nb=1000
   allocate(a(0:mbyte+1,nb))
   allocate(b(0:mbyte+1,nb))

!!!! For Case 1: whole-array assignments
!  a=100
!  b=100

!!!! For Case 2: explicit loops, parallelized over the inner index k
!!!! (note: the halo cells a(0,:) and a(mbyte+1,:) are never set here)
   do n=1,nb
!$OMP parallel do
      do k=1,mbyte
         a(k,n)=100
         b(k,n)=100
      enddo
   enddo

   do iter=1,100
!$OMP parallel do
      do n=1,nb
         do k=1,mbyte
            b(k,n)=(a(k+1,n)+a(k-1,n))/2.
         end do
      end do
!$OMP parallel do
      do n=1,nb
         do k=1,mbyte
            a(k,n)=b(k,n)
         end do
      end do
      call sub(a)      ! external call, presumably to keep the arrays from being optimized away
      print *,iter
   end do
   stop
end

subroutine sub(a)
   real a(*)
   return
end

Gergana_S_Intel
Employee

Moving this over to the appropriate thread.

Regards,
~Gergana

TimP
Honored Contributor III
The PGI compiler probably invokes the equivalent of VECTOR NONTEMPORAL implicitly, if you use the right options. PGI users commonly set affinity options at compile time; you didn't say whether you are among them.
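For reference, here is a rough sketch of how the copy loop in your code might request streaming stores explicitly with ifort's VECTOR NONTEMPORAL directive; whether it actually helps at your array sizes is something to measure, not a given:

!$OMP parallel do
      do n=1,nb
!DIR$ VECTOR NONTEMPORAL
         do k=1,mbyte
            a(k,n)=b(k,n)     ! streaming stores write a() without pulling it into cache first
         end do
      end do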

Intel OpenMP doesn't support parallelization of simple array assignment.
ifort doesn't convert /2. to *0.5 automatically unless you have set -no-prec-div, or got that effect as a result of setting risky (default?) options.

If you are more careful about optimization under MPI, you may certainly get better performance. You ought to get perhaps 40% higher performance with 2 threads or MPI processes than with 1 (Case 2 vs. Case 1), if you are careful about optimization and use both sockets of a 2-socket platform. What little you show here certainly doesn't justify the use of multiple cores, and it would display the pitfalls of misusing HyperThreads.
crtierney42
New Contributor I
Thanks for the detailed reply. I looked at what was going on, and I think I understand better.

1) The MPI version goes about 3x faster between 1 core and 8 cores; this is still better than the OpenMP version, but not by much.
2) From the results above, and from looking at the code, it is memory-bandwidth limited.
3) The 7x reported before on an IBM P6 is when each thread is laid out on its own socket, instead of using both cores of a socket before moving to the next one. This means that the process using 8 threads had access to 8 memory controllers, explaining the linear speed-up.

So the last questions are:

1) There is still a discrepancy between the MPI and OpenMP versions. Should this be expected for this core count, or in general?
2) Why, when compiling F90 code with OpenMP, are arrays not initialized in parallel?
3) Why do you say this case doesn't justify the use of multiple cores and shows the pitfalls of misusing HyperThreads? HT is disabled on our servers.

Thanks,
Craig
jimdempseyatthecove
Honored Contributor III
I notice that you allocate a and b as (0:mbyte+1,nb) but process them as (1:mbyte,1:nb).
This may cause alignment inefficiencies when the code is vectorized.
Check to see if you can perform one of the following:
a) allocate (1:mbyte, nb)
or
b) include the 0th (and mbyte+1-th) elements in your loops
or
c) allocate (-3:mbyte+1,nb) and discard/ignore the -1:, -2:, -3: cells, so that k=1 sits 16 bytes (four real*4 values) past the start of the allocation (a sketch follows below).
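For example, a rough sketch of option (c); here I have also rounded the upper bound up to a multiple of 4 elements (an extra padding step beyond what I wrote above), so that every column, not just the first, keeps its k=1 element on a 16-byte boundary. This assumes the allocator hands back 16-byte-aligned storage:

      klast = ((mbyte+1+3)/4)*4      ! round mbyte+1 up to a multiple of 4 real*4 elements
      allocate(a(-3:klast,nb))       ! elements -3:0 give 16 bytes of lead-in padding per column
      allocate(b(-3:klast,nb))
      ! the loops still run k=1,mbyte; cells -3:-1 and mbyte+2:klast are simply ignored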

Jim Dempsey

Grant_H_Intel
Employee

Your code shows that you are using different threads to initialize the arrays than the threads you use to do the computation. Why not parallelize the outer initialization loop (over n), so that the data is initialized at least in the same last-level cache as the one it will later be operated on from?
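For instance, a sketch along these lines (untested) initializes each column with the same static distribution over n that the compute loops use, and also touches the halo cells a(0,n) and a(mbyte+1,n) that the stencil reads:

!$OMP parallel do
      do n=1,nb
         do k=0,mbyte+1
            a(k,n)=100     ! first touch places each page near the thread that owns column n
            b(k,n)=100
         enddo
      enddo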

I also have an answer for one of your questions:

2) Why, when compiling F90 code with OpenMP, are arrays not initialized in parallel?

Because it is not typically possible for the compiler to tell how the data should be split up among the threads on a NUMA system. (For your VERY simple example it may be possible, but not for large applications in general.) The programmer, however, should know this and be able to do a good job (using parallel do with the default static scheduling). With MPI you have to move the data around explicitly, which is the equivalent kind of optimization.

Finally, why not combine the parallel do loops that do the calculation of a() and b() into a single parallel region? (See !$omp parallel and !$omp do. I'm suggesting using one "!$omp parallel" with two "!$omp do" loops inside.)
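Something roughly along these lines (a sketch only, not tested):

      do iter=1,100
!$OMP parallel
!$OMP do
         do n=1,nb
            do k=1,mbyte
               b(k,n)=(a(k+1,n)+a(k-1,n))/2.
            end do
         end do
!$OMP do
         do n=1,nb
            do k=1,mbyte
               a(k,n)=b(k,n)
            end do
         end do
!$OMP end parallel
         call sub(a)
         print *,iter
      end do

The implicit barrier at the end of each !$omp do keeps the two sweeps correctly ordered, but you pay the fork/join overhead only once per iteration instead of twice.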

If you do these simple things, you may get a bit more speedup compared to the MPI version (assuming you are getting some cache re-use already). But if you are truly bandwidth limited, the only way to break through that performance barrier is to make better re-use of the caches. This often involves careful algorithm restructuring and requires significant amounts of work (like MPI).
