We have two Xeon E5-2670 CPUs, 16 cores in total.
We get the following OpenMP performance (the code is attached below):
NUM THREADS: 1
Time: 1.53331303596497
NUM THREADS: 2
Time: 0.793078899383545
NUM THREADS: 4
Time: 0.475617885589600
NUM THREADS: 8
Time: 0.478277921676636
NUM THREADS: 14
Time: 0.479882955551147
NUM THREADS: 16
Time: 0.499575138092041
OK, this scaling is very poor once the thread count exceeds 4.
But if I uncomment the two !$OMP directives around the initialization loop, so that the initialization is also done by OpenMP, the results are quite different:
NUM THREADS: 1
Time: 1.41038393974304
NUM THREADS: 2
Time: 0.723496913909912
NUM THREADS: 4
Time: 0.386450052261353
NUM THREADS: 8
Time: 0.211269855499268
NUM THREADS: 14
Time: 0.185739994049072
NUM THREADS: 16
Time: 0.214301824569702
Why is the performance so different?
Some information:
ifort version 13.1.0
ifort -warn -openmp -vec-report=4 openmp.f90
[fortran]
PROGRAM OMPTEST
  use omp_lib
  !use mpi
  implicit none
  integer(4), parameter :: nx = 512, ny = 512, nz = 1024
  integer(4) :: ip, np, idx, nTotal = nx * ny * nz
  real(8) :: time, dx, dy, dz, bstore
  real(8), dimension(:), allocatable :: bx, ey, ez, hx
  !------------------------------------------------------------------------------|
  ! initial
  !------------------------------------------------------------------------------|
  dx = 0.3; dy = 0.4; dz = 0.5
  allocate(bx(nTotal))
  allocate(ey(nTotal))
  allocate(ez(nTotal))
  allocate(hx(nTotal))
  ! !$OMP PARALLEL DO PRIVATE(idx)
  do idx = 1, nTotal
    bx(idx) = idx
    ey(idx) = idx * 2
    ez(idx) = idx / 2
    hx(idx) = idx + 1
  enddo
  ! !$OMP END PARALLEL DO
  !------------------------------------------------------------------------------|
  ! start
  !------------------------------------------------------------------------------|
  time = omp_get_wtime()
  !$OMP PARALLEL PRIVATE(ip, bstore, idx)
  !$OMP MASTER
  np = omp_get_num_threads()
  !$OMP END MASTER
  ip = omp_get_thread_num()
  !$OMP DO
  do idx = 1, nTotal - 1
    bstore = bx(idx)
    bx(idx) = 2.0 * ((ey(idx + 1) - ey(idx)) / dz - &
                     (ez(idx + 1) - ez(idx)) / dy)
    bx(idx) = 1.0 * bx(idx) + 2.0 * ((ey(idx + 1) - ey(idx)) / dz - &
                                     (ez(idx + 1) - ez(idx)) / dy)
    hx(idx) = 3.0 * hx(idx) + 4.0 * (5.0 * bx(idx) - 6.0 * bstore)
  end do
  !$OMP END DO
  !$OMP END PARALLEL
  !------------------------------------------------------------------------------|
  ! end
  !------------------------------------------------------------------------------|
  print*, "NUM THREADS:", np
  print*, "Time: ", omp_get_wtime() - time
  print*, "Result:", sum(hx)
  deallocate(bx, ey, ez, hx)
end
[/fortran]
The reason for the difference: when the first loop is parallelized, its iteration space 1:nTotal is partitioned among the threads in the thread team. The same is true for the second loop's iteration space 1:nTotal-1. In the first loop, bx(idx), ey(idx), ez(idx), hx(idx) for the index sub-range of a specific thread are written not only to the RAM locations of the arrays, but also into the cache system of the corresponding thread, as read by the second loop (due to the same partitioning). In other words, the second loop has a higher probability of a cache hit.
Also, if your system BIOS is configured as NUMA, and if the runtime system is set up for "first touch", then at page-level granularity the pages "touched" (written) by the first loop will reside in the RAM attached (nearer) to the socket of the thread that first touches a given page. Locations subsequently referenced by the second loop that are not found in a cache then have faster RAM access, because they sit in the RAM directly attached to the CPU on which the thread resides.
Your program is an excellent example of why one should parallelize the initialization of data in the same manner as the subsequent processing of that data.
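For reference, this is the initialization loop with the directives enabled (simply the "uncommented" variant of the code above), so that each thread first-touches the same index sub-range it later processes in the timed loop:
[fortran]
! Parallel first-touch initialization: with the default static schedule,
! each thread writes (and thus first-touches) the same index sub-range of
! bx, ey, ez, hx that it will read and write again in the timed loop.
!$OMP PARALLEL DO PRIVATE(idx)
do idx = 1, nTotal
  bx(idx) = idx
  ey(idx) = idx * 2
  ez(idx) = idx / 2
  hx(idx) = idx + 1
enddo
!$OMP END PARALLEL DO
[/fortran]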
Jim Dempsey
Jim,
I am also interested in understanding the difference in performance, but I am doubtful about the local-cache/local-NUMA-pages explanation, because:
1) The amount of data is 512*512*1024 elements * 8 bytes * 4 arrays = 8 GB, which is much greater than the combined L3 cache of two 8-core Xeons (~40 MB).
2) When I modified the code to run the processing loop (the timed do idx = 1, nTotal - 1 loop) twice, the run time was identical for both runs; see the sketch after this list. That holds with both parallel and serial initialization. If the cache hit ratio were the issue, the second run would have been faster than the first.
3) I also eliminated the NUMA hypothesis by using 16 threads and KMP_AFFINITY=compact (my system is 2-socket and has 32 logical cores). With OMP_NUM_THREADS=16 and KMP_AFFINITY=compact, all threads are placed on one CPU socket. Even so, with multithreaded initialization I get faster processing than with serial initialization.
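A minimal sketch of the double-run test from point 2 (a reconstruction for illustration; the pass loop and the t0/pass variables are not in the original code):
[fortran]
! Time the identical processing loop twice; if cache reuse explained the
! speedup, the second pass would have to be faster than the first.
! (pass is an integer, t0 a real(8), declared alongside the others.)
do pass = 1, 2
  t0 = omp_get_wtime()
  !$OMP PARALLEL DO PRIVATE(bstore)
  do idx = 1, nTotal - 1
    bstore = bx(idx)
    bx(idx) = 2.0 * ((ey(idx + 1) - ey(idx)) / dz - &
                     (ez(idx + 1) - ez(idx)) / dy)
    hx(idx) = 3.0 * hx(idx) + 4.0 * (5.0 * bx(idx) - 6.0 * bstore)
  end do
  !$OMP END PARALLEL DO
  print *, "Pass", pass, "time:", omp_get_wtime() - t0
end do
[/fortran]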
Andrey
P.S.: Ronglin, if you do not declare the loop index "idx" as PRIVATE, the overall performance increases
>>With OMP_NUM_THREADS=16 and KMP_AFFINITY=compact, all threads are placed on one CPU socket
Have you verified this? The behavior seems contradictory. Assuming all threads are on the same socket, the "only" difference would be whether the non-master threads entered the timed region in an expired KMP_BLOCKTIME state.
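One quick way to verify the placement (a minimal Linux-only sketch, not part of the original program; it relies on glibc's sched_getcpu() via ISO_C_BINDING):
[fortran]
program show_affinity
  ! Each OpenMP thread reports the logical CPU it is currently running on,
  ! so the effect of KMP_AFFINITY can be checked directly.
  use omp_lib
  use iso_c_binding
  implicit none
  interface
    function sched_getcpu() bind(c, name="sched_getcpu")
      import :: c_int
      integer(c_int) :: sched_getcpu
    end function sched_getcpu
  end interface
  !$OMP PARALLEL
  print *, "thread", omp_get_thread_num(), "on logical CPU", sched_getcpu()
  !$OMP END PARALLEL
end program show_affinity
[/fortran]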
Have you run the timed loop several times under VTune to see what is going on? (Set the loop count so the run takes about 15-30 seconds, to get a meaningful statistical sample.)
Jim Dempsey
OpenMP makes the parallel loop index private by default. To take advantage of first-touch locality you will need affinity set. For one thread per core with HyperThreading enabled, you might set KMP_AFFINITY=compact,1,1.
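For example (a usage note, assuming a bash shell and an executable named a.out):
export OMP_NUM_THREADS=16
export KMP_AFFINITY=compact,1,1
./a.out
As noted above, compact,1,1 assigns one thread per physical core, skipping the HyperThreading siblings.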