We have two Xeon E5-2670 CPUs, 16 cores in total.
We get the following OpenMP performance (the code is attached below):
NUM THREADS: 1
Time: 1.53331303596497
NUM THREADS: 2
Time: 0.793078899383545
NUM THREADS: 4
Time: 0.475617885589600
NUM THREADS: 8
Time: 0.478277921676636
NUM THREADS: 14
Time: 0.479882955551147
NUM THREADS: 16
Time: 0.499575138092041
OK, this scaling is very poor once the thread count exceeds 4.
But if I uncomment the two !$OMP directives around the initialization loop, so that the initialization is also done by OpenMP, the results are quite different:
NUM THREADS: 1
Time: 1.41038393974304
NUM THREADS: 2
Time: 0.723496913909912
NUM THREADS: 4
Time: 0.386450052261353
NUM THREADS: 8
Time: 0.211269855499268
NUM THREADS: 14
Time: 0.185739994049072
NUM THREADS: 16
Time: 0.214301824569702
Why is the performance so different?
Some information:
ifort version 13.1.0
ifort -warn -openmp -vec-report=4 openmp.f90
[fortran]
PROGRAM OMPTEST
  use omp_lib
  !use mpi
  implicit none
  integer(4), parameter :: nx = 512, ny = 512, nz = 1024
  integer(4) :: ip, np, idx, nTotal = nx * ny * nz
  real(8) :: time, dx, dy, dz, bstore
  real(8), dimension(:), allocatable :: bx, ey, ez, hx
  !------------------------------------------------------------------------------|
  ! initial
  !------------------------------------------------------------------------------|
  dx = 0.3; dy = 0.4; dz = 0.5
  allocate(bx(nTotal))
  allocate(ey(nTotal))
  allocate(ez(nTotal))
  allocate(hx(nTotal))
  ! !$OMP PARALLEL DO PRIVATE(idx)
  do idx = 1, nTotal
    bx(idx) = idx
    ey(idx) = idx * 2
    ez(idx) = idx / 2
    hx(idx) = idx + 1
  enddo
  ! !$OMP END PARALLEL DO
  !------------------------------------------------------------------------------|
  ! start
  !------------------------------------------------------------------------------|
  time = omp_get_wtime()
  !$OMP PARALLEL PRIVATE(ip, bstore, idx)
  !$OMP MASTER
  np = omp_get_num_threads()
  !$OMP END MASTER
  ip = omp_get_thread_num()
  !$OMP DO
  do idx = 1, nTotal - 1
    bstore = bx(idx)
    bx(idx) = 2.0 * ((ey(idx + 1) - ey(idx)) / dz - &
                     (ez(idx + 1) - ez(idx)) / dy)
    bx(idx) = 1.0 * bx(idx) + 2.0 * ((ey(idx + 1) - ey(idx)) / dz - &
                                     (ez(idx + 1) - ez(idx)) / dy)
    hx(idx) = 3.0 * hx(idx) + 4.0 * (5.0 * bx(idx) - 6.0 * bstore)
  end do
  !$OMP END DO
  !$OMP END PARALLEL
  !------------------------------------------------------------------------------|
  ! end
  !------------------------------------------------------------------------------|
  print*, "NUM THREADS:", np
  print*, "Time: ", omp_get_wtime() - time
  print*, "Result:", sum(hx)
  deallocate(bx, ey, ez, hx)
end
[/fortran]
The reason for the difference: when the first loop is parallelized, its iteration space 1:nTotal is partitioned among the threads in the thread team. The same is true for the second loop's iteration space 1:nTotal-1. In the first loop, bx(idx), ey(idx), ez(idx), hx(idx) for the index sub-range of a specific thread are written not only to the RAM locations of the arrays, but also into the cache system of the corresponding thread, as read by the second loop (due to the same partitioning). In other words, the second loop has a higher probability of a cache hit.
Also, if your system BIOS is configured as NUMA, and if the runtime system is set up for "first touch", then at page-level granularity the pages "touched" (written) by the first loop will reside in the RAM attached (nearer) to the socket of the thread that first touches a given page. Locations subsequently referenced by the second loop that are not found in a cache then have faster RAM access, because they sit in the RAM directly attached to the CPU on which the thread resides.
Your program is an excellent example of why one should parallelize the initialization of data in the same manner as the subsequent processing of that data.
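For reference, this is the initialization loop with the directives enabled (simply the "uncommented" variant of the code above), so that each thread first-touches the same index sub-range it later processes in the timed loop:
[fortran]
! Parallel first-touch initialization: with the default static schedule,
! each thread writes (and thus first-touches) the same index sub-range of
! bx, ey, ez, hx that it will read and write again in the timed loop.
!$OMP PARALLEL DO PRIVATE(idx)
do idx = 1, nTotal
  bx(idx) = idx
  ey(idx) = idx * 2
  ez(idx) = idx / 2
  hx(idx) = idx + 1
enddo
!$OMP END PARALLEL DO
[/fortran]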
Jim Dempsey
Jim,
I am also interested in understanding the difference in performance, but I am doubtful about the local-cache/local-NUMA-pages explanation, because:
1) The amount of data is 512*512*1024 elements * 8 bytes * 4 arrays = 8 GB, which is much greater than the combined L3 cache of two 8-core Xeons (~40 MB).
2) When I modified the code to run the processing loop (the timed do idx = 1, nTotal - 1 loop) twice, the run time was identical for both runs; see the sketch after this list. That holds with both parallel and serial initialization. If the cache hit ratio were the issue, the second run would have been faster than the first.
3) I also eliminated the NUMA hypothesis by using 16 threads and KMP_AFFINITY=compact (my system is 2-socket and has 32 logical cores). With OMP_NUM_THREADS=16 and KMP_AFFINITY=compact, all threads are placed on one CPU socket. Even so, with multithreaded initialization I get faster processing than with serial initialization.
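A minimal sketch of the double-run test from point 2 (a reconstruction for illustration; the pass loop and the t0/pass variables are not in the original code):
[fortran]
! Time the identical processing loop twice; if cache reuse explained the
! speedup, the second pass would have to be faster than the first.
! (pass is an integer, t0 a real(8), declared alongside the others.)
do pass = 1, 2
  t0 = omp_get_wtime()
  !$OMP PARALLEL DO PRIVATE(bstore)
  do idx = 1, nTotal - 1
    bstore = bx(idx)
    bx(idx) = 2.0 * ((ey(idx + 1) - ey(idx)) / dz - &
                     (ez(idx + 1) - ez(idx)) / dy)
    hx(idx) = 3.0 * hx(idx) + 4.0 * (5.0 * bx(idx) - 6.0 * bstore)
  end do
  !$OMP END PARALLEL DO
  print *, "Pass", pass, "time:", omp_get_wtime() - t0
end do
[/fortran]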
Andrey
P.S.: Ronglin, if you do not declare the loop index "idx" as PRIVATE, the overall performance increases
>>With OMP_NUM_THREADS=16 and KMP_AFFINITY=compact, all threads are placed on one CPU socket
Have you verified this? The behavior seems contradictory. Assuming all threads are on the same socket, the "only" difference would be whether the non-master threads entered the timed region in an expired KMP_BLOCKTIME state.
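One quick way to verify the placement (a minimal Linux-only sketch, not part of the original program; it relies on glibc's sched_getcpu() via ISO_C_BINDING):
[fortran]
program show_affinity
  ! Each OpenMP thread reports the logical CPU it is currently running on,
  ! so the effect of KMP_AFFINITY can be checked directly.
  use omp_lib
  use iso_c_binding
  implicit none
  interface
    function sched_getcpu() bind(c, name="sched_getcpu")
      import :: c_int
      integer(c_int) :: sched_getcpu
    end function sched_getcpu
  end interface
  !$OMP PARALLEL
  print *, "thread", omp_get_thread_num(), "on logical CPU", sched_getcpu()
  !$OMP END PARALLEL
end program show_affinity
[/fortran]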
Have you run the timed loop several times under VTune to see what is going on? (Set the loop count so the run takes about 15-30 seconds, to get a meaningful statistical sample.)
Jim Dempsey
OpenMP makes the parallel loop index private by default. To take advantage of first-touch locality you will need affinity set. For one thread per core with HyperThreading enabled, you might set KMP_AFFINITY=compact,1,1.
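For example (a usage note, assuming a bash shell and an executable named a.out):
export OMP_NUM_THREADS=16
export KMP_AFFINITY=compact,1,1
./a.out
As noted above, compact,1,1 assigns one thread per physical core, skipping the HyperThreading siblings.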