Hi
I have previously written ccNUMA-aware code in Fortran by initializing my arrays in parallel using the "first touch" principle, but something appears to have changed recently, because this no longer works. For memory-bandwidth-sensitive code I used to see performance scale linearly with the number of NUMA nodes in the system, but running the code below I now obtain virtually identical results for the NUMA-aware and the non-NUMA-aware versions ...
Any suggestions as to what is causing this? I have tested the code on both 2-socket Intel systems and 4-socket AMD systems with the same result ...
Best regards,
C
program Console6
   use ifport
   use omp_lib
   implicit none
   integer*8 :: i, j, N
   integer :: Repetitions
   real*8, allocatable :: iVector(:), oVector(:)
   real*8 :: RuntimeBegin, RuntimeEnd, Flops
   logical :: Success

   N = 2e8
   allocate(iVector(N))
   allocate(oVector(N))
   Success = SETENVQQ("KMP_AFFINITY=verbose,scatter")
   !$OMP PARALLEL
   ! Do nothing except initialize the OMP threads ...
   !$OMP END PARALLEL
   call omp_set_num_threads(8)
   Repetitions = 50

   ! Initialize the data using first touch - everything will reside on the NUMA node of the master thread
   do i = 1, N
      iVector(i) = 1d0
      oVector(i) = 0d0
   end do

   ! Perform the calculation
   RuntimeBegin = omp_get_wtime()
   !$OMP PARALLEL private(i) shared(iVector,oVector,N)
   !$OMP DO SCHEDULE(STATIC)
   do j = 1, Repetitions
      do i = 1, N
         oVector(i) = oVector(i) + iVector(i)*0.01
      end do
   end do
   !$OMP END DO
   !$OMP END PARALLEL
   RuntimeEnd = omp_get_wtime()
   print *, oVector(1)
   Flops = 2.0*N*Repetitions/((RuntimeEnd - RuntimeBegin)*1024**3)
   print *, 'NO DISTRIBUTION ACROSS NUMA NODES ...'
   print *, 'Time=', RuntimeEnd - RuntimeBegin, 'GFlops=', Flops

   ! Deallocate the data and repeat the calculation with the data distributed across the NUMA nodes of the system
   deallocate(iVector)
   deallocate(oVector)
   allocate(iVector(N))
   allocate(oVector(N))

   ! Distribute the data across the NUMA nodes using the first touch principle ...
   !$OMP PARALLEL private(i) shared(iVector,oVector,N)
   !$OMP DO SCHEDULE(STATIC)
   do i = 1, N
      iVector(i) = 1d0
      oVector(i) = 0d0
   end do
   !$OMP END DO
   !$OMP END PARALLEL

   RuntimeBegin = omp_get_wtime()
   !$OMP PARALLEL private(i) shared(iVector,oVector,N)
   !$OMP DO SCHEDULE(STATIC)
   do j = 1, Repetitions
      do i = 1, N
         oVector(i) = oVector(i) + iVector(i)*0.01
      end do
   end do
   !$OMP END DO
   !$OMP END PARALLEL
   RuntimeEnd = omp_get_wtime()
   print *, oVector(1)
   Flops = 2.0*N*Repetitions/((RuntimeEnd - RuntimeBegin)*1024**3)
   print *, 'DATA DISTRIBUTED ACROSS NUMA NODES ...'
   print *, 'Time=', RuntimeEnd - RuntimeBegin, 'GFlops=', Flops
end program Console6
Oops - sorry! I accidentally posted an early version of the code with an obvious error in it ... The correct code is below ...
program Console6
   use ifport
   use omp_lib
   implicit none
   integer*8 :: i, j, N
   integer :: Repetitions
   real*8, allocatable :: iVector(:), oVector(:)
   real*8 :: RuntimeBegin, RuntimeEnd, Flops
   logical :: Success

   N = 5e8
   allocate(iVector(N))
   allocate(oVector(N))
   Success = SETENVQQ("KMP_AFFINITY=verbose,scatter")
   !$OMP PARALLEL
   ! Do nothing except initialize the OMP threads ...
   !$OMP END PARALLEL
   call omp_set_num_threads(8)
   Repetitions = 20

   ! Initialize the data using first touch - everything will reside on the NUMA node of the master thread
   do i = 1, N
      iVector(i) = 1d0
      oVector(i) = 0d0
   end do

   ! Perform the calculation
   ! Note: j must be private - every thread executes the repetition loop itself
   RuntimeBegin = omp_get_wtime()
   !$OMP PARALLEL private(i,j) shared(iVector,oVector,N)
   do j = 1, Repetitions
      !$OMP DO SCHEDULE(STATIC)
      do i = 1, N
         oVector(i) = oVector(i) + iVector(i)*0.01
      end do
      !$OMP END DO
   end do
   !$OMP END PARALLEL
   RuntimeEnd = omp_get_wtime()
   print *, oVector(1)
   Flops = 2.0*N*Repetitions/((RuntimeEnd - RuntimeBegin)*1024**3)
   print *, 'NO DISTRIBUTION ACROSS NUMA NODES ...'
   print *, 'Time=', RuntimeEnd - RuntimeBegin, 'GFlops=', Flops

   ! Deallocate the data and repeat the calculation with the data distributed across the NUMA nodes of the system
   deallocate(iVector)
   deallocate(oVector)
   allocate(iVector(N))
   allocate(oVector(N))

   ! Distribute the data across the NUMA nodes using the first touch principle ...
   !$OMP PARALLEL private(i) shared(iVector,oVector,N)
   !$OMP DO SCHEDULE(STATIC)
   do i = 1, N
      iVector(i) = 1d0
      oVector(i) = 0d0
   end do
   !$OMP END DO
   !$OMP END PARALLEL

   RuntimeBegin = omp_get_wtime()
   !$OMP PARALLEL private(i,j) shared(iVector,oVector,N)
   do j = 1, Repetitions
      !$OMP DO SCHEDULE(STATIC)
      do i = 1, N
         oVector(i) = oVector(i) + iVector(i)*0.01
      end do
      !$OMP END DO
   end do
   !$OMP END PARALLEL
   RuntimeEnd = omp_get_wtime()
   print *, oVector(1)
   Flops = 2.0*N*Repetitions/((RuntimeEnd - RuntimeBegin)*1024**3)
   print *, 'DATA DISTRIBUTED ACROSS NUMA NODES ...'
   print *, 'Time=', RuntimeEnd - RuntimeBegin, 'GFlops=', Flops
end program Console6
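To see why the initialization loop and the compute loop must use the same SCHEDULE(STATIC), here is a minimal stand-alone sketch (the program name FirstTouchCheck and the chunk-reporting logic are illustrative additions, not part of the benchmark) that prints which contiguous chunk of iterations each thread owns under the static schedule. First touch works because the same thread - and hence the same NUMA node - gets the same chunk in both loops:
program FirstTouchCheck
   use omp_lib
   implicit none
   integer*8 :: i, N, lo, hi
   integer :: tid
   N = 16
   !$OMP PARALLEL private(i,tid,lo,hi) shared(N)
   tid = omp_get_thread_num()
   lo = huge(lo)
   hi = -1_8
   ! Same schedule as the benchmark loops: one contiguous chunk per thread
   !$OMP DO SCHEDULE(STATIC)
   do i = 1, N
      lo = min(lo, i)
      hi = max(hi, i)
   end do
   !$OMP END DO
   !$OMP CRITICAL
   if (hi >= lo) print *, 'thread', tid, 'touched elements', lo, 'to', hi
   !$OMP END CRITICAL
   !$OMP END PARALLEL
end program FirstTouchCheck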
You can close this post - this simple example is actually working fine, so the problem must be located somewhere in my actual code base ...
Sorry for jumping to conclusions ...
Best regards,
C
I finally found the cause of the problem and am sharing it here since it might save other people some time. In the example below, the value of the KMP_AFFINITY environment variable is never applied, despite being set correctly. However, if you remove the call to KMP_SET_STACKSIZE_S, everything works just fine.
I have reported the problem to Premier Support ...
Cheers,
C
program Console7
   use ifport
   use omp_lib
   implicit none
   logical(4) :: LSuccess
   integer*4 :: ISuccess
   character(80) :: Val

   call KMP_SET_STACKSIZE_S(1000000)
   LSuccess = SETENVQQ("KMP_AFFINITY=verbose,scatter")
   ISuccess = GETENVQQ("KMP_AFFINITY", Val)
   print *, Val
   !$OMP PARALLEL
   !$OMP END PARALLEL
   print *, 'We should see affinity settings on screen!'
end program Console7
What happens if you perform SETENVQQ("KMP_AFFINITY=verbose,scatter") before the call to KMP_SET_STACKSIZE_S(1000000)?
The thought being: SETENVQQ is a non-OpenMP function, whereas KMP_SET_STACKSIZE_S is an OpenMP runtime function which may have the side effect of instantiating the OpenMP thread pool (before the environment variable has been set).
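For instance, a minimal reordered sketch of Console7 above (untested; the name Console7b is just illustrative, using the same ifport/omp_lib routines):
program Console7b
   use ifport
   use omp_lib
   implicit none
   logical(4) :: LSuccess
   integer*4 :: ISuccess
   character(80) :: Val

   ! Set the environment variable BEFORE the first call into the OpenMP runtime
   LSuccess = SETENVQQ("KMP_AFFINITY=verbose,scatter")
   call KMP_SET_STACKSIZE_S(1000000)
   ISuccess = GETENVQQ("KMP_AFFINITY", Val)
   print *, Val
   !$OMP PARALLEL
   !$OMP END PARALLEL
   print *, 'Affinity settings should now appear on screen.'
end program Console7b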
Jim Dempsey
Hi Jim
It works as intended if I follow your suggestion and make the calls in reverse order, but isn't this behaviour in conflict with the OpenMP specification? I have always been told that the thread pool is initialized at the first user OpenMP region, but that could of course be wrong ...
C
