Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Problem distributing data across ccNUMA nodes

PKM
Beginner
565 Views

Hi

I have previously written ccNUMA-aware code in Fortran by initializing my arrays in parallel using the "first touch" principle, but something appears to have changed recently so this no longer works. For memory-bandwidth-sensitive code I used to see performance scale linearly with the number of NUMA nodes in the system, but running the code below I now obtain virtually identical results for the NUMA-aware and non-NUMA-aware versions ...

Any suggestions as to what is causing this? I have tested the code on both two-socket Intel systems and four-socket AMD systems with the same result ...

Best regards,

C

 

  

    program Console6
    use ifport
    use omp_lib
    implicit none
    integer*8          :: I,J,N
    integer            :: Repetitions
    real*8,allocatable :: iVector(:),oVector(:)
    real*8             :: Runtimebegin,RuntimeEnd,FLops
    logical            :: Success
    N=2e8
    allocate(iVector(N))
    allocate(oVector(N))   
    success = SETENVQQ("KMP_AFFINITY=verbose,scatter")
!$OMP PARALLEL
!Do nothing except for initializing the OMP threads ...
!$OMP END PARALLEL    
   call omp_set_num_Threads(8)
   Repetitions=50
   !initialize the data structure using first touch - everything will reside on the NUMA node of the master thread
   do i=1,N
     iVector(i)=1d0
     oVector(i)=0d0
   end do
   !Perform calculation  
   RuntimeBegin=omp_get_wtime()
!$OMP PARALLEL private(i) shared(iVector,oVector,N)
!$OMP DO SCHEDULE(STATIC)
   do j=1,Repetitions
     do i=1,N
      oVector(i)=oVector(i)+iVector(i)*0.01
     end do
   end do  
!$OMP END DO
!$OMP END PARALLEL
    print *,(oVector(1))
    RuntimeEnd=omp_get_wtime()
    Flops=2.0*N*Repetitions/((RunTimeEnd-RunTimeBegin)*1024**3)
    print *,'NO DISTRIBUTION ACROSS NUMA NODES ...'
    print *,'Time=',RunTimeEnd-RuntimeBegin,'GFlops=',Flops

   !Deallocate the data and repeat the calculation with the data distributed across the NUMA nodes of the system
   deallocate(iVector)
   deallocate(oVector)
   allocate(iVector(N))
   allocate(oVector(N))  
   !Distribute the data across NUMA nodes using the first touch principle ...
!$OMP PARALLEL private(i) shared(iVector,oVector,N)
!$OMP DO  SCHEDULE(STATIC)
     do i=1,N
       iVector(i)=1d0
       oVector(i)=0d0
     end do
!$OMP END DO
!$OMP END PARALLEL
    
    RuntimeBegin=omp_get_wtime()
!$OMP PARALLEL private(i) shared(iVector,oVector,N)
!$OMP DO  SCHEDULE(STATIC)
   do j=1,Repetitions
     do i=1,N
      oVector(i)=oVector(i)+iVector(i)*0.01
     end do
   end do  
!$OMP END DO
!$OMP END PARALLEL
    print *,(oVector(1))
    RuntimeEnd=omp_get_wtime()
    Flops=2.0*N*Repetitions/((RunTimeEnd-RunTimeBegin)*1024**3)
    print *,'DATA DISTRIBUTED ACROSS NUMA NODES ...'
    print *,'Time=',RunTimeEnd-RuntimeBegin,'GFlops=',Flops
     
    end program Console6

5 Replies
PKM
Beginner

Oops - sorry! I accidentally posted an early version of the code with an obvious error in it ... The correct code is found below ...

 

   program Console6
    use ifport
    use omp_lib
    implicit none
    integer*8          :: I,J,N
    integer            :: Repetitions
    real*8,allocatable :: iVector(:),oVector(:)
    real*8             :: Runtimebegin,RuntimeEnd,FLops
    logical            :: Success
    N=5e8
    allocate(iVector(N))
    allocate(oVector(N))   
    success = SETENVQQ("KMP_AFFINITY=verbose,scatter")
!$OMP PARALLEL
!Do nothing except for initializing the OMP threads ...
!$OMP END PARALLEL    
   call omp_set_num_Threads(8)
   Repetitions=20
   !initialize the data structure using first touch - everything will reside on the NUMA node of the master thread
   do i=1,N
     iVector(i)=1d0
     oVector(i)=0d0
   end do
   !Perform calculation  
   RuntimeBegin=omp_get_wtime()
!$OMP PARALLEL private(i) shared(iVector,oVector,N)
     do j=1,Repetitions
!$OMP DO SCHEDULE(STATIC)
       do i=1,N
        oVector(i)=oVector(i)+iVector(i)*0.01
       end do
!$OMP END DO
    end do  
!$OMP END PARALLEL
    RuntimeEnd=omp_get_wtime()
    print *,(oVector(1))
    Flops=2.0*N*Repetitions/((RunTimeEnd-RunTimeBegin)*1024**3)
    print *,'NO DISTRIBUTION ACROSS NUMA NODES ...'
    print *,'Time=',RunTimeEnd-RuntimeBegin,'GFlops=',Flops

   !Deallocate the data and repeat the calculation with the data distributed across the NUMA nodes of the system
   deallocate(iVector)
   deallocate(oVector)
   allocate(iVector(N))
   allocate(oVector(N))  
   !Distribute the data across NUMA nodes using the first touch principle ...
!$OMP PARALLEL private(i) shared(iVector,oVector,N)
!$OMP DO  SCHEDULE(STATIC)
     do i=1,N
       iVector(i)=1d0
       oVector(i)=0d0
     end do
!$OMP END DO
!$OMP END PARALLEL
    
    RuntimeBegin=omp_get_wtime()
!$OMP PARALLEL private(i) shared(iVector,oVector,N)
     do j=1,Repetitions  
!$OMP DO  SCHEDULE(STATIC)
       do i=1,N
        oVector(i)=oVector(i)+iVector(i)*0.01
       end do  
!$OMP END DO
   end do
!$OMP END PARALLEL
    RuntimeEnd=omp_get_wtime()
    print *,(oVector(1))
    Flops=2.0*N*Repetitions/((RunTimeEnd-RunTimeBegin)*1024**3)
    print *,'DATA DISTRIBUTED ACROSS NUMA NODES ...'
    print *,'Time=',RunTimeEnd-RuntimeBegin,'GFlops=',Flops
     
    end program Console6

PKM
Beginner

You can close this post - this simple example actually works fine, so the problem must be located somewhere in my actual code base ...

Sorry for jumping to conclusions ...

Best regards,

C

PKM
Beginner

I finally found the cause of the problem and am sharing it here since it might save other people some time. In the example below, the value of the KMP_AFFINITY environment variable is never applied, despite being set correctly. However, if you remove the call to KMP_SET_STACKSIZE_S it works just fine.

I have reported the problem to premier support ...

Cheers,

C

 

    program Console7
    use ifport
    use omp_lib
    implicit none
    logical(4) :: LSuccess
    integer*4      :: ISuccess
    character(80)  :: Val
    call KMP_SET_STACKSIZE_S(1000000)
    Lsuccess = SETENVQQ("KMP_AFFINITY=verbose,scatter")    
    ISuccess =GETENVQQ("KMP_AFFINITY", val)
    print *,Val
!$OMP PARALLEL
!$OMP END PARALLEL
    print *,'We should see affinity settings on screen!'
    end program Console7
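For what it's worth, a workaround that sidesteps any in-process ordering issue entirely is to set the variable in the parent shell before the program starts, so the OpenMP runtime sees it no matter when the thread pool is created. A minimal sketch, assuming a bash-like shell (the executable name is hypothetical):

```shell
# Set the affinity policy in the parent environment before launching,
# so the OpenMP runtime reads it regardless of when the pool is created.
export KMP_AFFINITY=verbose,scatter
# ./console7   (hypothetical executable name)
```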

jimdempseyatthecove
Honored Contributor III

What happens if you call SETENVQQ("KMP_AFFINITY=verbose,scatter") before calling KMP_SET_STACKSIZE_S(1000000)?

The thought being: SETENVQQ is a non-OpenMP function, whereas KMP_SET_STACKSIZE_S is an OpenMP runtime function which may have the side effect of instantiating the OpenMP thread pool (before the environment variable has been set).
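A minimal sketch of the reordered sequence (untested; same calls as in Console7 above, just swapped so no OpenMP runtime function runs before the environment variable is in place):

```fortran
    program Console7b
    use ifport
    use omp_lib
    implicit none
    logical(4) :: LSuccess
    ! Set the environment variable first, before anything that might
    ! touch the OpenMP runtime ...
    LSuccess = SETENVQQ("KMP_AFFINITY=verbose,scatter")
    ! Only now call into the OpenMP/KMP runtime
    call KMP_SET_STACKSIZE_S(1000000)
!$OMP PARALLEL
!$OMP END PARALLEL
    print *,'Affinity settings should now appear on screen'
    end program Console7b
```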

Jim Dempsey

PKM
Beginner

Hi Jim

It works as intended if I follow your suggestion and make the calls in the reverse order, but isn't this behaviour in conflict with the OpenMP specification? I have always been told that the thread pool is initialized at the first user OpenMP region, but of course that could be wrong ...

C
