Intel® Fortran Compiler

Large variation in processing time for an OpenMP-threaded iterative application (OpenMP thread affinity problem?)

may_ka
Beginner

Hi all. The code below implements an OpenMP-threaded iterative algorithm where the threading takes place in subroutine outer. The general idea is that the algorithm works on n consecutive chunks of memory (columns of some arrays), where each thread gets unique columns assigned. Work on the columns is performed within subroutine inner using shared data (a large sparse matrix and a vector). Memory accessed in read/write mode is unique to each thread; some memory accessed in read-only mode is shared (e.g. the large sparse array). When measuring the processing time of the outermost loop (c0) I found a range from roughly 25s to 130s. Also, the first round is usually the worst, at ~250s. I have made all sorts of trials with OMP thread affinity settings and switching hyper-threading on/off, but I could not stabilize the processing time per round. A fluctuation in the range of, say, 10% would be fine, but it is almost 500%. In my test runs the parameter controlling the number of threads (isnch) was set to 36, which is the number of physical cores. The computer has 2 sockets with 18 cores each; hyper-threading doubles the number of logical cores.

Module Test
  !$ use omp_lib
  Implicit None
  !!kind parameters are assumed here for completeness; in the original code they
  !!are defined elsewhere
  Integer, Parameter :: rkdbl=selected_real_kind(15), rkl=rkdbl
  Integer, Parameter :: ikxl=selected_int_kind(18), ikl=selected_int_kind(9)
  Type :: tss
    real(rkdbl), allocatable :: rvvalues(:)
    integer(ikxl), allocatable :: ivrowpos(:), ivcolpos(:)
    integer(ikxl) :: isnrows
  end type tss
contains
  Subroutine outer(isnch,isns,tsc,rmran,rmsa,rmrhs,rmpesq,rvdi,rmtmp,rmrantmp)
    Implicit None
    Integer(Ikxl), Intent(in) :: isnch,isns
    Type(tss), intent(in) :: tsc
    Real(rkdbl), intent(inout), dimension(:,:) :: rmran,rmsa,rmrhs,rmpesq,rmtmp,rmrantmp
    Real(rkdbl), intent(in), dimension(:) :: rvdi
    Integer(Ikxl) :: c0, c1
    real(rkdbl) :: t0, t1
    !$ integer(ikl) :: isnt=1
    !$ isnt=minval((/omp_get_max_threads(),int(isnch,kind=ikl)/))
    !$OMP PARALLEL num_threads(isnt) private(c0)
    !$OMP FLUSH
    do c0=1,isns
      !$OMP SINGLE
      !$ t0=omp_get_wtime()
      !$OMP END SINGLE
      !$OMP DO PRIVATE(c1)
      Do c1=1,isnch
        !!generate random numbers in parallel, and write into rmran
        !!works ok
      end Do
      !$OMP END DO
      !$OMP SINGLE
      !$ t1=omp_get_wtime()
      !$ t0=omp_get_wtime()
      !$OMP END SINGLE
      !$OMP DO PRIVATE(c1)
      !!call inner supplying thread specific column vectors of rmran,
      !!rmsa, rmrhs, rmpesq and rmtmp, shared rvdi and tsc
      Do c1=1,isnch
        call inner(rvran=rmran(:,c1),&
          &rvsa=rmsa(:,c1),&
          &rvrhs=rmrhs(:,c1),&
          &rvpesq=rmpesq(:,c1),&
          &rvdi=rvdi,&
          &rvtmp=rmtmp(:,c1),&
          &rvrantmp=rmrantmp(:,c1),&
          &tsc=tsc)
      End Do
      !$OMP END DO
      !$OMP SINGLE
      !$ t1=omp_get_wtime()
      !$OMP END SINGLE
    end do
    !$OMP END PARALLEL
  End Subroutine
  Subroutine inner(rvran,rvsa,rvrhs,rvpesq,rvdi,rvtmp,rvrantmp,tsc)
    !!operates on columns specific to this thread but uses shared
    !!data TSC and rvdi
    Implicit None
    Type(tss), intent(in) :: tsc
    Real(rkdbl), intent(inout), dimension(:) :: rvran,rvsa,rvrhs,&
      &rvpesq,rvtmp,rvrantmp
    Real(rkdbl), intent(in), dimension(:) :: rvdi
    Integer(Ikxl) :: c1, iss, ise, isd
    Real(Rkl) :: rssol, rssa_o
    Do c1=1,tsc%isnrows
      rssol=rvrhs(c1)*rvdi(c1)
      rvpesq(c1)=rvpesq(c1)+rssol**2
      rssa_o=rvsa(c1)
      rvsa(c1)=rssol+rvran(c1)
      rvrantmp(c1)=rssa_o-rvsa(c1)
      iss=tsc%ivrowpos(c1)+1
      ise=tsc%ivrowpos(c1+1)-1
      isd=ise-iss+1
      rvtmp(1:isd)=rvrhs(tsc%ivcolpos(iss:ise))+tsc%rvvalues(iss:ise)*rvrantmp(c1)
      rvrhs(tsc%ivcolpos(iss:ise))=rvtmp(1:isd)
    End do
    Do c1=tsc%isnrows-1,1,-1
      iss=tsc%ivrowpos(c1)+1;ise=tsc%ivrowpos(c1+1)-1
      rvrhs(c1)=rvrhs(c1)+&
        &sum(tsc%rvvalues(iss:ise)*&
        &(rvrantmp(tsc%ivcolpos(iss:ise))))
    end do
  End Subroutine inner
end Module Test
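For reference, the timing variables t0 and t1 are set above but the print statement is omitted from the excerpt. A minimal sketch of how the per-round time of the inner section could be reported (not part of the original code) is to extend the last SINGLE block inside the c0 loop:

      !$OMP SINGLE
      !$ t1=omp_get_wtime()
      !$ write(*,'(a,i0,a,f10.2,a)') 'round ',c0,': inner section took ',t1-t0,' s'
      !$OMP END SINGLE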

My current affinity settings are:

OMP: Warning #181: OMP_PROC_BIND: ignored because KMP_AFFINITY has been defined
OMP: Warning #181: OMP_PLACES: ignored because KMP_AFFINITY has been defined

OPENMP DISPLAY ENVIRONMENT BEGIN
   _OPENMP='201511'
  [host] OMP_CANCELLATION='FALSE'
  [host] OMP_DEFAULT_DEVICE='0'
  [host] OMP_DISPLAY_ENV='TRUE'
  [host] OMP_DYNAMIC='FALSE'
  [host] OMP_MAX_ACTIVE_LEVELS='2147483647'
  [host] OMP_MAX_TASK_PRIORITY='0'
  [host] OMP_NESTED='TRUE'
  [host] OMP_NUM_THREADS='72'
  [host] OMP_PLACES: value is not defined
  [host] OMP_PROC_BIND='intel'
  [host] OMP_SCHEDULE='static'
  [host] OMP_STACKSIZE='2000M'
  [host] OMP_THREAD_LIMIT='72'
  [host] OMP_WAIT_POLICY='PASSIVE'
OPENMP DISPLAY ENVIRONMENT END

with KMP_AFFINITY set to

export KMP_AFFINITY=verbose,granularity=fine,compact,1,0

The output of "verbose" when hyperthreading was switched off was:

OMP: Info #209: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #207: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35}
OMP: Info #156: KMP_AFFINITY: 36 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 18 cores/pkg x 1 threads/core (36 total cores)
OMP: Info #211: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1 
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 2 
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 3 
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 4 
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 8 
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 9 
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 0 core 10 
OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 0 core 11 
OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 0 core 16 
OMP: Info #171: KMP_AFFINITY: OS proc 10 maps to package 0 core 17 
OMP: Info #171: KMP_AFFINITY: OS proc 11 maps to package 0 core 18 
OMP: Info #171: KMP_AFFINITY: OS proc 12 maps to package 0 core 19 
OMP: Info #171: KMP_AFFINITY: OS proc 13 maps to package 0 core 20 
OMP: Info #171: KMP_AFFINITY: OS proc 14 maps to package 0 core 24 
OMP: Info #171: KMP_AFFINITY: OS proc 15 maps to package 0 core 25 
OMP: Info #171: KMP_AFFINITY: OS proc 16 maps to package 0 core 26 
OMP: Info #171: KMP_AFFINITY: OS proc 17 maps to package 0 core 27 
OMP: Info #171: KMP_AFFINITY: OS proc 18 maps to package 1 core 0 
OMP: Info #171: KMP_AFFINITY: OS proc 19 maps to package 1 core 1 
OMP: Info #171: KMP_AFFINITY: OS proc 20 maps to package 1 core 2 
OMP: Info #171: KMP_AFFINITY: OS proc 21 maps to package 1 core 3 
OMP: Info #171: KMP_AFFINITY: OS proc 22 maps to package 1 core 4 
OMP: Info #171: KMP_AFFINITY: OS proc 23 maps to package 1 core 8 
OMP: Info #171: KMP_AFFINITY: OS proc 24 maps to package 1 core 9 
OMP: Info #171: KMP_AFFINITY: OS proc 25 maps to package 1 core 10 
OMP: Info #171: KMP_AFFINITY: OS proc 26 maps to package 1 core 11 
OMP: Info #171: KMP_AFFINITY: OS proc 27 maps to package 1 core 16 
OMP: Info #171: KMP_AFFINITY: OS proc 28 maps to package 1 core 17 
OMP: Info #171: KMP_AFFINITY: OS proc 29 maps to package 1 core 18 
OMP: Info #171: KMP_AFFINITY: OS proc 30 maps to package 1 core 19 
OMP: Info #171: KMP_AFFINITY: OS proc 31 maps to package 1 core 20 
OMP: Info #171: KMP_AFFINITY: OS proc 32 maps to package 1 core 24 
OMP: Info #171: KMP_AFFINITY: OS proc 33 maps to package 1 core 25 
OMP: Info #171: KMP_AFFINITY: OS proc 34 maps to package 1 core 26 
OMP: Info #171: KMP_AFFINITY: OS proc 35 maps to package 1 core 27 
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 1920 thread 0 bound to OS proc set {0}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2013 thread 1 bound to OS proc set {18}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2014 thread 2 bound to OS proc set {1}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2015 thread 3 bound to OS proc set {19}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2016 thread 4 bound to OS proc set {2}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2017 thread 5 bound to OS proc set {20}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2018 thread 6 bound to OS proc set {3}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2019 thread 7 bound to OS proc set {21}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2020 thread 8 bound to OS proc set {4}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2021 thread 9 bound to OS proc set {22}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2022 thread 10 bound to OS proc set {5}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2023 thread 11 bound to OS proc set {23}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2024 thread 12 bound to OS proc set {6}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2026 thread 14 bound to OS proc set {7}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2025 thread 13 bound to OS proc set {24}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2027 thread 15 bound to OS proc set {25}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2028 thread 16 bound to OS proc set {8}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2029 thread 17 bound to OS proc set {26}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2030 thread 18 bound to OS proc set {9}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2031 thread 19 bound to OS proc set {27}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2032 thread 20 bound to OS proc set {10}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2033 thread 21 bound to OS proc set {28}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2034 thread 22 bound to OS proc set {11}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2036 thread 24 bound to OS proc set {12}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2035 thread 23 bound to OS proc set {29}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2037 thread 25 bound to OS proc set {30}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2038 thread 26 bound to OS proc set {13}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2039 thread 27 bound to OS proc set {31}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2040 thread 28 bound to OS proc set {14}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2041 thread 29 bound to OS proc set {32}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2042 thread 30 bound to OS proc set {15}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2043 thread 31 bound to OS proc set {33}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2044 thread 32 bound to OS proc set {16}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2045 thread 33 bound to OS proc set {34}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2046 thread 34 bound to OS proc set {17}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2047 thread 35 bound to OS proc set {35}

However, I have also made tests with hyper-threading switched on, not setting KMP_AFFINITY but using OMP settings like:

export OMP_THREAD_LIMIT=72
export OMP_STACKSIZE=2000M
export OMP_DYNAMIC=FALSE
export OMP_PLACES=cores
export OMP_PROC_BIND=spread
export OMP_DISPLAY_ENV=true
export OMP_NESTED=true

but the fluctuation in processing time was as bad as described above.

Any suggestions as to what I am doing wrong are highly appreciated.

Note that I cannot give a local copy of TSC to each thread due to memory restrictions.

Thanks

jimdempseyatthecove
Honored Contributor III

What is the value of tsc%isnrows?

Why do you have OMP_STACKSIZE at 2 GB?
36 * 2GB = 72GB for stack assuming not nested

Why do you have OMP_NESTED=true?
Is outer called from within a parallel region???

36 * 36 * 2GB if one nest level...

Jim Dempsey

 

 

TimP
Honored Contributor III

If your application requires such an excessive value of OMP_STACKSIZE, that appears to be a problem.  I've never seen one which ran well with more than 45M.  Default of 4M (in 64-bit mode) would be sufficient if you don't allocate significant memory inside the threaded region.  Even then, with 36 threads and single level parallel, you would be tying up 144MB local to the parallel region, and potentially allocating and deallocating as you enter and leave.  It's hard to read the code to check whether you have automatic arrays; if not, you should be OK with that default.  I have seen cases where reducing to OMP_STACKSIZE=2M showed benefit.  Rather than automatic arrays, should you have any, it seems preferable to use allocatable for error checking as well as to see what is going on.
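For example, the difference can be seen in a sketch like this (hypothetical names): auto_buf is an automatic array that lives on the thread stack and counts against OMP_STACKSIZE inside a parallel region, whereas heap_buf is allocatable, goes on the heap, and can be checked for allocation failure.

  subroutine work(n)
    implicit none
    integer, intent(in) :: n
    real(8) :: auto_buf(n)              !!automatic array: sized from a dummy argument,
                                        !!placed on the (thread) stack
    real(8), allocatable :: heap_buf(:) !!allocatable array: placed on the heap
    integer :: istat
    allocate(heap_buf(n), stat=istat)   !!allocation failure can be detected
    if (istat /= 0) then
      write(*,*) 'allocation of heap_buf failed for n =', n
      return
    end if
    auto_buf = 0.0d0
    heap_buf = 0.0d0
    !!... work with the buffers ...
    deallocate(heap_buf)
  end subroutine work

With ifort, the -heap-arrays option can also be used to move automatic arrays and array temporaries to the heap.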

If you set OMP_PLACES=cores (an important experiment), I believe you must set OMP_NUM_THREADS to the number of cores to assure that each CPU gets half the threads.  In order to try a smaller number of threads, you would need to specify placement individually (e.g. by setting KMP_AFFINITY or OMP_PROC_BIND with appropriate skip factors).  You should be able to set KMP_AFFINITY=verbose, to check the mapping, without overriding OMP_PLACES.
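Since the environment dump above shows _OPENMP='201511' (OpenMP 4.5), the binding can also be checked from inside the program with the place routines of omp_lib; a minimal sketch, not part of the original code:

  subroutine report_binding()
    use omp_lib
    implicit none
    !$omp parallel
    !$omp critical
    write(*,'(a,i3,a,i3,a,i3)') 'thread ', omp_get_thread_num(), ' of ', &
         omp_get_num_threads(), ' is bound to place ', omp_get_place_num()
    !$omp end critical
    !$omp end parallel
  end subroutine report_binding

Note that omp_get_place_num() returns -1 if no place list is in effect.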

In the map above, you have even threads on one CPU and odd threads on the other, if I'm reading it right. As you appear to depend strongly on accessing shared memory, this would prevent adjacent threads from sharing cache effectively.  As you say, that would be an affinity problem.  If you really do need such an excessive OMP_STACKSIZE, that could also contribute to variability in performance.  By itself, though, the run time should be more consistent than what you observe, as you have affinities set and are using default scheduling.

 

may_ka
Beginner

Hi Jim,

tsc is the name of the sparse matrix container, isnrows is the number of rows of the sparse matrix.

OMP_STACKSIZE is an outcome of a discussion I had here a year ago where applications were crashing when threaded (https://software.intel.com/en-us/forums/intel-fortran-compiler-for-linux-and-mac-os-x/topic/720999, #22).

OMP_NESTED is a hang-over from other applications; it is just my general bashrc setting. Sorry for the confusion. I am aware that there is nothing to nest here.

No "outer" is NOT called from within a parallel version. For the complete application, parallelization happens only in the code given above.

Let me know if you need more information.

Thanks

 

may_ka
Beginner

Hi Tim,

As pointed out above, the OMP_STACKSIZE value was a recommendation from Intel. Until recently I have seen applications crashing with smaller OMP_STACKSIZE values.

Sorry for the long code.

From my understanding there should be absolutely no automatic arrays. All memory is allocated before entering outer.

Cheers

jimdempseyatthecove
Honored Contributor III

>>as pointed out above, the omp_stacksize was a recommendation from intel
I cannot imagine Intel suggesting you use 2GB.

>>Till recently I have seen applications crashing with smaller omp_stacksize values.
Each application may have different stacksize requirements.

>>tsc is the name of the sparse matrix container, isnrows is the number of rows of the sparse matrix

That tells me what it is, but does not tell me what the typical value(s) is(are).

If these values are relatively small, then fewer threads may be better than more threads.

I agree with TimP's assessment of the thread placement, although I might add that, for a specific problem, following a general rule might not yield the most efficient configuration. For example:

Is it better to utilize the L3 cache of one CPU(socket) or to distribute it amongst multiple CPUs(sockets)?
Is it better (when two cores share an L2) to use 1 core in the L2 or both cores in the L2?
Within the above considerations, is it better to use the number of threads that evenly distributes the workload or the most available threads?

The answers to these can only be determined with experimentation.

Jim Dempsey

may_ka
Beginner

Hi Jim

for the stack size:

Martyn Corden (Intel) wrote:

But most important, you need a much larger value of the thread stack size. I was able to build and run both variants successfully with 8 threads and OMP_STACKSIZE=5000M . I didn't try to determine optimum values.

Sure, every application differs.

The size of "rvvalues" in my application was 3Bio. Same for "ivcolpos". "ivrowpos" was about 60Mio. The size of the whole system on disk about 60GB. All threads specific(non-shared) vectors entering "inner" are of the same size 60Mio.

I'll do some more experiments to answer the other questions.

Cheers

jimdempseyatthecove
Honored Contributor III

>>vectors entering "inner" have the same length of 60 million

So to be clear, in inner, tsc%isnrows is 60,000,000, i.e. a 480MB array per thread in the inner loop. That clearly exceeds the size of the L3 cache for one thread, let alone 36 threads.

The loops shown in inner contain a small amount of computation relative to the number of loads and stores. For this kind of memory-bound computation you might wish to restrict the number of threads on each socket to the number of memory channels, or a small number above that.

For example, if your system has 2 Xeon Gold 6150s, each with 18 cores, 36 threads and 6 memory channels (12 memory channels in total), consider experimenting with

KMP_AFFINITY=scatter
OMP_NUM_THREADS=12

See how 12 does, then try 18 and 24 threads.
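A sketch of such a sweep (a hypothetical wrapper, reusing the argument list and kind parameters of outer from your post, with the arrays and tsc already set up by the caller):

  Subroutine sweep(isnch,isns,tsc,rmran,rmsa,rmrhs,rmpesq,rvdi,rmtmp,rmrantmp)
    use Test
    !$ use omp_lib
    Implicit None
    Integer(Ikxl), Intent(in) :: isnch,isns
    Type(tss), intent(in) :: tsc
    Real(rkdbl), intent(inout), dimension(:,:) :: rmran,rmsa,rmrhs,rmpesq,rmtmp,rmrantmp
    Real(rkdbl), intent(in), dimension(:) :: rvdi
    Integer :: i
    Integer, parameter :: counts(3)=(/12,18,24/)  !!candidate thread counts
    Real(rkdbl) :: tstart
    do i=1,size(counts)
      !$ call omp_set_num_threads(counts(i))  !!outer picks this up via omp_get_max_threads()
      !$ tstart=omp_get_wtime()
      call outer(isnch,isns,tsc,rmran,rmsa,rmrhs,rmpesq,rvdi,rmtmp,rmrantmp)
      !$ write(*,'(i3,a,f10.2,a)') counts(i),' threads: ',omp_get_wtime()-tstart,' s'
    end do
  End Subroutine sweep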

 
