Hi all. The code below implements an OpenMP-threaded iterative algorithm in which the threading takes place in subroutine outer. The general idea is that the algorithm works on n consecutive chunks of memory (columns of some arrays), with each thread getting unique columns assigned. Work on the columns is performed within subroutine inner using shared data (a large sparse matrix and a vector). Memory accessed in read/write mode is unique to each thread; some memory accessed in read-only mode is shared (e.g. the large sparse matrix). When measuring the processing time of the outermost loop (c0) I found a range from roughly 25 s to 130 s, and the first round is usually the worst at ~250 s. I have made all sorts of trials with OMP thread-affinity settings and with hyper-threading switched on and off, but I could not stabilize the processing time per round. If the fluctuation were within, say, 10% I would be OK, but it is almost 500%. In my test runs the parameter affecting the number of threads (isnch) was set to 36, which is the number of real cores. The computer has 2 sockets with 18 cores each; hyper-threading doubles the number of logical cores.
Module Test
!$ use omp_lib

  Type :: tss
    real(rkdbl), allocatable :: rvvalues(:)
    integer(ikxl), allocatable :: ivrowpos(:), ivcolpos(:)
    integer(ikxl) :: isnrows
  end type tss

contains

  Subroutine outer(isnch,isns,tsc,rmran,rmsa,rmrhs,rmpesq,rvdi,rmtmp,rmrantmp)
    Implicit None
    Integer(Ikxl), Intent(in) :: isnch,isns
    Type(tss), intent(in) :: tsc
    Real(rkdbl), intent(inout), dimension(:,:) :: rmran,rmsa,rmrhs,rmpesq,rmtmp,rmrantmp
    Real(rkdbl), intent(in), dimension(:) :: rvdi
    Integer(Ikxl) :: c0, c1
    real(rkdbl) :: t0, t1
    !$ integer(ikl) :: isnt=1

    !$ isnt=minval((/omp_get_max_threads(),int(isnch,kind=ikl)/))

    !$OMP PARALLEL num_threads(isnt)
    !$OMP FLUSH
    do c0=1,isns

      !$OMP SINGLE
      !$ t0=omp_get_wtime()
      !$OMP END SINGLE

      !$OMP DO PRIVATE(c1)
      Do c1=1,isnch
        !!generate random numbers in parallel, and write into rmran
        !!works ok
      end Do
      !$OMP END DO

      !$OMP SINGLE
      !$ t1=omp_get_wtime()
      !$ t0=omp_get_wtime()
      !$OMP END SINGLE

      !$OMP DO PRIVATE(c1)
      !!call inner supplying thread specific column vectors of rmran,
      !!rmsa, rmrhs, rmpesq and rmtmp, shared rvdi and tsc
      Do c1=1,ISnch
        call inner(rvran=rmran(:,c1),&
             &rvsa=rmsa(:,c1),&
             &rvrhs=rmrhs(:,c1),&
             &rvpesq=rmpesq(:,c1),&
             &rvdi=rvdi,&
             &rvtmp=rmtmp(:,c1),&
             &rvrantmp=rmrantmp(:,c1),&
             &tsc=tsc)
      End Do
      !$OMP END DO

      !$OMP SINGLE
      !$ t1=omp_get_wtime()
      !$OMP END SINGLE

    end do
    !$OMP END PARALLEL
  End Subroutine

  Subroutine inner(rvran,rvsa,rvrhs,rvpesq,rvdi,rvtmp,rvrantmp,tsc)
    !!operates on columns specific to this thread but uses shared
    !!data TSC and rvdi
    Implicit None
    Type(tss), intent(in) :: tsc
    Real(rkdbl), intent(inout), dimension(:) :: rvran,rvsa,rvrhs,&
         &rvpesq,rvtmp,rvrantmp
    Real(rkdbl), intent(in), dimension(:) :: rvdi
    Integer(Ikxl) :: c1, iss, ise, isd
    Real(Rkl) :: rssol, rssa_o

    Do c1=1,tsc%isnrows
      rssol=rvrhs(c1)*rvdi(c1)
      rvpesq(c1)=rvpesq(c1)+rssol**2
      rssa_o=rvsa(c1)
      rvsa(c1)=rssol+rvran(c1)
      rvrantmp(c1)=rssa_o-rvsa(c1)
      iss=tsc%ivrowpos(c1)+1
      ise=tsc%ivrowpos(c1+1)-1
      isd=ise-iss+1
      rvtmp(1:isd)=rvrhs(tsc%ivcolpos(iss:ise))+tsc%rvvalues(iss:ise)*rvrantmp(c1)
      rvrhs(tsc%ivcolpos(iss:ise))=rvtmp(1:isd)
    End do

    Do c1=tsc%isnrows-1,1,-1
      iss=tsc%ivrowpos(c1)+1; ise=tsc%ivrowpos(c1+1)-1
      rvrhs(c1)=rvrhs(c1)+&
           &sum(tsc%rvvalues(iss:ise)*&
           &(rvrantmp(tsc%ivcolpos(iss:ise))))
    end do
  End Subroutine inner

end Module Test
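To narrow down whether the slow rounds come from a few laggard threads or from the whole team, one thing that can help is timing each thread's share of the second worksharing loop rather than only the whole round. Below is a minimal, self-contained sketch of that pattern; it is not part of my code above, and the program name, the dummy work in the loop and the helper array tthread are made up for illustration only:

program time_per_thread
!$ use omp_lib
  implicit none
  integer, parameter :: n = 5000000
  double precision, allocatable :: x(:), tthread(:)
  double precision :: t0
  integer :: i, it, nt

  allocate(x(n)); x = 1.0d0
  nt = 1
!$ nt = omp_get_max_threads()
  allocate(tthread(nt)); tthread = 0.0d0

!$OMP PARALLEL PRIVATE(i, it, t0)
  it = 1
!$ it = omp_get_thread_num() + 1
!$ t0 = omp_get_wtime()
!$OMP DO
  do i = 1, n
     x(i) = sqrt(x(i)) + 1.0d0        ! stand-in for the real work (e.g. call inner)
  end do
!$OMP END DO NOWAIT
!$ tthread(it) = omp_get_wtime() - t0 ! per-thread time, taken before the barrier
!$OMP END PARALLEL

  write(*,'(a,f8.4,a,f8.4)') ' fastest thread [s]: ', minval(tthread), &
       '   slowest thread [s]: ', maxval(tthread)
end program time_per_thread

If the gap between the fastest and slowest thread per round is large, the problem is load imbalance or placement; if all threads are uniformly slow in a bad round, it is more likely memory bandwidth or interference from outside the program.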
My current affinity settings are:
OMP: Warning #181: OMP_PROC_BIND: ignored because KMP_AFFINITY has been defined
OMP: Warning #181: OMP_PLACES: ignored because KMP_AFFINITY has been defined

OPENMP DISPLAY ENVIRONMENT BEGIN
  _OPENMP='201511'
  [host] OMP_CANCELLATION='FALSE'
  [host] OMP_DEFAULT_DEVICE='0'
  [host] OMP_DISPLAY_ENV='TRUE'
  [host] OMP_DYNAMIC='FALSE'
  [host] OMP_MAX_ACTIVE_LEVELS='2147483647'
  [host] OMP_MAX_TASK_PRIORITY='0'
  [host] OMP_NESTED='TRUE'
  [host] OMP_NUM_THREADS='72'
  [host] OMP_PLACES: value is not defined
  [host] OMP_PROC_BIND='intel'
  [host] OMP_SCHEDULE='static'
  [host] OMP_STACKSIZE='2000M'
  [host] OMP_THREAD_LIMIT='72'
  [host] OMP_WAIT_POLICY='PASSIVE'
OPENMP DISPLAY ENVIRONMENT END
with KMP_AFFINITY set to
export KMP_AFFINITY=verbose,granularity=fine,compact,1,0
The output of "verbose" when hyper-threading was switched off was:
OMP: Info #209: KMP_AFFINITY: decoding x2APIC ids.
OMP: Info #207: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35}
OMP: Info #156: KMP_AFFINITY: 36 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 18 cores/pkg x 1 threads/core (36 total cores)
OMP: Info #211: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 0 core 1
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 2
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 0 core 3
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 4
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 0 core 8
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 9
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 0 core 10
OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 0 core 11
OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 0 core 16
OMP: Info #171: KMP_AFFINITY: OS proc 10 maps to package 0 core 17
OMP: Info #171: KMP_AFFINITY: OS proc 11 maps to package 0 core 18
OMP: Info #171: KMP_AFFINITY: OS proc 12 maps to package 0 core 19
OMP: Info #171: KMP_AFFINITY: OS proc 13 maps to package 0 core 20
OMP: Info #171: KMP_AFFINITY: OS proc 14 maps to package 0 core 24
OMP: Info #171: KMP_AFFINITY: OS proc 15 maps to package 0 core 25
OMP: Info #171: KMP_AFFINITY: OS proc 16 maps to package 0 core 26
OMP: Info #171: KMP_AFFINITY: OS proc 17 maps to package 0 core 27
OMP: Info #171: KMP_AFFINITY: OS proc 18 maps to package 1 core 0
OMP: Info #171: KMP_AFFINITY: OS proc 19 maps to package 1 core 1
OMP: Info #171: KMP_AFFINITY: OS proc 20 maps to package 1 core 2
OMP: Info #171: KMP_AFFINITY: OS proc 21 maps to package 1 core 3
OMP: Info #171: KMP_AFFINITY: OS proc 22 maps to package 1 core 4
OMP: Info #171: KMP_AFFINITY: OS proc 23 maps to package 1 core 8
OMP: Info #171: KMP_AFFINITY: OS proc 24 maps to package 1 core 9
OMP: Info #171: KMP_AFFINITY: OS proc 25 maps to package 1 core 10
OMP: Info #171: KMP_AFFINITY: OS proc 26 maps to package 1 core 11
OMP: Info #171: KMP_AFFINITY: OS proc 27 maps to package 1 core 16
OMP: Info #171: KMP_AFFINITY: OS proc 28 maps to package 1 core 17
OMP: Info #171: KMP_AFFINITY: OS proc 29 maps to package 1 core 18
OMP: Info #171: KMP_AFFINITY: OS proc 30 maps to package 1 core 19
OMP: Info #171: KMP_AFFINITY: OS proc 31 maps to package 1 core 20
OMP: Info #171: KMP_AFFINITY: OS proc 32 maps to package 1 core 24
OMP: Info #171: KMP_AFFINITY: OS proc 33 maps to package 1 core 25
OMP: Info #171: KMP_AFFINITY: OS proc 34 maps to package 1 core 26
OMP: Info #171: KMP_AFFINITY: OS proc 35 maps to package 1 core 27
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 1920 thread 0 bound to OS proc set {0}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2013 thread 1 bound to OS proc set {18}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2014 thread 2 bound to OS proc set {1}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2015 thread 3 bound to OS proc set {19}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2016 thread 4 bound to OS proc set {2}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2017 thread 5 bound to OS proc set {20}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2018 thread 6 bound to OS proc set {3}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2019 thread 7 bound to OS proc set {21}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2020 thread 8 bound to OS proc set {4}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2021 thread 9 bound to OS proc set {22}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2022 thread 10 bound to OS proc set {5}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2023 thread 11 bound to OS proc set {23}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2024 thread 12 bound to OS proc set {6}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2026 thread 14 bound to OS proc set {7}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2025 thread 13 bound to OS proc set {24}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2027 thread 15 bound to OS proc set {25}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2028 thread 16 bound to OS proc set {8}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2029 thread 17 bound to OS proc set {26}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2030 thread 18 bound to OS proc set {9}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2031 thread 19 bound to OS proc set {27}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2032 thread 20 bound to OS proc set {10}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2033 thread 21 bound to OS proc set {28}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2034 thread 22 bound to OS proc set {11}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2036 thread 24 bound to OS proc set {12}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2035 thread 23 bound to OS proc set {29}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2037 thread 25 bound to OS proc set {30}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2038 thread 26 bound to OS proc set {13}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2039 thread 27 bound to OS proc set {31}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2040 thread 28 bound to OS proc set {14}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2041 thread 29 bound to OS proc set {32}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2042 thread 30 bound to OS proc set {15}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2043 thread 31 bound to OS proc set {33}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2044 thread 32 bound to OS proc set {16}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2045 thread 33 bound to OS proc set {34}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2046 thread 34 bound to OS proc set {17}
OMP: Info #247: KMP_AFFINITY: pid 1920 tid 2047 thread 35 bound to OS proc set {35}
However, I have also made tests with hyper-threading switched on, using no KMP_AFFINITY but OMP settings like:
export OMP_THREAD_LIMIT=72
export OMP_STACKSIZE=2000M
export OMP_DYNAMIC=FALSE
export OMP_PLACES=cores
export OMP_PROC_BIND=spread
export OMP_DISPLAY_ENV=true
export OMP_NESTED=true
but the fluctuation in processing time was as bad as described above.
Any suggestions about what I am doing wrong are highly appreciated.
Note that I cannot give a local copy of TSC to each thread due to memory restrictions.
Thanks
What is the value of tsc%isnrows?
Why do you have OMP_STACKSIZE at 2 GB?
36 * 2GB = 72GB for stack assuming not nested
Why do you have OMP_NESTED=true?
Is outer called from within a parallel region???
36 * 36 * 2GB if one nest level...
Jim Dempsey
If your application requires such an excessive value of OMP_STACKSIZE, that appears to be a problem. I've never seen one which ran well with more than 45M. Default of 4M (in 64-bit mode) would be sufficient if you don't allocate significant memory inside the threaded region. Even then, with 36 threads and single level parallel, you would be tying up 144MB local to the parallel region, and potentially allocating and deallocating as you enter and leave. It's hard to read the code to check whether you have automatic arrays; if not, you should be OK with that default. I have seen cases where reducing to OMP_STACKSIZE=2M showed benefit. Rather than automatic arrays, should you have any, it seems preferable to use allocatable for error checking as well as to see what is going on.
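To illustrate the point about automatic arrays versus allocatables, here is a minimal sketch with made-up routine names (not taken from the code above): the automatic array is created on each thread's stack and therefore counts against OMP_STACKSIZE, while the allocatable version lives on the heap and can report failure through stat=.

subroutine with_automatic(n)
  implicit none
  integer, intent(in) :: n
  double precision :: work(n)      ! automatic array: allocated on the (thread) stack
  work = 0.0d0
end subroutine with_automatic

subroutine with_allocatable(n)
  implicit none
  integer, intent(in) :: n
  double precision, allocatable :: work(:)
  integer :: istat
  allocate(work(n), stat=istat)    ! heap allocation: failure can be detected and handled
  if (istat /= 0) then
     write(*,*) 'allocation of work failed, stat =', istat
     return
  end if
  work = 0.0d0
  deallocate(work)
end subroutine with_allocatable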
If you set OMP_PLACES=cores (an important experiment), I believe you must set OMP_NUM_THREADS to the number of cores to assure that each CPU gets half the threads. In order to try a smaller number of threads, you would need to specify placement individually (e.g. by setting KMP_AFFINITY or OMP_PROC_BIND with appropriate skip factors). You should be able to set KMP_AFFINITY=verbose to check the mapping, without overriding OMP_PLACES.
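For instance, a setting of the kind meant here could look like the following (the thread count 24 is only an illustration, not a recommendation):

export OMP_PLACES=cores
export OMP_PROC_BIND=spread
export OMP_NUM_THREADS=24
export KMP_AFFINITY=verbose

With only the verbose modifier, KMP_AFFINITY should report the mapping while leaving placement to OMP_PLACES and OMP_PROC_BIND.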
In the map above, you have even threads on one CPU and odd on the other, if I'm reading it right. As you appear to depend strongly on accessing shared memory, this would prevent adjacent threads from splitting the cache effectively. As you say, that would be an affinity problem. Perhaps, if you have the problem with excessive OMP_STACKSIZE, this could contribute to variability in performance. By itself, it should be more consistent than what you observe, as you have affinities set and are using default scheduling.
Hi Jim,
tsc is the name of the sparse matrix container, isnrows is the number of rows of the sparse matrix.
OMP_STACKSIZE is an outcome of a discussion I had here a year ago, where applications were crashing when threaded (https://software.intel.com/en-us/forums/intel-fortran-compiler-for-linux-and-mac-os-x/topic/720999, #22).
OMP_NESTED is a hang-over from other applications; it is just my general .bashrc setting. Sorry for the confusion. I am aware that there is nothing to nest here.
No, "outer" is NOT called from within a parallel region. In the complete application, parallelization happens only in the code given above.
Let me know if you need more information.
Thanks
Hi Tim,
As pointed out above, the OMP_STACKSIZE was a recommendation from Intel. Until recently I have seen applications crash with smaller OMP_STACKSIZE values.
Sorry for the long code.
From my understanding there should be absolutely no automatic arrays. All memory is allocated before entering outer.
Cheers
>>As pointed out above, the OMP_STACKSIZE was a recommendation from Intel
I cannot imagine Intel suggesting you use 2GB
>>Until recently I have seen applications crash with smaller OMP_STACKSIZE values.
Each application may have different stacksize requirements.
>>tsc is the name of the sparse matrix container, isnrows is the number of rows of the sparse matrix
That tells me what it is, but does not tell me what the typical value(s) is(are).
If this value (or these values) is relatively small, then fewer threads may be better than more threads.
I agree with TimP's prognosis of thread placement, although I might add that, for a specific problem, following a general rule might not yield the most efficient implementation. For example:
Is it better to utilize the L3 cache of one CPU(socket) or to distribute it amongst multiple CPUs(sockets)?
Is it better (when two cores share an L2) to use 1 core in the L2 or both cores in the L2?
Within the above considerations, is it better to use the number of threads that evenly distributes the workload or the most available threads?
The answers to these can only be determined with experimentation.
Jim Dempsey
Hi Jim
For the stack size:
Martyn Corden (Intel) wrote:
But most important, you need a much larger value of the thread stack size. I was able to build and run both variants successfully with 8 threads and OMP_STACKSIZE=5000M . I didn't try to determine optimum values.
Sure, every application differs.
The size of "rvvalues" in my application was about 3 billion elements; the same for "ivcolpos". "ivrowpos" was about 60 million. The size of the whole system on disk is about 60 GB. All thread-specific (non-shared) vectors entering "inner" are of the same size, 60 million elements.
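To put rough numbers on that (assuming 8-byte reals and 8-byte integers, which is my assumption here and not stated above): rvvalues and ivcolpos at ~3 billion elements are roughly 24 GB each, and ivrowpos at ~60 million elements is about 0.5 GB, so the shared tsc alone is on the order of 50 GB, consistent with the ~60 GB on disk. Each of the six thread-specific column vectors passed to inner is 60 million x 8 B, about 0.5 GB, i.e. roughly 3 GB of read/write data per thread and on the order of 100 GB across 36 columns.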
I'll do some more experiments to answer the other questions.
Cheers
>>vectors entering "inner" are of the same size, 60 million elements
So to be clear, in inner, tsc%isnrows is 60,000,000, i.e. a 480 MB array in the inner loop, per thread. That clearly exceeds the size of the L3 cache for 1 thread, let alone 36 threads.
The loops shown in inner contain a small amount of computation relative to the number of loads and stores. For memory-bound computation you might wish to restrict the number of threads on each socket to the number of memory channels, or some small number above that.
For example, if your system has two Xeon Gold 6150s, each with 18 cores, 36 threads and 6 memory channels (12 memory channels in total), consider experimenting using
KMP_AFFINITY=scatter
OMP_NUM_THREADS=12
See how 12 does, then try 18 and 24 threads.
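One way to run that sweep might be a simple loop over thread counts (sketch only; ./myprog stands in for the actual executable name, which I don't know):

export KMP_AFFINITY=scatter
for nt in 12 18 24 36 ; do
  export OMP_NUM_THREADS=$nt
  ./myprog
done

Comparing the per-round times across these runs should show whether the code saturates memory bandwidth well below 36 threads.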