Hello, I have been testing some Fortran code with OpenMP (shown below); it consists of calls to a subroutine and random number generation.
Taking the potential for conflicts among threads into account, I carefully distinguished the private and shared variables.
When I run the code on a Windows machine (with ifort version 14), the scalability is very good (almost linear).
However, when I run the same code on a Linux machine (with the newer ifort version 15), the performance is very poor.
I wonder why such a big difference arises. Is there something I must take care of (compile options, affinity, OpenMP environment variable settings for Linux machines, etc.)?
I compiled with just "ifort /Qopenmp /Qmkl" on Windows and "ifort -qopenmp -mkl" on Linux.
Wall time
Windows 10, Intel Xeon E3-1241 v3 (4 core), Intel Visual Fortran compiler 14.0.3.202
0.03215956 (1 thread)
0.01608227 (2 threads)
0.01093314 (3 threads)
0.00844201 (4 threads)
Windows 10, Intel Xeon Gold 6140 2CPU (18*2 = 36 cores), Intel Visual Fortran compiler 14.0.3.202
0.02839129 (1 thread)
0.02303925 (2 threads)
0.01156269 (4 threads)
0.00576751 (8 threads)
0.00294846 (16 threads)
0.00150034 (32 threads)
Linux(CentOS 6.6), Xeon E5-2695v3 (14 cores), ifort (IFORT) 15.0.3 20150407
0.02155900 (1 thread)
0.01605296 (2 threads)
0.01835489 (4 threads)
0.02340984 (8 threads)
!$ st = omp_get_wtime()
!$omp parallel default(none), &
!$omp firstprivate(pmax,N_tot,dt,M_rs,K1,iCellNumber1,iCellNumber2,iCellNumber3,l1,l2,l3,dx1,dx2,dx3,redx1,redx2,redx3, &
!$omp knudsen), &
!$omp private(sjp,fjp,j,xDp,zetaDp,buf_xD,icoll_event,c_rsp,rsp,dtsub,errcode,int8,c_rs_sum_Mp,iincx,iincy,iincz, &
!$omp irn), &
!$omp shared(xD,zetaD,c_rs,rs,streamR,c_rs_sum_M), num_threads(2)
!$omp do
do p = 1, pmax
   !copy the values of the shared variables to the corresponding private variables (p fixed)
   c_rsp = c_rs(p)
   rsp(1:M_rs) = rs(1:M_rs,p)
   c_rs_sum_Mp = c_rs_sum_M(p)
   !first (sjp) and last (fjp) indices of chunk p: the N_tot iterations are
   !split as evenly as possible over pmax chunks
   sjp = 1 + (p-1)*(N_tot/pmax) + min((p-1),mod(N_tot,pmax))
   fjp = sjp + N_tot/pmax - 1
   if( mod(N_tot,pmax) >= p ) fjp = fjp + 1
   do j = sjp, fjp
      !copy the values of the shared variables to the corresponding private variables (j fixed)
      xDp(1:3) = xD(1:3,j)
      zetaDp(1:3) = zetaD(1:3,j)
      buf_xD(1:3) = xDp(1:3)
      !advection
      xDp(1:3) = xDp(1:3) + dt * zetaDp(1:3)
      !for each j the main computation is carried out; it consists of a call to
      !subroutine COMPUTATION and the generation of random variables using Intel MKL
      do
         icoll_event = 0
         call COMPUTATION(buf_xD(1),buf_xD(2),buf_xD(3),xDp(1),xDp(2),xDp(3), &
              zetaDp(1),zetaDp(2),zetaDp(3),dt, &
              iCellNumber1,iCellNumber2,iCellNumber3,l1,l2,l3,dx1,dx2,dx3,redx1,redx2,redx3, &
              knudsen(1:iCellNumber1,1:iCellNumber2,1:iCellNumber3),knudsen1,tauObject0, &
              K1,rsp(c_rsp+1:c_rsp+K1),icoll_event,dtsub,c_rsp)
         !generate new random variables if the private buffer is nearly exhausted
         if( (M_rs-c_rsp) .le. 2*K1 ) then
            !shift the unused random variables to the front of the buffer
            do irn = 1, M_rs-c_rsp
               rsp(irn) = rsp(irn+c_rsp)
            enddo
            !refill the tail of the buffer with new random variables
            errcode = vsrnguniform( method, streamR(p), c_rsp, rsp(M_rs-c_rsp+1:M_rs), 0.0e0, 1.0e0 )
            !count the number of random variables generated so far
            int8 = c_rsp
            c_rs_sum_Mp = c_rs_sum_Mp + int8
            !clear the temporary counter
            c_rsp = 0
         endif
         if(icoll_event==0) exit
      end do
      !copy the private variables back to the corresponding addresses of the shared variables (j fixed)
      xD(1:3,j) = xDp(1:3)
      zetaD(1:3,j) = zetaDp(1:3)
   enddo
   !copy the private variables back to the corresponding addresses of the shared variables (p fixed)
   c_rs(p) = c_rsp
   rs(1:M_rs,p) = rsp(1:M_rs)
   c_rs_sum_M(p) = c_rs_sum_Mp
enddo
!$omp end do
!$omp end parallel
!$ en = omp_get_wtime()
The code you have shown implies that the parallel region is entered only once, and it runs for a very short time (tens of ms).
If this is what the test represents, then it is likely that the majority of the time is OpenMP thread pool instantiation time.
See what happens when you enclose the above code in a loop (say 3 iterations) and record the times for each iteration.
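For example (a minimal sketch; "iter" is a placeholder name and the region body stands in for the code shown above):
do iter = 1, 3
   !$ st = omp_get_wtime()
   !$omp parallel
   ! ... the parallel work shown above ...
   !$omp end parallel
   !$ en = omp_get_wtime()
   print *, 'iteration', iter, ' wall time ', en - st
enddo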
Jim Dempsey
Dear Mr. Dempsey,
Thank you for your kind message.
Regarding my previous post, in the actual code I ran:
(i) it was enclosed by an outer loop over "n" (corresponding to the time step increment), and the wall time reported was that of the 10th iteration:
n = 0
do
   n = n + 1
   !$ st = omp_get_wtime()
   !$omp parallel default(none), &
   !$omp firstprivate(...), &
   !$omp private(...), &
   !$omp shared(...), num_threads(1, 2, 4, 8, 16, or 32)
   !$omp do
   ........
   !$omp end do
   !$omp end parallel
   !$ en = omp_get_wtime()
   if (n == 10) write(*,*) en - st
enddo
(ii) it included several other parts I did not show (which I judged negligible).
This time, I deleted the superfluous parts (ii) and output the timings for the first ten iterations (n = 1, 2, ..., 10).
For n = 1 the result deviates noticeably from that of the other iterations; however,
for later n the circumstances are more or less the same...
Windows 10, Intel Xeon Gold 6140 2CPU (18*2 = 36 cores), Intel Visual Fortran compiler 14.0.3.202
1 thread 8 threads 32 threads
n=1 0.0525 0.0419 0.0141
n=2 0.0364 0.0106 0.0027
n=3 0.0363 0.0091 0.0027
n=4 0.0362 0.0045 0.0026
n=5 0.0362 0.0045 0.0026
n=6 0.0363 0.0045 0.0026
n=7 0.0565 0.0045 0.0026
n=8 0.0371 0.0046 0.0014
n=9 0.0362 0.0046 0.0014
n=10 0.0362 0.0046 0.0014
Linux(CentOS 6.6), Xeon E5-2695v3 (14 cores), ifort (IFORT) 15.0.3 20150407
1 thread 2 threads 8 threads
n=1 0.0217 0.0217 0.0327
n=2 0.0217 0.0175 0.0287
n=3 0.0218 0.0185 0.0284
n=4 0.0217 0.0138 0.0268
n=5 0.0216 0.0133 0.0262
n=6 0.0216 0.0132 0.0266
n=7 0.0217 0.0133 0.0274
n=8 0.0218 0.0133 0.0265
n=9 0.0217 0.0131 0.0269
n=10 0.0217 0.0132 0.0258
Sincerely yours,
Masanari HATTORI
On the Linux system, remove num_threads(...) from the !$omp parallel.
Then, following that, add:
if(omp_get_thread_num() == 0) print *,omp_get_num_threads()
You might also run (prior to the run):
export KMP_SETTINGS=TRUE
This will print the OpenMP runtime library environment variables.
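Put together, the check looks something like this (a sketch; the clauses and the region body are elided, and omp_lib is assumed to be USEd):
!$omp parallel default(none), &
!$omp shared(...)   ! clauses as before, but with num_threads(...) removed
if(omp_get_thread_num() == 0) print *,omp_get_num_threads()
! ... rest of the region as before ...
!$omp end parallel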
Jim Dempsey
Additional information
I notice that your program uses MKL.
You should be aware that if your program uses OpenMP, and if you predominantly make MKL calls from within parallel regions, you should link in the sequential version of MKL.
Conversely, if your program is serial, or is OpenMP parallel with MKL calls made from serial regions, then you should link in the parallel version of MKL.
Also, in the case of OpenMP parallel with MKL calls from serial regions, you should experiment with setting KMP_BLOCKTIME=0.
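In terms of command lines, that looks something like this (a sketch; "..." stands for your source files):
ifort -qopenmp -mkl=sequential ...   (MKL called from inside OpenMP parallel regions)
ifort -qopenmp -mkl=parallel ...     (MKL called only from serial regions)
export KMP_BLOCKTIME=0               (experiment with this in the serial-region case)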
Jim Dempsey
Dear Mr. Dempsey,
Thank you for the information.
When I did "export KMP_SETTINGS=TRUE", the following was obtained:
User settings:
KMP_SETTINGS=TRUE
Effective settings:
KMP_ABORT_DELAY=0
KMP_ABORT_IF_NO_IRML=false
KMP_ADAPTIVE_LOCK_PROPS='1,1024'
KMP_ALIGN_ALLOC=64
KMP_ALL_THREADPRIVATE=128
KMP_ALL_THREADS=2147483647
KMP_ASAT_DEC=1
KMP_ASAT_FAVOR=0
KMP_ASAT_INC=4
KMP_ASAT_INTERVAL=5
KMP_ASAT_TRIGGER=5000
KMP_ATOMIC_MODE=2
KMP_BLOCKTIME=200
KMP_CPUINFO_FILE: value is not defined
KMP_DETERMINISTIC_REDUCTION=false
KMP_DUPLICATE_LIB_OK=false
KMP_FORCE_REDUCTION: value is not defined
KMP_FOREIGN_THREADS_THREADPRIVATE=true
KMP_FORKJOIN_BARRIER='2,2'
KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
KMP_FORKJOIN_FRAMES=true
KMP_FORKJOIN_FRAMES_MODE=3
KMP_GTID_MODE=3
KMP_HANDLE_SIGNALS=false
KMP_HOT_TEAMS_MAX_LEVEL=1
KMP_HOT_TEAMS_MODE=0
KMP_INIT_AT_FORK=true
KMP_INIT_WAIT=2048
KMP_ITT_PREPARE_DELAY=0
KMP_LIBRARY=throughput
KMP_LOCK_KIND=queuing
KMP_MALLOC_POOL_INCR=1M
KMP_MONITOR_STACKSIZE: value is not defined
KMP_NEXT_WAIT=1024
KMP_NUM_LOCKS_IN_BLOCK=1
KMP_PLAIN_BARRIER='2,2'
KMP_PLAIN_BARRIER_PATTERN='hyper,hyper'
KMP_REDUCTION_BARRIER='1,1'
KMP_REDUCTION_BARRIER_PATTERN='hyper,hyper'
KMP_SCHEDULE='static,balanced;guided,iterative'
KMP_SETTINGS=true
KMP_STACKOFFSET=64
KMP_STACKPAD=0
KMP_STACKSIZE=4M
KMP_STORAGE_MAP=false
KMP_TASKING=2
KMP_TASK_STEALING_CONSTRAINT=1
KMP_USE_IRML=false
KMP_VERSION=false
KMP_WARNINGS=true
OMP_CANCELLATION=false
OMP_DISPLAY_ENV=false
OMP_DYNAMIC=false
OMP_MAX_ACTIVE_LEVELS=2147483647
OMP_NESTED=false
OMP_NUM_THREADS: value is not defined
OMP_PLACES: value is not defined
OMP_PROC_BIND='false'
OMP_SCHEDULE='static'
OMP_STACKSIZE=4M
OMP_THREAD_LIMIT=2147483647
OMP_WAIT_POLICY=PASSIVE
KMP_AFFINITY='noverbose,warnings,respect,granularity=core,duplicates,none'
Removing num_threads(...) from the !$omp parallel and adding "if(omp_get_thread_num() == 0) print *,omp_get_num_threads()",
the number 28 was output.
After I tried KMP_BLOCKTIME=0, only KMP_ASAT_INTERVAL changed (from 5 to 1000); the values of the other variables were the same.
I examined the sequential and parallel versions of MKL (-mkl=sequential and -mkl=parallel).
In addition, I examined static linking of MKL (-static-intel), since I found that the default is dynamic linking on Linux but static on Windows.
In total, I tried 2*2 = 4 compile options; unfortunately, however, the situation still does not improve much...
Linux(CentOS 6.6), Xeon E5-2695v3 (14 cores), ifort (IFORT) 15.0.3 20150407
ifort -qopenmp -mkl=sequential -static-intel
1 thread 2 threads 4 threads (I confirmed that the corresponding numbers 1, 2, or 4 were returned by print *,omp_get_num_threads())
n=1 0.0325 0.0225 0.0285
n=2 0.0217 0.0176 0.0258
n=3 0.0218 0.0177 0.0247
n=4 0.0216 0.0172 0.0267
n=5 0.0216 0.0136 0.0274
n=6 0.0216 0.0125 0.0284
n=7 0.0217 0.0124 0.0274
n=8 0.0217 0.0124 0.0266
n=9 0.0215 0.0125 0.0277
n=10 0.0215 0.0123 0.0280
ifort -qopenmp -mkl=parallel -static-intel
1 thread 2 threads 4 threads
n=1 0.0331 0.0230 0.0238
n=2 0.0217 0.0180 0.0209
n=3 0.0218 0.0179 0.0223
n=4 0.0216 0.0176 0.0208
n=5 0.0215 0.0146 0.0239
n=6 0.0215 0.0168 0.0221
n=7 0.0216 0.0139 0.0220
n=8 0.0216 0.0136 0.0259
n=9 0.0215 0.0136 0.0231
n=10 0.0214 0.0136 0.0249
ifort -qopenmp -mkl=sequential
1 thread 2 threads 4 threads
n=1 0.0384 0.0224 0.0265
n=2 0.0310 0.0175 0.0216
n=3 0.0310 0.0175 0.0244
n=4 0.0265 0.0170 0.0235
n=5 0.0215 0.0124 0.0232
n=6 0.0215 0.0120 0.0257
n=7 0.0215 0.0122 0.0262
n=8 0.0216 0.0121 0.0253
n=9 0.0214 0.0121 0.0218
n=10 0.0214 0.0126 0.0246
ifort -qopenmp -mkl=parallel
1 thread 2 threads 4 threads
n=1 0.0216 0.0208 0.0194
n=2 0.0215 0.0171 0.0174
n=3 0.0216 0.0168 0.0169
n=4 0.0214 0.0170 0.0165
n=5 0.0214 0.0168 0.0171
n=6 0.0214 0.0161 0.0193
n=7 0.0215 0.0121 0.0212
n=8 0.0215 0.0120 0.0231
n=9 0.0214 0.0120 0.0241
n=10 0.0214 0.0120 0.0232
Sincerely yours,
Masanari HATTORI
When I printed KMP_SETTINGS on the Windows machine, the following was obtained:
KMP_ABORT_DELAY=0
KMP_ABORT_IF_NO_IRML=false
KMP_ADAPTIVE_LOCK_PROPS='1,1024'
KMP_ALIGN_ALLOC=64
KMP_ALL_THREADPRIVATE=144
KMP_ALL_THREADS=32768
KMP_ASAT_DEC=1
KMP_ASAT_FAVOR=0
KMP_ASAT_INC=4
KMP_ASAT_INTERVAL=5
KMP_ASAT_TRIGGER=5000
KMP_ATOMIC_MODE=1
KMP_BLOCKTIME=200
KMP_CPUINFO_FILE: value is not defined
KMP_DETERMINISTIC_REDUCTION=false
KMP_DUPLICATE_LIB_OK=false
KMP_FORCE_REDUCTION: value is not defined
KMP_FOREIGN_THREADS_THREADPRIVATE=true
KMP_FORKJOIN_BARRIER='2,2'
KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
KMP_FORKJOIN_FRAMES=true
KMP_FORKJOIN_FRAMES_MODE=0
KMP_GTID_MODE=2
KMP_HANDLE_SIGNALS=false
KMP_INIT_AT_FORK=true
KMP_INIT_WAIT=2048
KMP_ITT_PREPARE_DELAY=0
KMP_LIBRARY=throughput
KMP_LOCK_KIND=queuing
KMP_MALLOC_POOL_INCR=1M
KMP_MONITOR_STACKSIZE: value is not defined
KMP_NEXT_WAIT=1024
KMP_NUM_LOCKS_IN_BLOCK=1
KMP_PLAIN_BARRIER='2,2'
KMP_PLAIN_BARRIER_PATTERN='hyper,hyper'
KMP_REDUCTION_BARRIER='1,1'
KMP_REDUCTION_BARRIER_PATTERN='hyper,hyper'
KMP_SCHEDULE='static,balanced;guided,iterative'
KMP_SETTINGS=true
KMP_STACKOFFSET=0
KMP_STACKSIZE=4M
KMP_STORAGE_MAP=false
KMP_TASKING=2
KMP_TASK_STEALING_CONSTRAINT=1
KMP_USE_IRML=false
KMP_VERSION=false
KMP_WARNINGS=true
OMP_CANCELLATION=false
OMP_DISPLAY_ENV=false
OMP_DYNAMIC=false
OMP_MAX_ACTIVE_LEVELS=2147483647
OMP_NESTED=false
OMP_NUM_THREADS: value is not defined
OMP_PLACES: value is not defined
OMP_PROC_BIND='false'
OMP_SCHEDULE='static'
OMP_STACKSIZE=4M
OMP_THREAD_LIMIT=32768
OMP_WAIT_POLICY=PASSIVE
KMP_AFFINITY='noverbose,warnings,respect,granularity=core,duplicates,none'
Sincerely yours,
Masanari HATTORI
I've been using CentOS 7.2 for a few years now on Intel Xeon E5-2620v2 and Intel Xeon Phi KNL 5200 and haven't seen such an issue. Perhaps this is related to your use of CentOS 6.6.
One additional variable came up last year: a customer of mine was testing a working application on a new platform (a Xeon Gold system at the vendor's location) and reported a dramatic decrease in performance. After some investigation, it was found that the loaner test system was running within a virtualized system. Is your CentOS 6.6 running inside a virtual system, IOW under a hypervisor?
Have you been able to use VTune on the CentOS system? (well, on both systems)
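For reference, a command-line hotspots collection can be as simple as the following (a sketch; it assumes VTune Amplifier's amplxe-cl is on your PATH, and "your_program" is a placeholder):
amplxe-cl -collect hotspots ./your_program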
Jim Dempsey
Dear Mr. Dempsey,
Thank you for your message.
Our CentOS is not installed in a virtual environment; it is installed on ordinary hardware.
As for VTune, I have not used it so far...
Now I have tried another Linux machine (24 cores) with CentOS 7.4, although another job is running on it and using 100% of the CPU resources
(this is the reason why I had tested on only one Linux machine so far).
Given your information about the OS, however, I decided to run the test on it.
Up to 16 threads, the performance is good (although I cannot be entirely confident, since the result was obtained with the other job running simultaneously).
So, for Linux too, is there no problem if the OS is sufficiently new??
Linux(CentOS 7.4), Xeon Gold 5118 2.3GHz 2CPU (24 cores)
1 thread 2 threads 4 threads 8 threads 16 threads 24 threads
0.0578 0.0279 0.0226 0.0086 0.0076 0.0331
0.0478 0.0267 0.0136 0.0069 0.0027 0.0050
0.0583 0.0268 0.0138 0.0071 0.0035 0.0052
0.0561 0.0266 0.0133 0.0069 0.0035 0.0049
0.0476 0.0264 0.0135 0.0069 0.0036 0.0042
0.0578 0.0266 0.0137 0.0069 0.0035 0.0049
0.0475 0.0268 0.0134 0.0071 0.0036 0.0047
0.0587 0.0267 0.0139 0.0072 0.0036 0.0043
0.0475 0.0268 0.0132 0.0070 0.0035 0.0043
0.0577 0.0266 0.0066 0.0070 0.0035 0.0044
I experienced in the past that my program, which uses SFMT19937 and vslskipaheadstream,
requires an Intel compiler version of 2014 or newer (2014 is OK, but 2013 produces an error), presumably because these functions were published around 2014.
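For reference, the per-chunk stream setup looks roughly like this (a minimal sketch, not my actual code; the seed and the skip distance are illustrative values, and the MKL VSL modules are assumed):
use mkl_vsl_type
use mkl_vsl
integer :: errcode, p
integer(kind=8) :: nskip
type(vsl_stream_state) :: streamR(pmax)   !assumes pmax is a named constant here
!one stream per chunk p; each stream is skipped ahead by a different amount
!so that the chunks draw non-overlapping random subsequences
do p = 1, pmax
   errcode = vslnewstream( streamR(p), VSL_BRNG_SFMT19937, 777 )   !777: illustrative seed
   nskip = int(p-1,8)*1000000_8                                    !illustrative skip distance
   errcode = vslskipaheadstream( streamR(p), nskip )
enddo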
The compiler on the former Linux machine I showed is not so old (version 2015).
Do you think the results imply that not only the compiler but also the OS must be new enough for good performance?
(CentOS 6.6 was released in Oct. 2014 and 7.4 in Aug. 2017.)
Sincerely yours,
Masanari HATTORI
The optimization improvement from compiler version 2015 to 2019 should be significant. This is due to architectural differences (read: instruction set) between Xeon Scalable (AVX-512) and the Xeon E5 series (AVX2). Acquiring a license for the 2019 update... provides you access to earlier versions as well, e.g. 2017... I do suggest that you run verification tests of your software using different optimization levels, different instruction sets, and different compiler versions (in case you run into problems with the latest version, newest instruction set, or highest optimization level).
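Concretely, such verification runs might look like this (illustrative ifort command lines; "..." stands for your source files):
ifort -qopenmp -mkl -O2 -xCORE-AVX2 ...     (AVX2 code path, e.g. Xeon E5 v3)
ifort -qopenmp -mkl -O3 -xCORE-AVX512 ...   (AVX-512 code path, e.g. Xeon Scalable; requires a newer compiler)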
The scaling curve you presented, maxing out at 16 threads (actual number unknown), may be more an issue of the number of memory channels (2x6 = 12) or of cache (2x16.5 MB = 33 MB).
Note that CentOS 6.6 predates AVX-512; Xeon Gold support may be problematic (or incomplete) on CentOS 6.6.
Jim Dempsey
Dear Mr. Dempsey,
Understood. Thank you for the information and for sparing your time for me.
Sincerely yours,
Masanari HATTORI