topic OpenMP performance completely differs between Windows and Linux machines in Intel® Moderncode for Parallel Architectures

OpenMP performance completely differs between Windows and Linux machines

Hattori__Masanari — Thu, 23 May 2019 11:54:39 GMT

Hello, I have tested some Fortran code with OpenMP (shown below). The code calls a subroutine and generates random numbers.
To avoid conflicts among threads, I carefully distinguished the private and shared variables.

When I run the code on a Windows machine (with ifort version 14), the scalability is very good (almost linear).

However, when I run the same code on a Linux machine (with the newer ifort version 15), the performance is very poor.

I wonder why such a big difference arises. Is there something I must take care of on Linux machines (e.g., compile options, affinity, OpenMP environment variable settings)?

I compiled with just "ifort /Qopenmp /Qmkl" for Windows and "ifort -qopenmp -mkl" for Linux.

Wall time (seconds, from omp_get_wtime)

Windows 10, Intel Xeon E3-1241 v3 (4 core), Intel Visual Fortran compiler 14.0.3.202
0.03215956 (1 thread)
0.01608227 (2 threads)
0.01093314 (3 threads)
0.00844201 (4 threads)

Windows 10, Intel Xeon Gold 6140 2CPU (18*2 = 36 cores), Intel Visual Fortran compiler 14.0.3.202
0.02839129 (1 thread)
0.02303925 (2 threads)
0.01156269 (4 threads)
0.00576751 (8 threads)
0.00294846 (16 threads)
0.00150034 (32 threads)

Linux(CentOS 6.6), Xeon E5-2695v3 (14 cores), ifort (IFORT) 15.0.3 20150407
0.02155900 (1 thread)
0.01605296 (2 threads)
0.01835489 (4 threads)
0.02340984 (8 threads)

    !$ st = omp_get_wtime()
      !$omp parallel default(none), &
      !$omp firstprivate(pmax,N_tot,dt,M_rs,K1,iCellNumber1,iCellNumber2,iCellNumber3,l1,l2,l3,dx1,dx2,dx3,redx1,redx2,redx3, &
      !$omp              knudsen), &
      !$omp private(sjp,fjp,j,xDp,zetaDp,buf_xD,icoll_event,c_rsp,rsp,dtsub,errcode,int8,c_rs_sum_Mp,iincx,iincy,iincz, &
      !$omp         irn), &
      !$omp shared(xD,zetaD,c_rs,rs,streamR,c_rs_sum_M), num_threads(2)
      !$omp do
      do p = 1, pmax

        !copy the value of shared variables to corresponding private variables (p fixed)
        c_rsp = c_rs(p)
        rsp(1:M_rs) = rs(1:M_rs,p)
        c_rs_sum_Mp = c_rs_sum_M(p)

        sjp = 1 + (p-1)*(N_tot/pmax) + min((p-1),mod(N_tot,pmax))
        fjp = sjp + N_tot/pmax -1
        if( mod(N_tot,pmax) >= p ) fjp = fjp + 1
        do j = sjp, fjp

          !copy the value of shared variables to corresponding private variables (j fixed)
          xDp(1:3) = xD(1:3,j)
          zetaDp(1:3) = zetaD(1:3,j)

          buf_xD(1:3) = xDp(1:3)

          !advection
          xDp(1:3) = xDp(1:3) + dt * zetaDp(1:3)

          !for each j main computation is carried out, which is composed of a subroutine COMPUTATION and generation of random variables using Intel MKL
          do
            icoll_event = 0

            call COMPUTATION(buf_xD(1),buf_xD(2),buf_xD(3),xDp(1),xDp(2),xDp(3),&
                            zetaDp(1),zetaDp(2),zetaDp(3),dt,&
                            iCellNumber1,iCellNumber2,iCellNumber3,l1,l2,l3,dx1,dx2,dx3,redx1,redx2,redx3,&
                            knudsen(1:iCellNumber1,1:iCellNumber2,1:iCellNumber3),knudsen1,tauObject0,&
                            K1,rsp(c_rsp+1:c_rsp+K1),icoll_event,dtsub,c_rsp)

            !generating random variables if necessary
            if( (M_rs-c_rsp) .le. 2*K1 ) then
              !shifting unused random variables
              do irn = 1, M_rs-c_rsp
                rsp(irn) = rsp(irn+c_rsp)
              enddo

              !generating random variables
              errcode = vsrnguniform( method, streamR(p), c_rsp, rsp(M_rs-c_rsp+1:M_rs), 0.0e0, 1.0e0 )

              !counting the number of random variables generated so far
              int8 = c_rsp
              c_rs_sum_Mp = c_rs_sum_Mp + int8

              !clearing temporary counter
              c_rsp = 0
            endif

            if(icoll_event==0) exit

          end do

          !copy back private variables to corresponding address of shared variables (j fixed)
          xD(1:3,j) = xDp(1:3)
          zetaD(1:3,j) = zetaDp(1:3)

        enddo

        !copy back private variables to corresponding address of shared variables (p fixed)
        c_rs(p) = c_rsp
        rs(1:M_rs,p) = rsp(1:M_rs)
        c_rs_sum_M(p) = c_rs_sum_Mp

      enddo
      !$omp end do
      !$omp end parallel
    !$ en = omp_get_wtime()
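
As a side check on the index arithmetic in the listing above, here is a minimal sketch that reproduces the formulas for sjp and fjp and verifies that the pmax blocks tile 1..N_tot with no gap or overlap. The names block_partition, block_range and check_partition, and the values N_tot = 10 and pmax = 3, are illustrative assumptions and are not taken from the actual program.

```fortran
module block_partition
  implicit none
contains
  ! First (sjp) and last (fjp) index of block p, using the same formulas as
  ! the listing: the first mod(n_tot,pmax) blocks get one extra element.
  pure subroutine block_range(p, pmax, n_tot, sjp, fjp)
    integer, intent(in)  :: p, pmax, n_tot
    integer, intent(out) :: sjp, fjp
    sjp = 1 + (p-1)*(n_tot/pmax) + min(p-1, mod(n_tot,pmax))
    fjp = sjp + n_tot/pmax - 1
    if (mod(n_tot,pmax) >= p) fjp = fjp + 1
  end subroutine block_range
end module block_partition

program check_partition
  use block_partition
  implicit none
  integer, parameter :: n_tot = 10, pmax = 3   ! illustrative values only
  integer :: p, sjp, fjp, next

  next = 1                                     ! index the next block must start at
  do p = 1, pmax
    call block_range(p, pmax, n_tot, sjp, fjp)
    if (sjp /= next) error stop 'gap or overlap between blocks'
    print '(a,i0,a,i0,a,i0)', 'p=', p, ': j = ', sjp, ' .. ', fjp
    next = fjp + 1
  end do
  if (next /= n_tot + 1) error stop 'blocks do not cover 1..n_tot'
end program check_partition
```

For these values the blocks are 1..4, 5..7 and 8..10. Note that the !$omp do loop hands out whole values of p, so a team with more threads than pmax leaves some threads without work; the value of pmax used for the timings is not stated in the post.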

jimdempseyatthecove — Fri, 24 May 2019 12:45:14 GMT

The code you have shown implies that the parallel region is entered only once, and runs for a very short time (tens of ms).

If this is what the test represents, then it is likely that the majority of the time is the OpenMP thread pool instantiation time.

See what happens when you enclose the above code in a loop (say 3 iterations) and record the times for each iteration.
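
A minimal sketch of that experiment, assuming a placeholder workload in place of the real loop body: the same parallel region is executed several times and each pass is timed separately, so that the one-time cost of creating the thread pool shows up only in the first pass. The program name, the array a and the sizes n_iter and n are assumptions made for illustration.

```fortran
program time_parallel_region
  use omp_lib                        ! needs OpenMP enabled (e.g. -qopenmp)
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  integer, parameter :: n_iter = 3          ! number of timed passes
  integer, parameter :: n = 2000000         ! placeholder workload size
  real(real64), allocatable :: a(:)
  real(real64) :: st, en
  integer :: it, i

  allocate(a(n))
  a = 1.0_real64

  do it = 1, n_iter
    st = omp_get_wtime()
    !$omp parallel do default(none) shared(a) private(i)
    do i = 1, n
      a(i) = sqrt(a(i) + real(i, real64))   ! stand-in for the real work
    end do
    !$omp end parallel do
    en = omp_get_wtime()
    print '(a,i0,a,f10.6,a)', 'pass ', it, ': ', en - st, ' s'
  end do

  print *, 'checksum:', sum(a)       ! uses the result so the loop is not removed
end program time_parallel_region
```

If the first pass is much slower than the later ones, the difference is startup cost rather than a property of the loop itself.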

Jim Dempsey

Hattori__Masanari — Mon, 27 May 2019 01:40:09 GMT

Dear Mr. Dempsey,

Thank you for your kind message.

Regarding the actual code I ran for the previous results:

(i) it was enclosed in an outer loop over "n" (the time-step counter), and the wall time I reported was that of the 10th iteration

     n = 0
     do
       n = n + 1
       !$ st = omp_get_wtime()
       !$omp parallel default(none), &
       !$omp firstprivate(...), &
       !$omp private(...), &
       !$omp shared(...), num_threads(1, 2, 4, 8, 16, or 32)
       !$omp do
       ........
       !$omp end do
       !$omp end parallel
       !$ en = omp_get_wtime()
       if (n == 10) write(*,*) en - st
     enddo

(ii) it included several other parts that I did not show (which I judged negligible)

This time, I deleted the superfluous parts (ii) and output the wall time for the first ten iterations (n = 1, 2, ..., 10).
For n = 1 the result differs noticeably from that for the other n; for later n, however,
the situation is more or less the same...

    Windows 10, Intel Xeon Gold 6140 2CPU (18*2 = 36 cores), Intel Visual Fortran compiler 14.0.3.202
            1 thread   8 threads   32 threads
    n=1     0.0525     0.0419      0.0141
    n=2     0.0364     0.0106      0.0027
    n=3     0.0363     0.0091      0.0027
    n=4     0.0362     0.0045      0.0026
    n=5     0.0362     0.0045      0.0026
    n=6     0.0363     0.0045      0.0026
    n=7     0.0565     0.0045      0.0026
    n=8     0.0371     0.0046      0.0014
    n=9     0.0362     0.0046      0.0014
    n=10    0.0362     0.0046      0.0014

    Linux(CentOS 6.6), Xeon E5-2695v3 (14 cores), ifort (IFORT) 15.0.3 20150407
            1 thread   2 threads   8 threads
    n=1     0.0217     0.0217      0.0327
    n=2     0.0217     0.0175      0.0287
    n=3     0.0218     0.0185      0.0284
    n=4     0.0217     0.0138      0.0268
    n=5     0.0216     0.0133      0.0262
    n=6     0.0216     0.0132      0.0266
    n=7     0.0217     0.0133      0.0274
    n=8     0.0218     0.0133      0.0265
    n=9     0.0217     0.0131      0.0269
    n=10    0.0217     0.0132      0.0258

Sincerely yours,

Masanari HATTORI

jimdempseyatthecove — Tue, 28 May 2019 14:37:12 GMT

On the Linux system, remove num_threads(...) from the !$omp parallel directive.
Then, immediately after that directive, add:

if(omp_get_thread_num() == 0) print *,omp_get_num_threads()

You might also set (prior to the run):

export KMP_SETTINGS=TRUE

This will print the OpenMP runtime library environment variables.
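
A minimal sketch of where that print statement has to go, assuming an otherwise empty parallel region: omp_get_num_threads() reports the team size only inside a parallel region and returns 1 when called from serial code. The program name is an assumption.

```fortran
program show_team_size
  use omp_lib                        ! needs OpenMP enabled (e.g. -qopenmp)
  implicit none

  ! In serial code the "team" is just the initial thread.
  print *, 'outside parallel region:', omp_get_num_threads()

  ! No num_threads clause: the runtime picks the team size
  ! (from OMP_NUM_THREADS if set, otherwise its own default).
  !$omp parallel default(none)
  if (omp_get_thread_num() == 0) print *, 'inside parallel region: ', omp_get_num_threads()
  !$omp end parallel
end program show_team_size
```

Running it with and without OMP_NUM_THREADS set shows which team size the runtime chooses by default on a given machine.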

Jim Dempsey

jimdempseyatthecove — Tue, 28 May 2019 14:45:41 GMT

Additional information

I notice your program uses MKL.

You should be aware that if your program uses OpenMP, and if you make most of your MKL calls from within parallel regions, you should link in the serial (sequential) version of MKL.

Conversely, if your program is serial, or is OpenMP parallel with the MKL calls made from serial regions, then you should link in the parallel (threaded) version of MKL.

Also, in the case of OpenMP parallel with MKL calls from serial regions you should experiment with setting KMP_BLOCKTIME=0
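
The advice above is about link-time choice. As a related runtime alternative (a swapped-in technique, not what is described above), the MKL service routine mkl_set_num_threads can ask a threaded MKL to use a single thread, which is one way to try the "serial MKL inside parallel regions" setup without relinking. The sketch below assumes the program is linked against MKL (for example with -mkl); the program name and the empty region body are placeholders.

```fortran
program mkl_inside_parallel
  use omp_lib                                  ! needs OpenMP enabled
  implicit none
  integer, external :: mkl_get_max_threads     ! MKL service function
  external :: mkl_set_num_threads              ! MKL service subroutine

  ! Ask MKL to run its own routines on one thread; the application's
  ! OpenMP threads then provide all of the parallelism.
  call mkl_set_num_threads(1)
  print *, 'MKL max threads:', mkl_get_max_threads()

  !$omp parallel default(none)
  ! Placeholder: per-thread MKL calls (such as the random number
  ! generation in the original loop) would go here.
  continue
  !$omp end parallel
end program mkl_inside_parallel
```

Whether this changes the timings in this thread is untested; the measurements reported later show little difference between -mkl=sequential and -mkl=parallel.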

Jim Dempsey

Hattori__Masanari — Thu, 30 May 2019 05:32:10 GMT

Dear Mr. Dempsey,

Thank you for the information.

When I set "export KMP_SETTINGS=TRUE", the following was printed:

User settings:

     KMP_SETTINGS=TRUE

Effective settings:

     KMP_ABORT_DELAY=0
     KMP_ABORT_IF_NO_IRML=false
     KMP_ADAPTIVE_LOCK_PROPS='1,1024'
     KMP_ALIGN_ALLOC=64
     KMP_ALL_THREADPRIVATE=128
     KMP_ALL_THREADS=2147483647
     KMP_ASAT_DEC=1
     KMP_ASAT_FAVOR=0
     KMP_ASAT_INC=4
     KMP_ASAT_INTERVAL=5
     KMP_ASAT_TRIGGER=5000
     KMP_ATOMIC_MODE=2
     KMP_BLOCKTIME=200
     KMP_CPUINFO_FILE: value is not defined
     KMP_DETERMINISTIC_REDUCTION=false
     KMP_DUPLICATE_LIB_OK=false
     KMP_FORCE_REDUCTION: value is not defined
     KMP_FOREIGN_THREADS_THREADPRIVATE=true
     KMP_FORKJOIN_BARRIER='2,2'
     KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
     KMP_FORKJOIN_FRAMES=true
     KMP_FORKJOIN_FRAMES_MODE=3
     KMP_GTID_MODE=3
     KMP_HANDLE_SIGNALS=false
     KMP_HOT_TEAMS_MAX_LEVEL=1
     KMP_HOT_TEAMS_MODE=0
     KMP_INIT_AT_FORK=true
     KMP_INIT_WAIT=2048
     KMP_ITT_PREPARE_DELAY=0
     KMP_LIBRARY=throughput
     KMP_LOCK_KIND=queuing
     KMP_MALLOC_POOL_INCR=1M
     KMP_MONITOR_STACKSIZE: value is not defined
     KMP_NEXT_WAIT=1024
     KMP_NUM_LOCKS_IN_BLOCK=1
     KMP_PLAIN_BARRIER='2,2'
     KMP_PLAIN_BARRIER_PATTERN='hyper,hyper'
     KMP_REDUCTION_BARRIER='1,1'
     KMP_REDUCTION_BARRIER_PATTERN='hyper,hyper'
     KMP_SCHEDULE='static,balanced;guided,iterative'
     KMP_SETTINGS=true
     KMP_STACKOFFSET=64
     KMP_STACKPAD=0
     KMP_STACKSIZE=4M
     KMP_STORAGE_MAP=false
     KMP_TASKING=2
     KMP_TASK_STEALING_CONSTRAINT=1
     KMP_USE_IRML=false
     KMP_VERSION=false
     KMP_WARNINGS=true
     OMP_CANCELLATION=false
     OMP_DISPLAY_ENV=false
     OMP_DYNAMIC=false
     OMP_MAX_ACTIVE_LEVELS=2147483647
     OMP_NESTED=false
     OMP_NUM_THREADS: value is not defined
     OMP_PLACES: value is not defined
     OMP_PROC_BIND='false'
     OMP_SCHEDULE='static'
     OMP_STACKSIZE=4M
     OMP_THREAD_LIMIT=2147483647
     OMP_WAIT_POLICY=PASSIVE
     KMP_AFFINITY='noverbose,warnings,respect,granularity=core,duplicates,none'

After removing num_threads(...) from the !$omp parallel directive and adding "if(omp_get_thread_num() == 0) print *,omp_get_num_threads()",
the number 28 was printed.

When I set KMP_BLOCKTIME=0, only KMP_ASAT_INTERVAL changed (from 5 to 1000); the values of the other variables were the same.

I tried the serial and parallel versions of MKL (-mkl=sequential and -mkl=parallel).
In addition, I tried static linking (-static-intel), since I found that the default is dynamic linking on Linux but static linking on Windows.
In total, I tried 2*2 = 4 sets of compile options; unfortunately, the situation still does not improve much...

Linux(CentOS 6.6), Xeon E5-2695v3 (14 cores), ifort (IFORT) 15.0.3 20150407

ifort -qopenmp -mkl=sequential -static-intel
    (I confirmed that the corresponding number 1, 2, or 4 was returned by print *,omp_get_num_threads())
            1 thread   2 threads   4 threads
    n=1     0.0325     0.0225      0.0285
    n=2     0.0217     0.0176      0.0258
    n=3     0.0218     0.0177      0.0247
    n=4     0.0216     0.0172      0.0267
    n=5     0.0216     0.0136      0.0274
    n=6     0.0216     0.0125      0.0284
    n=7     0.0217     0.0124      0.0274
    n=8     0.0217     0.0124      0.0266
    n=9     0.0215     0.0125      0.0277
    n=10    0.0215     0.0123      0.0280

ifort -qopenmp -mkl=parallel -static-intel
            1 thread   2 threads   4 threads
    n=1     0.0331     0.0230      0.0238
    n=2     0.0217     0.0180      0.0209
    n=3     0.0218     0.0179      0.0223
    n=4     0.0216     0.0176      0.0208
    n=5     0.0215     0.0146      0.0239
    n=6     0.0215     0.0168      0.0221
    n=7     0.0216     0.0139      0.0220
    n=8     0.0216     0.0136      0.0259
    n=9     0.0215     0.0136      0.0231
    n=10    0.0214     0.0136      0.0249

ifort -qopenmp -mkl=sequential
            1 thread   2 threads   4 threads
    n=1     0.0384     0.0224      0.0265
    n=2     0.0310     0.0175      0.0216
    n=3     0.0310     0.0175      0.0244
    n=4     0.0265     0.0170      0.0235
    n=5     0.0215     0.0124      0.0232
    n=6     0.0215     0.0120      0.0257
    n=7     0.0215     0.0122      0.0262
    n=8     0.0216     0.0121      0.0253
    n=9     0.0214     0.0121      0.0218
    n=10    0.0214     0.0126      0.0246

ifort -qopenmp -mkl=parallel
            1 thread   2 threads   4 threads
    n=1     0.0216     0.0208      0.0194
    n=2     0.0215     0.0171      0.0174
    n=3     0.0216     0.0168      0.0169
    n=4     0.0214     0.0170      0.0165
    n=5     0.0214     0.0168      0.0171
    n=6     0.0214     0.0161      0.0193
    n=7     0.0215     0.0121      0.0212
    n=8     0.0215     0.0120      0.0231
    n=9     0.0214     0.0120      0.0241
    n=10    0.0214     0.0120      0.0232

Sincerely yours,

Masanari HATTORI

Hattori__Masanari — Thu, 30 May 2019 05:57:33 GMT

When I printed KMP_SETTINGS on the Windows machine, the following was obtained:

   KMP_ABORT_DELAY=0
   KMP_ABORT_IF_NO_IRML=false
   KMP_ADAPTIVE_LOCK_PROPS='1,1024'
   KMP_ALIGN_ALLOC=64
   KMP_ALL_THREADPRIVATE=144
   KMP_ALL_THREADS=32768
   KMP_ASAT_DEC=1
   KMP_ASAT_FAVOR=0
   KMP_ASAT_INC=4
   KMP_ASAT_INTERVAL=5
   KMP_ASAT_TRIGGER=5000
   KMP_ATOMIC_MODE=1
   KMP_BLOCKTIME=200
   KMP_CPUINFO_FILE: value is not defined
   KMP_DETERMINISTIC_REDUCTION=false
   KMP_DUPLICATE_LIB_OK=false
   KMP_FORCE_REDUCTION: value is not defined
   KMP_FOREIGN_THREADS_THREADPRIVATE=true
   KMP_FORKJOIN_BARRIER='2,2'
   KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper'
   KMP_FORKJOIN_FRAMES=true
   KMP_FORKJOIN_FRAMES_MODE=0
   KMP_GTID_MODE=2
   KMP_HANDLE_SIGNALS=false
   KMP_INIT_AT_FORK=true
   KMP_INIT_WAIT=2048
   KMP_ITT_PREPARE_DELAY=0
   KMP_LIBRARY=throughput
   KMP_LOCK_KIND=queuing
   KMP_MALLOC_POOL_INCR=1M
   KMP_MONITOR_STACKSIZE: value is not defined
   KMP_NEXT_WAIT=1024
   KMP_NUM_LOCKS_IN_BLOCK=1
   KMP_PLAIN_BARRIER='2,2'
   KMP_PLAIN_BARRIER_PATTERN='hyper,hyper'
   KMP_REDUCTION_BARRIER='1,1'
   KMP_REDUCTION_BARRIER_PATTERN='hyper,hyper'
   KMP_SCHEDULE='static,balanced;guided,iterative'
   KMP_SETTINGS=true
   KMP_STACKOFFSET=0
   KMP_STACKSIZE=4M
   KMP_STORAGE_MAP=false
   KMP_TASKING=2
   KMP_TASK_STEALING_CONSTRAINT=1
   KMP_USE_IRML=false
   KMP_VERSION=false
   KMP_WARNINGS=true
   OMP_CANCELLATION=false
   OMP_DISPLAY_ENV=false
   OMP_DYNAMIC=false
   OMP_MAX_ACTIVE_LEVELS=2147483647
   OMP_NESTED=false
   OMP_NUM_THREADS: value is not defined
   OMP_PLACES: value is not defined
   OMP_PROC_BIND='false'
   OMP_SCHEDULE='static'
   OMP_STACKSIZE=4M
   OMP_THREAD_LIMIT=32768
   OMP_WAIT_POLICY=PASSIVE
   KMP_AFFINITY='noverbose,warnings,respect,granularity=core,duplicates,none'

Sincerely yours,

Masanari HATTORI

jimdempseyatthecove — Thu, 30 May 2019 11:55:24 GMT

I've been using CentOS 7.2 for a few years now on Intel Xeon E5-2620v2 and Intel Xeon Phi KNL 5200 and haven't seen such an issue. Perhaps this may be related to you using CentOS 6.6.

One additional variable that came up last year: a customer of mine was testing a working application on a new platform (a Xeon Gold system at the vendor's location) and reported a dramatic decrease in performance. After some investigation, it was found that the loaner test system was running within a virtualized system. Is your CentOS 6.6 running inside a virtual system, in other words under a hypervisor?

Have you been able to use VTune on the CentOS system? (well both systems)

Jim Dempsey

Hattori__Masanari — Thu, 30 May 2019 13:23:15 GMT

Dear Mr. Dempsey,

Thank you for your message.

Our CentOS is not installed in a virtual environment; it runs directly on physical hardware.
As for VTune, I have not used it so far...

I have now tried another Linux machine (24 cores) with CentOS 7.4, although another job is running on it and using 100% of the CPU resources
(this is the reason why I had tested on only one Linux machine so far).
Following your comment about the OS, I decided to run the test on it anyway.

Up to 16 threads, the performance is good (although I cannot be fully confident, since the result was obtained while the other job was running).
So is there no problem on Linux either, provided the OS is sufficiently new?

Linux(CentOS 7.4) Xeon Gold 5118 2.3GHz 2CPU (24core)

1 thread    2 threads   4 threads   8 threads   16 threads  24 threads
0.0578      0.0279      0.0226      0.0086      0.0076      0.0331
0.0478      0.0267      0.0136      0.0069      0.0027      0.0050
0.0583      0.0268      0.0138      0.0071      0.0035      0.0052
0.0561      0.0266      0.0133      0.0069      0.0035      0.0049
0.0476      0.0264      0.0135      0.0069      0.0036      0.0042
0.0578      0.0266      0.0137      0.0069      0.0035      0.0049
0.0475      0.0268      0.0134      0.0071      0.0036      0.0047
0.0587      0.0267      0.0139      0.0072      0.0036      0.0043
0.0475      0.0268      0.0132      0.0070      0.0035      0.0043
0.0577      0.0266      0.0066      0.0070      0.0035      0.0044

In the past I found that my program, which uses SFMT19937 and vslskipaheadstream,
requires Intel compiler version 2014 or newer (2014 works, but 2013 produces an error), perhaps because these functions were released around 2014.

The compiler on the first Linux machine I showed is not that old (version 2015).
Do you think these results imply that not only the compiler but also the OS must be sufficiently new for good performance?
(CentOS 6.6 was released in Oct. 2014 and 7.4 in Aug. 2017.)

Sincerely yours,

Masanari HATTORI

jimdempseyatthecove — Thu, 30 May 2019 16:51:25 GMT

The improvement in optimization from compiler version 2015 to 2019 should be significant. This is due to architectural differences (read: instruction set) between Xeon Scalable (AVX512) and the Xeon En series (AVX2). Acquiring a license for 2019 update... provides you access to earlier versions as well, e.g. 2017... I do suggest that you run verification tests of your software using different optimization levels, different instruction sets, and different compiler versions (in case you run into problems with the latest version, newest instruction set, or highest optimization level).

The scaling curve you presented, maxing out at 16 threads (actual number unknown), may be more of an issue of number of memory channels (2x6=12), or cache (2x16.5=33MB).

Note, CentOS 6.6 predates AVX512. Xeon Gold may be problematic (or incomplete support) on CentOS 6.6.

Jim Dempsey

Hattori__Masanari — Sun, 02 Jun 2019 08:37:37 GMT

Dear Mr. Dempsey,

Understood. Thank you for the information and for taking the time to help me.

Sincerely yours,

Masanari HATTORI