Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

MKL and OpenMP perform slower than simple DO loops

Johannes_A_
Beginner

I would like to continue the discussion of my performance problem with array operations in a new topic. Some history can be found in https://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/605372#comment-1854000 and http://openmp.org/forum/viewtopic.php?f=3&t=1682 . Some of the participants in these previous discussions got better performance on their hardware and compilers than I did. I was also advised to try MKL, and that is part of this topic. The disappointing message: neither OpenMP nor MKL is faster than simple DO loops on my laptop.

My questions are: is my hardware unsuited for parallel calculations, did I forget to use some special compiler options, or does hyper-threading under my Win7 x64 Home Premium SP1 impede the performance (I don't know how to suppress it)? I am attaching my processor details (bandwidth issue etc.).

This is the test code, comparing DO loops, vector notation, OpenMP and MKL.

! TESTS 26.12.2015
! Test speed for array operation y(i)=a*x(i)*y(i) in 4 different ways
!
! "C:\Program Files (x86)\Intel\Composer XE 2013 SP1\mkl\bin\mklvars" intel64 mod
! "C:\Program Files (x86)\Intel\Composer XE 2013 SP1\bin\compilervars.bat" intel64 
! ifort testMKLvsOpenMP.f90 /QopenMP /Qmkl
!    Intel(R) Visual Fortran Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 14.0.2.176 Build 20140130
!    Microsoft (R) Incremental Linker Version 9.00.21022.08
!     -out:testMKLvsOpenMP.exe
!     -subsystem:console
!     -defaultlib:libiomp5md.lib
!     -nodefaultlib:vcomp.lib
!     -nodefaultlib:vcompd.lib
!     "-libpath:C:\Program Files (x86)\Intel\Composer XE 2013 SP1\mkl\lib\intel64"
!     testMKLvsOpenMP.obj
!
program TestMKLvsOpenMP
use omp_lib
IMPLICIT NONE
integer :: N
real*8,Allocatable :: x(:),y(:)
real*8 :: alpha
real*8 :: endtime,starttime,DSECND
real :: cpu1,cpu2
integer :: NTHREADS,irepeat,nrepeat,i
! initialize
alpha=.0001
print *,'N=?'
read *,N
nrepeat=1000000000/N          ! nrepeat*N = 1e9
print *,'nrepeat=',nrepeat

Allocate (x(N),y(N))
x(:)=0.  ; y(:)=0.
pause 'Press Return'

! 1. standard do loops
forall (i=1:N) ; x(i)=i ; y(i)=-i ;end forall
Nthreads=0
starttime = OMP_get_wtime() 
Call cpu_time(cpu1)
do irepeat=1,nrepeat
 do i=1,N
   y(i)=alpha*x(i)+y(i)
 enddo
enddo
endtime = OMP_get_wtime() ; Call cpu_time(cpu2)
print *, 'Threads=',NTHREADS,' DO time=',SNGL(endtime - starttime),cpu2-cpu1
pause 'Press Return'

! 2. vector
forall (i=1:N) ; x(i)=i ; y(i)=-i ; end forall
Nthreads=0
Call cpu_time(cpu1)
do irepeat=1,nrepeat
   y(1:N)=alpha*x(1:N)+y(1:N)
enddo
Call cpu_time(cpu2)
print *, 'Threads=',NTHREADS,' Vector time=',cpu2-cpu1
pause 'Press Return'

Nthreads=2

! 3. OMP
forall (i=1:N) ;x(i)=i ; y(i)=-i ; end forall

CALL OMP_SET_NUM_THREADS(NTHREADS)
starttime = OMP_get_wtime() ; Call cpu_time(cpu1)
  
!$OMP PARALLEL Shared(N,x,y,alpha)  
do irepeat=1,nrepeat
!$OMP DO  PRIVATE(i) SCHEDULE(static) 
 do i=1,N
   y(i)=alpha*x(i)+y(i)
 enddo
!$OMP END DO nowait
enddo
!$OMP END PARALLEL
  
endtime = OMP_get_wtime() ; Call cpu_time(cpu2)
print *, 'OMP Threads=',NTHREADS,' OMPtime=',SNGL(endtime - starttime),(cpu2-cpu1)/Nthreads
pause 'Press Return'

! 4. MKL
forall (i=1:N) ; x(i)=i ; y(i)=-i ; end forall
starttime =DSECND() ; Call cpu_time(cpu1)
CALL MKL_SET_NUM_THREADS(NTHREADS)
do irepeat=1,nrepeat
  CALL daxpy(N,alpha,x,1, y ,1)
end do
endtime = DSECND(); Call cpu_time(cpu2)
print *, 'MKL Threads=',NTHREADS,' time=',SNGL(endtime - starttime),(cpu2-cpu1)/Nthreads

end

The results for N=1000000 (1 mio) and N=10000 are

 N=?
1000000
 nrepeat=        1000
Press Return
 Threads=           0  DO time=  0.1767103      0.1716011
Press Return
 Threads=           0  Vector time=  0.1716011
Press Return
 OMP Threads=           2  OMPtime=   1.397965       1.388409
Press Return
 MKL Threads=           2  time=   1.406852       1.404009
 N=?
10000
 nrepeat=      100000
Press Return
 Threads=           0  DO time=  0.1744737      0.1560010
Press Return
 Threads=           0  Vector time=  0.1716011
Press Return
 OMP Threads=           2  OMPtime=  0.2589355      0.2574016
Press Return
 MKL Threads=           2  time=  0.3295782      0.3120020

When I go down to N=100, OMP and MKL run slower:

 OMP Threads=           2  OMPtime=  0.8096205      0.8112052
 MKL Threads=           2  time=  0.4926147      0.2418015

 All comments are welcome.

5 Replies
TimP
Honored Contributor III

I'm having difficulty understanding the limits of the topics you are interested in, and how you expect to generalize from there to the idea that your CPU is unsuited to parallel usage. 

If you have just 2 cores, as we discussed recently, 2 threads should be sufficient to get full usage of the floating point hardware.

If you compare 1 thread vectorized with 2 threads scalar,  you wouldn't expect an advantage for 2 threads.

Also, it's no surprise that multi-threading is of no value for simple loops of length 100.  Even if you care to vectorize, each thread gets a loop which is too short to approach full vector performance.  MKL may take the short loop as a key to use just 1 thread; MKL_NUM_THREADS may be taken only as an upper limit.  For another example, MKL will not use multiple threads per core without additional settings which you didn't mention having tried.

MKL Level 1 BLAS calls were of value in the days when people used non-vectorizing compilers. The greater value of MKL (and of vector-inner, parallel-outer nested loops) comes in Level 2 and Level 3 operations.
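A minimal sketch of the contrast Tim describes (matrix size and values are arbitrary; compile with /Qmkl). A Level 1 call touches each element once, so it is memory bound; a Level 2 call does O(n**2) arithmetic on O(n**2) data and gives MKL's threading and vectorization more to amortize:

```fortran
program blas_levels
  implicit none
  integer, parameter :: n = 2000
  real*8 :: alpha, beta
  real*8 :: x(n), y(n), a(n,n)
  alpha = 1.0d-4 ; beta = 1.0d0
  call random_number(x) ; call random_number(y) ; call random_number(a)
  ! Level 1: y = alpha*x + y   (2n flops on ~2n data -- memory bound)
  call daxpy(n, alpha, x, 1, y, 1)
  ! Level 2: y = alpha*A*x + beta*y   (2n**2 flops on ~n**2 data)
  call dgemv('N', n, n, alpha, a, n, x, 1, beta, y, 1)
  print *, 'y(1) =', y(1)
end program blas_levels
```

Timing these two calls under the same repeat loop should show MKL's threads paying off for dgemv well before they pay off for daxpy.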

Your simple hardware won't run into many of the NUMA issues of threaded parallelism, but won't impress you if you are looking for more than 1.5x to 2x speedup from multi-thread.  On the other hand, you won't need as much effort or as sophisticated problems to see the expected advantage.

Options like /QaxAVX /arch:SSE3 are intended to help you cover a range of target architectures with near optimum vector instruction selection, as MKL does internally.

Johannes_A_
Beginner

On my laptop Jim Dempsey's code in https://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/605372#comment-1854000 is running faster with 2 threads, indeed. From that point of view parallel operations do work. Why is the OMP performance of my code so bad? Is it because  this is a loop with only 1 statement? Did you compile it on your PC with your favourite options and what would be the cpu time?

TimP
Honored Contributor III

Your "Vector time" case allows ifort to optimize away more of the redundant repeated operations than the thread-parallelized ones, so it is not a realistic comparison unless that is specifically what you are trying to assess. It seems clear when you run it that "Vector time" does not account for nrepeat repetitions. My laptop is somewhat faster than yours on a single repetition of the vectorized loop, but not on the repeated threaded cases.

Even if you run a test where each thread performs as much work as your single-thread comparison case (which it looks like you have done), you would expect some slowdown. The point of parallelism is to divide the work among threads, not to replicate it.
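Tim's point about optimized-away repetitions can be checked with a small change to the timing loop. A hedged sketch (array sizes and constants chosen arbitrarily; compile with /Qopenmp for omp_lib): making each pass depend on the repeat counter, and consuming the result after the timer stops, keeps the compiler from collapsing nrepeat identical passes into one:

```fortran
program no_dce
  use omp_lib
  implicit none
  integer, parameter :: N = 100000, nrepeat = 10000
  real*8 :: x(N), y(N), alpha, a, checksum, t0, t1
  integer :: i, irepeat
  alpha = 1.0d-4
  do i = 1, N
    x(i) = i ; y(i) = -i
  enddo
  t0 = OMP_get_wtime()
  do irepeat = 1, nrepeat
    a = alpha*irepeat        ! each pass now differs -> cannot be hoisted
    do i = 1, N
      y(i) = a*x(i) + y(i)
    enddo
  enddo
  t1 = OMP_get_wtime()
  checksum = sum(y)          ! consume y so the whole loop is not dead code
  print *, 'time =', t1 - t0, ' checksum =', checksum
end program no_dce
```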

Johannes_A_
Beginner

Tim, you were right in assuming that the compiler is optimizing redundant loops away!

I added   

   alpha=alpha*irepeat

within each of the outer loops, and the results are more understandable:

 N=?
1000000
 nrepeat=        1000
 Threads=           0  DO time=   1.449055       1.450809
 Threads=           0  Vector time=   1.466409
 OMP Threads=           2  OMPtime=   1.379888       1.372809
 MKL Threads=           2  time=   1.401189       1.404009

It looks as if 1 million is too much for the bandwidth of my processor. With 100,000 it works as expected:

 N=?
100000
 nrepeat=       10000

 Threads=           0  DO time=  0.6784180      0.6864044
 Threads=           0  Vector time=  0.6864044
 OMP Threads=           2  OMPtime=  0.3938010      0.3900025
 MKL Threads=           2  time=  0.3822132      0.3900025

N=200,000 gives a smaller improvement with OMP, and so does 80,000.

Thanks for the solution.

John_Campbell
New Contributor II

Johannes,

One of the main problems in testing !$OMP is using loop tests that are too short. On Windows, entering a !$OMP region takes about 5 microseconds (2 to 50, depending on the compiler). That is about 20,000 processor cycles, which will swamp a small DO loop.
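That fixed entry cost can be measured directly. A hedged sketch (repeat count arbitrary; compile with /Qopenmp): time many entries into an empty parallel region and divide out:

```fortran
program omp_overhead
  use omp_lib
  implicit none
  integer, parameter :: nreps = 100000
  integer :: k
  real*8 :: t0, t1
  call omp_set_num_threads(2)
  t0 = omp_get_wtime()
  do k = 1, nreps
!$OMP PARALLEL
    ! empty region: only fork/join overhead is timed
!$OMP END PARALLEL
  enddo
  t1 = omp_get_wtime()
  print *, 'microseconds per region entry:', 1.0d6*(t1 - t0)/nreps
end program omp_overhead
```

Comparing that figure against the run time of one pass of the inner DO loop shows directly when the loop is too short to be worth threading.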

I have been doing some tests similar to yours lately with matrix multiplication (basically the same calculation as you are using). These tests have considered:

  • The number of threads being used,
  • The size of the multiplication, and
  • Use of cache size targeting strategies.

I have run these on a variety of processors to relate the thread count to cache size and memory speed. I tested an i5-2300, an i7-4790K and an i5-4200U. It is interesting to see how these perform vs. the number of threads in use, especially for the hyper-threaded 4200U. (I shall find a chart of results.)

One of the problems with loops like these, which do not have complex calculations, is that you are limited by the memory transfer rate. Caching strategies can improve the processing rate but still become limited by memory access rates with fewer threads. !$OMP is very well suited to threads with lots of complex calculations that fit into the cache. Once the calculation simplifies (e.g. vector multiplication) or becomes too big for the cache, you have to address the other bottlenecks.
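The cache-size targeting John mentions can be sketched as a blocked matrix multiply (block size nb is a guess to be tuned per processor; the innermost loop is kept stride-1 over the first index):

```fortran
program blocked_matmul
  implicit none
  integer, parameter :: n = 512, nb = 64   ! nb: tune to cache size
  real*8 :: a(n,n), b(n,n), c(n,n)
  integer :: i, j, k, ii, jj, kk
  call random_number(a) ; call random_number(b)
  c = 0.d0
  ! Threads split the outer jj blocks, so each thread writes its own
  ! columns of c; the blocking keeps working sets resident in cache.
!$OMP PARALLEL DO PRIVATE(i,j,k,ii,kk) SCHEDULE(static)
  do jj = 1, n, nb
    do kk = 1, n, nb
      do ii = 1, n, nb
        do j = jj, min(jj+nb-1, n)
          do k = kk, min(kk+nb-1, n)
            do i = ii, min(ii+nb-1, n)
              c(i,j) = c(i,j) + a(i,k)*b(k,j)
            enddo
          enddo
        enddo
      enddo
    enddo
  enddo
!$OMP END PARALLEL DO
  print *, 'c(1,1) =', c(1,1)
end program blocked_matmul
```

Graphing GFLOPs vs. nb for each machine should show the cache-size interaction John describes.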

Attached is a sample program I have adapted from another site to try to monitor these problems. You could test it on different processors and graph GFLOPs per second vs. threads in use for different matrix sizes; you will get an understanding of the interaction. The program uses different multiply approaches:

  • matmul intrinsic
  • single thread do loops
  • multi thread do loops
  • multi thread do loops, partitioned for cache size blocks

The differences between these results show the effects. I hope this helps.

John