Re: Performance of Intel MKL routines for convolutions/correlations

PierU · ‎10-12-2023

Hello,

I have a process where I need to do repeated 3D convolutions and correlations between "large" blocks of data (say 500x50x50) and small operators (typically 51x5x5, centered along all dimensions).

I wrote some routines for that, just a "naive" implementation with 5 nested loops, plus a call to BLAS `saxpy` or `sdot` in the innermost loop (that is, 6 nested loops, as expected).
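To show the index arithmetic in isolation, here is a 1D analogue of those naive loops, sketched in Python. The function names are hypothetical and the code is only an illustration: it mirrors the loop bounds used in the Fortran test code further down, where terms whose input index falls outside the data block are simply dropped (no padding).

```python
def conv1d_centered(x, f, lneg, lpos):
    """y[i] = sum over j in lneg..lpos of f[j] * x[i - j], keeping only
    the terms for which i - j is a valid index of x."""
    n = len(x)
    y = [0.0] * n
    for j in range(lneg, lpos + 1):
        fj = f[j - lneg]                # f is stored with an offset of lneg
        i_first = max(0, j)             # 0-based form of max(1, 1+j)
        i_last = min(n - 1, n - 1 + j)  # 0-based form of min(n, n+j)
        for i in range(i_first, i_last + 1):  # this loop plays the role of saxpy
            y[i] += fj * x[i - j]
    return y


def corr1d_centered(x, y, lneg, lpos):
    """f[j] = sum over valid i of y[i] * x[i - j]; the adjoint of the
    convolution above with respect to f."""
    n = len(x)
    f = [0.0] * (lpos - lneg + 1)
    for j in range(lneg, lpos + 1):
        i_first = max(0, j)
        i_last = min(n - 1, n - 1 + j)
        # this sum plays the role of sdot
        f[j - lneg] = sum(y[i] * x[i - j] for i in range(i_first, i_last + 1))
    return f
```

For example, `conv1d_centered([1, 2, 3, 4], [1, 10, 100], -1, 1)` applies a 3-point operator centered on each sample. The 3D routines repeat the same pattern along each of the three axes.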

Recently I tested the MKL routines that perform multidimensional convolutions/correlations, to see if I could get some speed-up compared to my naive implementation. And the results puzzle me somewhat...

For simplicity I took blocks and operators of equal sizes in all dimensions: data blocks are 100x100x100, and I am testing various operator sizes `lop * lop * lop`, with `lop` varying from 1 to 25. The MKL routines have 3 different modes: "DIRECT", "FFT", "AUTO", and I'm testing all of them. I am using ifort 18, with the `-O3 -mkl` options, and I set `MKL_NUM_THREADS=1` to prevent MKL from using multithreading.

The timings are shown below:

As expected, the FFT mode performance does not depend that much on the operator size (as all FFTs are probably set according to the largest size between the data block and the operator), and consequently this mode is faster for large enough operator sizes.
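To put rough numbers on this, here is a back-of-the-envelope sketch in Python (used only for the arithmetic). It assumes that the FFT path pads each axis to m = n + lop - 1 and performs three 3D transforms at about 5*M*log2(M) flops each. Both are textbook assumptions, not something documented for MKL in this thread, and the function names are made up for illustration.

```python
import math


def direct_macs(n, lop):
    """Multiply-adds of a direct 3D convolution of an n^3 block with a
    lop^3 operator, ignoring the truncation at the block edges."""
    return (n ** 3) * (lop ** 3)


def fft_flops(n, lop):
    """Rough flop count of an FFT-based 3D convolution.

    Assumptions (not taken from MKL): each axis is padded to
    m = n + lop - 1, three 3D FFTs are needed (two forward, one inverse),
    and one FFT of M points costs about 5 * M * log2(M) flops.
    """
    m = n + lop - 1
    points = m ** 3
    return 3 * 5 * points * math.log2(points)


if __name__ == "__main__":
    n = 100
    for lop in (1, 5, 25):
        print(lop, direct_macs(n, lop), round(fft_flops(n, lop)))
```

Under these assumptions the FFT estimate grows by only about a factor of 2 between lop = 1 and lop = 25 for n = 100, while the direct count grows by 25^3 = 15625. Constant factors and memory effects are ignored, so this only illustrates the trend.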

Beyond that, there are some surprises:

The DIRECT mode is always much slower than my naive loops. I was expecting this mode to implement something very similar to what I'm doing, but possibly better optimized (with blocking or whatever). At worst I was expecting similar runtimes, not larger ones.
In the correlations, the DIRECT mode is faster than the FFT mode for small operator sizes, but still the AUTO mode internally selects the FFT mode. It seems the AUTO mode always reverts to FFT, which looks like a bug to me.
In the FFT mode, correlations are about 4x slower than the convolutions (and I can't get why).

I would be interested in your comments and possible explanations.

The test code is below... BTW, how Intel distributes the headers is weird: instead of a .mod file, it is a source include file that contains Fortran modules. Not convenient when you have constraints on the module names in a project (unless you copy the file into the project and modify the names).

module convcorr_m
  implicit none

  public

contains
  !*******************************************************************************
  ! 3D convolution y = y + f*x                                                   *
  ! or                                                                           *
  ! 3D correlation f = f + x'*y                                                  *
  !*******************************************************************************
  subroutine cc_sconv3d(x,y,n1,n2,n3,f,l1n,l1p,l2n,l2p,l3n,l3p,cc,reset)

    implicit none

    ! Arguments
    integer         , intent(in   ) :: n1, n2, n3, l1n, l1p, l2n, l2p, l3n, l3p
    real            , intent(in   ) :: x(n1,n2,n3)
    real            , intent(inout) :: y(n1,n2,n3)
    real            , intent(inout) :: f(l1n:l1p,l2n:l2p,l3n:l3p)
    character(len=*), intent(in   ), optional :: cc
    logical         , intent(in   ), optional :: reset

    ! Local variables
    integer :: mode, i2, i3, j1, j2, j3, i1i, i1f, i2i, i2f, i3i, i3f
    real, external :: sdot    ! BLAS level 1: dot product
    external :: saxpy         ! BLAS level 1: y = a*x + y

    !*******************************************************************************

    mode = 0
    if (present(cc)) then
       if (cc == 'corr' .or. cc == 'CORR') mode = 1
    end if

    if (mode == 0) then

       if (present(reset)) then
          if (reset) y(:,:,:) = 0.0
       end if

       do j3 = l3n, l3p
          i3i = max(1,1+j3) ; i3f = min(n3,n3+j3)

          do i3 = i3i, i3f
             do j2 = l2n, l2p
                i2i = max(1,1+j2) ; i2f = min(n2,n2+j2)
                do i2 = i2i, i2f
                   do j1 = l1n, l1p
                      i1i = max(1,1+j1) ; i1f = min(n1,n1+j1)
                      call saxpy(i1f-i1i+1,f(j1,j2,j3),x(i1i-j1,i2-j2,i3-j3),1,y(i1i,i2,i3),1)
                   end do
                end do
             end do
          end do

       end do

    else

       if (present(reset)) then
          if (reset) f(:,:,:) = 0.0
       end if

       do j3 = l3n, l3p
          i3i = max(1,1+j3) ; i3f = min(n3,n3+j3)
          do j2 = l2n, l2p
             i2i = max(1,1+j2) ; i2f = min(n2,n2+j2)
             do i3 = i3i, i3f
                do i2 = i2i, i2f
                   do j1 = l1n, l1p
                      i1i = max(1,1+j1) ; i1f = min(n1,n1+j1)
                      f(j1,j2,j3) = f(j1,j2,j3) + sdot(i1f-i1i+1,y(i1i,i2,i3),1,x(i1i-j1,i2-j2,i3-j3),1)
                   end do
                end do
             end do
          end do
       end do

    end if
    

  end subroutine cc_sconv3d

end module convcorr_m


include "mkl_vsl.fi"


program convcorr
use iso_fortran_env
use convcorr_m
use mkl_vsl
implicit none

	integer, parameter :: n = 100
	real, allocatable :: x(:,:,:), y(:,:,:), f(:,:,:)

	integer :: lop, convmode, corrmode, stat, imode, npass
	type(vsl_conv_task) :: taskonv
	type(vsl_corr_task) :: taskorr

	integer(int64) :: tic, toc
	real(real64) :: rate


	allocate( x(n,n,n), y(n,n,n) )

	write(*,"(4X,4A24)") "---- LOOPS -----", "-- VSL_DIRECT --", "--- VSL_FFT ----", "--- VSL_AUTO ---"
	write(*,"(A4,8A12)") "LOP", "CONV", "CORR", "CONV", "CORR", "CONV", "CORR", "CONV", "CORR"

	do lop = 1, 25, 1

		write(*,"(4I4)",advance='no') lop

		allocate( f(lop,lop,lop) )
		call random_number(x); x = x - 0.5
		call random_number(f); f = f - 0.5

		call system_clock(tic,rate); npass = 0
		do
			npass = npass+1
			call cc_sconv3d(x,y,n,n,n,f,-lop/2,(lop-1)/2,-lop/2,(lop-1)/2,-lop/2,(lop-1)/2,reset=.true.)
			call system_clock(toc,rate)
			if ((toc-tic)/rate >= 1.0) exit
		end do

		write(*,"(F12.6)",advance='no') ((toc-tic)/rate)/npass

		call system_clock(tic,rate); npass = 0
		do
			npass = npass+1
			call cc_sconv3d(x,y,n,n,n,f,-lop/2,(lop-1)/2,-lop/2,(lop-1)/2,-lop/2,(lop-1)/2,cc='CORR',reset=.true.)
			call system_clock(toc,rate)
			if ((toc-tic)/rate >= 1.0) exit
		end do
		write(*,"(F12.6)",advance='no') ((toc-tic)/rate)/npass

		VSLMODE: do imode = 1, 3

			if (imode == 1) then
				convmode = VSL_CONV_MODE_DIRECT
				corrmode = VSL_CORR_MODE_DIRECT
			else if (imode == 2) then
				convmode = VSL_CONV_MODE_FFT
				corrmode = VSL_CORR_MODE_FFT
			else if (imode == 3) then
				convmode = VSL_CONV_MODE_AUTO
				corrmode = VSL_CORR_MODE_AUTO
			end if

			call random_number(x); x = x - 0.5
			call random_number(f); f = f - 0.5

			stat = vslsconvnewtask(taskonv, convmode, 3, [n,n,n], [lop,lop,lop], [n,n,n])
			stat = vslconvsetstart(taskonv,[lop/2,lop/2,lop/2])
			stat = vslconvsetinternalprecision(taskonv, VSL_CONV_PRECISION_SINGLE)

			stat = vslscorrnewtask(taskorr, corrmode, 3, [n,n,n], [n,n,n], [lop,lop,lop])
			stat = vslcorrsetstart(taskorr,[-lop/2,-lop/2,-lop/2])
			stat = vslcorrsetinternalprecision(taskorr, VSL_CORR_PRECISION_SINGLE)

			call system_clock(tic,rate); npass = 0
			do
				npass = npass+1
				stat = vslsconvexec(taskonv,x,[1,n,n*n],f,[1,lop,lop*lop],y,[1,n,n*n])
				call system_clock(toc,rate)
				if ((toc-tic)/rate >= 1.0) exit
			end do
			write(*,"(F12.6)",advance='no') ((toc-tic)/rate)/npass

			call system_clock(tic,rate); npass = 0
			do
				npass = npass+1
				stat = vslscorrexec(taskorr,x,[1,n,n*n],y,[1,n,n*n],f,[1,lop,lop*lop])
				call system_clock(toc,rate)
				if ((toc-tic)/rate >= 1.0) exit
			end do
			write(*,"(F12.6)",advance='no') ((toc-tic)/rate)/npass

			stat = vslconvdeletetask(taskonv)
			stat = vslcorrdeletetask(taskorr)

		end do VSLMODE

		deallocate( f )
		write(*,*)

	end do

end program

VarshaS_Intel · ‎10-16-2023

Hi,

Thanks for posting in Intel Communities.

Could you please let us know the OS details, the hardware details, and the Intel MKL version you are using?

When we run the sample reproducer code you provided on Ubuntu 22.04 with the ifx and ifort compilers and Intel MKL 2023.2, we get the following results:

Using IFORT compiler:

Using IFX compiler:

Could you provide us with a screenshot/log of your output so that we can investigate further on our end?

Thanks & Regards,

Varsha

PierU · ‎10-16-2023

Hello,

As I wrote, the compiler and associated MKL in my test are those of ifort 18, the compilation options were "-O3 -mkl", and OMP_NUM_THREADS was set for the execution.

The OS is Debian 9, and the CPU is a Xeon Silver 4214.

Here is the text output, this time forcing the sequential version of MKL (it doesn't really change the timings):

% which ifort
/opt/intel/18/bin/ifort
% ifort -O3 -mkl=sequential cc.f90 && ./a.out
            ---- LOOPS -----        -- VSL_DIRECT --        --- VSL_FFT ----        --- VSL_AUTO ---
 LOP        CONV        CORR        CONV        CORR        CONV        CORR        CONV        CORR
   1    0.000920    0.000712    0.032650    0.000612    0.050624    0.361804    0.040309    0.352007
   2    0.003162    0.003109    0.048142    0.004074    0.049432    0.343827    0.048406    0.321989
   3    0.008427    0.008863    0.080861    0.014033    0.046727    0.318507    0.046812    0.328444
   4    0.019814    0.020704    0.119471    0.038476    0.048586    0.305242    0.046755    0.306700
   5    0.037547    0.038698    0.201102    0.069796    0.046356    0.330719    0.048442    0.336143
   6    0.064708    0.065071    0.302219    0.144416    0.050406    0.329016    0.045567    0.334844
   7    0.101556    0.108460    0.433533    0.197357    0.045643    0.326450    0.047570    0.322815
   8    0.136095    0.138663    0.525409    0.314071    0.047657    0.338162    0.047652    0.369353
   9    0.214894    0.198988    0.695898    0.442307    0.042458    0.315070    0.048121    0.338131
  10    0.275649    0.317205    0.919781    0.618488    0.044645    0.323449    0.043268    0.313107
  11    0.366579    0.397676    1.151432    0.786910    0.044086    0.334565    0.043662    0.320125
  12    0.479751    0.483298    1.477239    1.083716    0.053920    0.400745    0.054196    0.372145
  13    0.673736    0.664420    1.900096    1.470831    0.052994    0.411026    0.041167    0.324439
  14    0.709325    0.713472    2.087543    1.518821    0.197426    0.318124    0.200461    0.326903
  15    0.860854    0.914113    2.718536    1.934318    0.208040    0.332190    0.192546    0.345011
  16    1.178163    1.195436    3.121390    2.630440    0.210792    0.361835    0.205992    0.319658
  17    1.315133    1.411694    3.527859    3.118241    0.208020    0.342037    0.222061    0.335961
  18    1.635286    1.643030    4.190005    3.635694    0.056894    0.356624    0.056922    0.358222
  19    1.922034    2.077460    4.806284    3.900750    0.052932    0.342664    0.051664    0.335496
  20    2.043099    2.018506    5.306534    4.443163    0.053061    0.343031    0.049266    0.324691
  21    2.425856    2.508777    6.264581    5.418108    0.053774    0.325043    0.052564    0.335544
  22    2.837743    2.843541    7.108058    5.901328    0.225201    0.334280    0.243002    0.351135
  23    3.204777    3.285799    7.969262    6.874970    0.233537    0.326068    0.228258    0.332914
  24    3.466661    3.792434    8.498142    8.040788    0.276667    0.360287    0.250703    0.338130
  25    3.910451    3.979430    9.637301    8.103801    0.226557    0.341711    0.234451    0.321385
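
Since AUTO tracks the FFT timings in this run, one possible workaround is to pick the mode by hand from measured timings. The sketch below (Python, used only to analyse the numbers) copies a few rows from the table above and reports which of the naive loops, DIRECT, or FFT is fastest for each. The helper names are hypothetical, and the resulting thresholds apply only to this machine, this MKL version, and this block size.

```python
MODES = ("loops", "direct", "fft")

# Rows copied from the table above: lop -> (LOOPS, VSL_DIRECT, VSL_FFT) in seconds.
CONV = {
    1: (0.000920, 0.032650, 0.050624),
    5: (0.037547, 0.201102, 0.046356),
    6: (0.064708, 0.302219, 0.050406),
    10: (0.275649, 0.919781, 0.044645),
    11: (0.366579, 1.151432, 0.044086),
    25: (3.910451, 9.637301, 0.226557),
}
CORR = {
    1: (0.000712, 0.000612, 0.361804),
    5: (0.038698, 0.069796, 0.330719),
    6: (0.065071, 0.144416, 0.329016),
    10: (0.317205, 0.618488, 0.323449),
    11: (0.397676, 0.786910, 0.334565),
    25: (3.979430, 8.103801, 0.341711),
}


def fastest(timings):
    """Name of the mode with the smallest time in one table row."""
    return min(zip(timings, MODES))[1]


def first_fft_win(table):
    """Smallest sampled lop for which the FFT mode is the fastest."""
    for lop in sorted(table):
        if fastest(table[lop]) == "fft":
            return lop
    return None


if __name__ == "__main__":
    print("conv: FFT first wins at lop =", first_fft_win(CONV))
    print("corr: FFT first wins at lop =", first_fft_win(CORR))
```

On these rows the FFT mode first becomes the fastest option at lop = 6 for the convolution and at lop = 11 for the correlation, so a hand-written switch on `lop` would avoid the FFT path for small operators.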

VarshaS_Intel · ‎10-24-2023

Hi,

Thanks for the details.

Could you please provide us with the complete output/results you are getting with the naive loops so that we can understand and investigate more?

Thanks & Regards,

Varsha

PierU · ‎10-29-2023

Hi,

It's unclear to me what "complete output/results [I am] getting with the naive loops" you are expecting me to post, as I have already posted the source code, and the terminal output of the execution...

Regards,

VarshaS_Intel · ‎11-08-2023

Hi,

We are working on your issue internally and will get back to you soon.

Thanks & Regards,

Varsha

VarshaS_Intel · ‎12-04-2023

Hi,

We were able to reproduce your issue and have reported it to our development team, who are looking into it. We will update you once we hear back from them.

Thanks & Regards,

Varsha

VarshaS_Intel · ‎12-15-2023

Hi,

Thanks for helping us improve our products!

We’ve submitted the feature request to the dev team; they will consider it based on multiple factors including, but not limited to, the priority and criticality of the feature. Once it is included in an upcoming release, it will be documented in the release notes.

Thanks & Regards,

Varsha