Hello,
I have a process where I need to do repeated 3D convolutions and correlations between "large" blocks of data (say 500x50x50) and small operators (typically 51x5x5, centered along all dimensions).
I wrote some routines for that: a "naive" implementation with 5 nested loops, plus a call to BLAS `saxpy` or `sdot` in the innermost loop (so effectively 6 nested loops, as expected).
Recently I tested the MKL routines that perform multidimensional convolutions/correlations, to see if I could get some speed-up compared to my naive implementation. And the results puzzle me somewhat...
For simplicity I took blocks and operators of equal sizes in all dimensions: data blocks are 100x100x100, and I am testing various operator sizes `lop * lop * lop`, with `lop` varying from 1 to 25. The MKL routines have 3 different modes: "DIRECT", "FFT", "AUTO", and I'm testing all of them. I am compiling with ifort 18 and the `-O3 -mkl` options, and I set `MKL_NUM_THREADS=1` to prevent MKL from using multithreading.
The timings are shown below:
As expected, the FFT mode performance does not depend that much on the operator size (as all FFTs are probably set according to the largest size between the data block and the operator), and consequently this mode is faster for large enough operator sizes.
But there are also some surprises:
- The DIRECT mode is always much slower than my naive loops. I was expecting this mode to implement something very similar to what I'm doing, but possibly better optimized (with blocking or whatever). At worst I was expecting similar runtimes, not larger ones.
- In the correlations, the DIRECT mode is faster than the FFT mode for small operator sizes, but still the AUTO mode internally selects the FFT mode. It seems the AUTO mode always reverts to FFT, which looks like a bug to me.
- In the FFT mode, correlations are about 4x slower than the convolutions (and I can't get why).
I would be interested in your comments and possible explanations.
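A rough operation count makes the expected crossover between the direct and FFT approaches concrete. The sketch below is only a back-of-the-envelope model, not a description of what MKL does internally: the module and function names are hypothetical, and the FFT estimate assumes zero-padding to `n + lop - 1` points per dimension, three 3D transforms, and about `5*M*log2(M)` flops per complex transform of `M` points.

```fortran
module conv_cost_m
  use iso_fortran_env, only: real64
  implicit none
  private
  public :: direct_flops, fft_flops
contains

  ! Flop count of a direct 3D convolution: one multiply and one add per
  ! (data sample, operator coefficient) pair, ignoring edge effects.
  pure function direct_flops(n, lop) result(flops)
    integer, intent(in) :: n, lop
    real(real64) :: flops
    flops = 2.0_real64 * real(n, real64)**3 * real(lop, real64)**3
  end function direct_flops

  ! Textbook estimate for an FFT-based convolution, ASSUMING padding to
  ! n + lop - 1 points per dimension, three 3D transforms (two forward,
  ! one inverse) and 5*M*log2(M) flops per transform of M points in total.
  ! The pointwise product in the frequency domain is neglected.
  pure function fft_flops(n, lop) result(flops)
    integer, intent(in) :: n, lop
    real(real64) :: flops, mtot
    mtot = real(n + lop - 1, real64)**3
    flops = 3.0_real64 * 5.0_real64 * mtot * log(mtot) / log(2.0_real64)
  end function fft_flops

end module conv_cost_m
```

With `n = 100`, comparing `direct_flops(n, lop)` against `fft_flops(n, lop)` puts the break-even between `lop = 5` and `lop = 6` under these assumptions. Actual runtimes also depend on memory traffic and on how well the padded size factorizes, so this is only an order-of-magnitude guide.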
The test code is below... BTW, the way Intel distributes the interfaces is weird: instead of a .mod file, you get a source include file that contains Fortran modules. That is inconvenient when a project has constraints on module names (unless you copy the file into the project and rename the modules).
module convcorr_m
implicit none
public
contains

!*******************************************************************************
! 3D convolution   y = y + f*x
!   or
! 3D correlation   f = f + x'*y
!*******************************************************************************
subroutine cc_sconv3d(x,y,n1,n2,n3,f,l1n,l1p,l2n,l2p,l3n,l3p,cc,reset)
   implicit none
   ! Arguments
   integer         , intent(in   )           :: n1, n2, n3, l1n, l1p, l2n, l2p, l3n, l3p
   real            , intent(in   )           :: x(n1,n2,n3)
   real            , intent(inout)           :: y(n1,n2,n3)
   real            , intent(inout)           :: f(l1n:l1p,l2n:l2p,l3n:l3p)
   character(len=*), intent(in   ), optional :: cc
   logical         , intent(in   ), optional :: reset
   ! Local variables
   integer :: mode, i2, i3, j1, j2, j3, i1i, i1f, i2i, i2f, i3i, i3f
   real    :: sdot
!*******************************************************************************
   mode = 0
   if (present(cc)) then
      if (cc == 'corr' .or. cc == 'CORR') mode = 1
   end if

   if (mode == 0) then

      if (present(reset)) then
         if (reset) y(:,:,:) = 0.0
      end if
      do j3 = l3n, l3p
         i3i = max(1,1+j3) ; i3f = min(n3,n3+j3)
         do i3 = i3i, i3f
            do j2 = l2n, l2p
               i2i = max(1,1+j2) ; i2f = min(n2,n2+j2)
               do i2 = i2i, i2f
                  do j1 = l1n, l1p
                     i1i = max(1,1+j1) ; i1f = min(n1,n1+j1)
                     call saxpy(i1f-i1i+1,f(j1,j2,j3),x(i1i-j1,i2-j2,i3-j3),1,y(i1i,i2,i3),1)
                  end do
               end do
            end do
         end do
      end do

   else

      if (present(reset)) then
         if (reset) f(:,:,:) = 0.0
      end if
      do j3 = l3n, l3p
         i3i = max(1,1+j3) ; i3f = min(n3,n3+j3)
         do j2 = l2n, l2p
            i2i = max(1,1+j2) ; i2f = min(n2,n2+j2)
            do i3 = i3i, i3f
               do i2 = i2i, i2f
                  do j1 = l1n, l1p
                     i1i = max(1,1+j1) ; i1f = min(n1,n1+j1)
                     f(j1,j2,j3) = f(j1,j2,j3) + sdot(i1f-i1i+1,y(i1i,i2,i3),1,x(i1i-j1,i2-j2,i3-j3),1)
                  end do
               end do
            end do
         end do
      end do

   end if

end subroutine cc_sconv3d

end module convcorr_m
include "mkl_vsl.fi"

program convcorr
   use iso_fortran_env
   use convcorr_m
   use mkl_vsl
   implicit none

   integer, parameter :: n = 100
   real, allocatable :: x(:,:,:), y(:,:,:), f(:,:,:)
   integer :: lop, convmode, corrmode, stat, imode, npass
   type(vsl_conv_task) :: taskonv
   type(vsl_corr_task) :: taskorr
   integer(int64) :: tic, toc
   real(real64) :: rate

   allocate( x(n,n,n), y(n,n,n) )

   write(*,"(4X,4A24)") "---- LOOPS -----", "-- VSL_DIRECT --", "--- VSL_FFT ----", "--- VSL_AUTO ---"
   write(*,"(A4,8A12)") "LOP", "CONV", "CORR", "CONV", "CORR", "CONV", "CORR", "CONV", "CORR"

   do lop = 1, 25, 1

      write(*,"(4I4)",advance='no') lop

      allocate( f(lop,lop,lop) )

      call random_number(x); x = x - 0.5
      call random_number(f); f = f - 0.5

      call system_clock(tic,rate); npass = 0
      do
         npass = npass+1
         call cc_sconv3d(x,y,n,n,n,f,-lop/2,(lop-1)/2,-lop/2,(lop-1)/2,-lop/2,(lop-1)/2,reset=.true.)
         call system_clock(toc,rate)
         if ((toc-tic)/rate >= 1.0) exit
      end do
      write(*,"(F12.6)",advance='no') ((toc-tic)/rate)/npass

      call system_clock(tic,rate); npass = 0
      do
         npass = npass+1
         call cc_sconv3d(x,y,n,n,n,f,-lop/2,(lop-1)/2,-lop/2,(lop-1)/2,-lop/2,(lop-1)/2,cc='CORR',reset=.true.)
         call system_clock(toc,rate)
         if ((toc-tic)/rate >= 1.0) exit
      end do
      write(*,"(F12.6)",advance='no') ((toc-tic)/rate)/npass

      VSLMODE: do imode = 1, 3

         if (imode == 1) then
            convmode = VSL_CONV_MODE_DIRECT
            corrmode = VSL_CORR_MODE_DIRECT
         else if (imode == 2) then
            convmode = VSL_CONV_MODE_FFT
            corrmode = VSL_CORR_MODE_FFT
         else if (imode == 3) then
            convmode = VSL_CONV_MODE_AUTO
            corrmode = VSL_CORR_MODE_AUTO
         end if

         call random_number(x); x = x - 0.5
         call random_number(f); f = f - 0.5

         stat = vslsconvnewtask(taskonv, convmode, 3, [n,n,n], [lop,lop,lop], [n,n,n])
         stat = vslconvsetstart(taskonv,[lop/2,lop/2,lop/2])
         stat = vslconvsetinternalprecision(taskonv, VSL_CONV_PRECISION_SINGLE)

         stat = vslscorrnewtask(taskorr, corrmode, 3, [n,n,n], [n,n,n], [lop,lop,lop])
         stat = vslcorrsetstart(taskorr,[-lop/2,-lop/2,-lop/2])
         stat = vslcorrsetinternalprecision(taskorr, VSL_CORR_PRECISION_SINGLE)

         call system_clock(tic,rate); npass = 0
         do
            npass = npass+1
            stat = vslsconvexec(taskonv,x,[1,n,n*n],f,[1,lop,lop*lop],y,[1,n,n*n])
            call system_clock(toc,rate)
            if ((toc-tic)/rate >= 1.0) exit
         end do
         write(*,"(F12.6)",advance='no') ((toc-tic)/rate)/npass

         call system_clock(tic,rate); npass = 0
         do
            npass = npass+1
            stat = vslscorrexec(taskorr,x,[1,n,n*n],y,[1,n,n*n],f,[1,lop,lop*lop])
            call system_clock(toc,rate)
            if ((toc-tic)/rate >= 1.0) exit
         end do
         write(*,"(F12.6)",advance='no') ((toc-tic)/rate)/npass

         stat = vslconvdeletetask(taskonv)
         stat = vslcorrdeletetask(taskorr)

      end do VSLMODE

      deallocate( f )
      write(*,*)

   end do

end program
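The test program stores every VSL return value in `stat` but never inspects it, so a failed task creation or execution would go unnoticed and still produce a timing. A minimal sketch of a status check is given below. The module and routine names are hypothetical, and the local `status_ok = 0` is an assumption standing in for `VSL_STATUS_OK`; in real code it is preferable to compare against the constant provided by `mkl_vsl`.

```fortran
module vsl_check_m
  use iso_fortran_env, only: error_unit
  implicit none
  private
  public :: vsl_failed, check_vsl

  ! ASSUMPTION: stands in for VSL_STATUS_OK; use the mkl_vsl constant
  ! directly when that module is available.
  integer, parameter :: status_ok = 0

contains

  ! True when a VSL status code signals anything other than success.
  pure logical function vsl_failed(stat)
    integer, intent(in) :: stat
    vsl_failed = (stat /= status_ok)
  end function vsl_failed

  ! Abort with a message naming the failing call (label is free text).
  subroutine check_vsl(stat, label)
    integer, intent(in) :: stat
    character(len=*), intent(in) :: label
    if (vsl_failed(stat)) then
      write(error_unit, "(3A,I0)") "VSL call failed in ", label, ", status = ", stat
      error stop 1
    end if
  end subroutine check_vsl

end module vsl_check_m
```

In the benchmark this would be used as `stat = vslsconvexec(...)` followed by `call check_vsl(stat, "vslsconvexec")`, placed outside the timed region where possible so that the check does not affect the measurements.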
Hi,
Thanks for posting in Intel Communities.
Could you please let us know the OS details, hardware details, and the Intel MKL version you are using?
When we run the sample reproducer you provided on Ubuntu 22.04 with the ifx and ifort compilers and Intel MKL 2023.2, we get the following results:
Using IFORT compiler:
Using IFX compiler:
Could you provide us with a screenshot or log of your output so that we can investigate further from our end?
Thanks & Regards,
Varsha
Hello,
As I wrote, the compiler and associated MKL in my test are those of ifort 18, the compilation options were "-O3 -mkl", and MKL_NUM_THREADS=1 was set for the execution.
The OS is Debian 9, and the CPU is a Xeon Silver 4214.
Here is the text output when forcing the sequential version of MKL (it doesn't really change the timings):
% which ifort
/opt/intel/18/bin/ifort
% ifort -O3 -mkl=sequential cc.f90 && ./a.out
---- LOOPS ----- -- VSL_DIRECT -- --- VSL_FFT ---- --- VSL_AUTO ---
LOP CONV CORR CONV CORR CONV CORR CONV CORR
1 0.000920 0.000712 0.032650 0.000612 0.050624 0.361804 0.040309 0.352007
2 0.003162 0.003109 0.048142 0.004074 0.049432 0.343827 0.048406 0.321989
3 0.008427 0.008863 0.080861 0.014033 0.046727 0.318507 0.046812 0.328444
4 0.019814 0.020704 0.119471 0.038476 0.048586 0.305242 0.046755 0.306700
5 0.037547 0.038698 0.201102 0.069796 0.046356 0.330719 0.048442 0.336143
6 0.064708 0.065071 0.302219 0.144416 0.050406 0.329016 0.045567 0.334844
7 0.101556 0.108460 0.433533 0.197357 0.045643 0.326450 0.047570 0.322815
8 0.136095 0.138663 0.525409 0.314071 0.047657 0.338162 0.047652 0.369353
9 0.214894 0.198988 0.695898 0.442307 0.042458 0.315070 0.048121 0.338131
10 0.275649 0.317205 0.919781 0.618488 0.044645 0.323449 0.043268 0.313107
11 0.366579 0.397676 1.151432 0.786910 0.044086 0.334565 0.043662 0.320125
12 0.479751 0.483298 1.477239 1.083716 0.053920 0.400745 0.054196 0.372145
13 0.673736 0.664420 1.900096 1.470831 0.052994 0.411026 0.041167 0.324439
14 0.709325 0.713472 2.087543 1.518821 0.197426 0.318124 0.200461 0.326903
15 0.860854 0.914113 2.718536 1.934318 0.208040 0.332190 0.192546 0.345011
16 1.178163 1.195436 3.121390 2.630440 0.210792 0.361835 0.205992 0.319658
17 1.315133 1.411694 3.527859 3.118241 0.208020 0.342037 0.222061 0.335961
18 1.635286 1.643030 4.190005 3.635694 0.056894 0.356624 0.056922 0.358222
19 1.922034 2.077460 4.806284 3.900750 0.052932 0.342664 0.051664 0.335496
20 2.043099 2.018506 5.306534 4.443163 0.053061 0.343031 0.049266 0.324691
21 2.425856 2.508777 6.264581 5.418108 0.053774 0.325043 0.052564 0.335544
22 2.837743 2.843541 7.108058 5.901328 0.225201 0.334280 0.243002 0.351135
23 3.204777 3.285799 7.969262 6.874970 0.233537 0.326068 0.228258 0.332914
24 3.466661 3.792434 8.498142 8.040788 0.276667 0.360287 0.250703 0.338130
25 3.910451 3.979430 9.637301 8.103801 0.226557 0.341711 0.234451 0.321385
Hi,
Thanks for the details.
Could you please provide us with the complete output/results you are getting with the naive loops so that we can understand and investigate more?
Thanks & Regards,
Varsha
Hi,
It's unclear to me what "complete output/results [I am] getting with the naive loops" you are expecting me to post, as I have already posted the source code, and the terminal output of the execution...
Regards,
Hi,
We are working on your issue internally and will get back to you soon.
Thanks & Regards,
Varsha
Hi,
We were able to reproduce your issue and have reported it to our development team, who are looking into it. We will update you once we hear back from them.
Thanks & Regards,
Varsha
Hi,
Thanks for helping us improve our products!
We've submitted the feature request to the dev team. They will consider it based on multiple factors including, but not limited to, the priority and criticality of the feature. Once it is included in an upcoming release, it will be documented in the release notes.
Thanks & Regards,
Varsha