Intel® oneAPI Math Kernel Library

Speed of vslsconv in 10.1

wuebbel
Beginner

Hi,

I'm puzzled by the speed of vslsconv. I'm trying to convolve a 1000 by 1000 image with a 10 by 10 (or smaller) stencil. An appropriate FFT algorithm should zero-pad both to 1024 by 1024, take Fourier transforms (where the FT of the small stencil can of course be done fast), multiply pointwise, and do the back transform. Since everything is real, this can be done with real single-precision FFTs, and the time taken should be not much more than twice (at most three times) the time for one real FT, plus some overhead for the multiplication. (Besides, a nice feature would be the ability to pass the convolution a precomputed FT of the stencil, since the stencil is often constant over many calls.)
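To make this concrete, here is a rough, untested sketch of the scheme I have in mind, written directly against the DFTI C API. The names make_plan and fft_convolve are my own, I'm assuming the DFTI_COMPLEX_COMPLEX conjugate-even storage is available in the MKL version at hand, and I've left out all status checking:

#include <stdlib.h>
#include "mkl_dfti.h"

#define NX 1024              /* 1000 + 10 - 1 = 1009, rounded up to a power of 2 */
#define NC (NX / 2 + 1)      /* complex columns in conjugate-even storage */

/* One 2D single-precision real<->complex descriptor; 'backward' swaps the
 * stride roles so the same helper serves both directions. */
static DFTI_DESCRIPTOR_HANDLE make_plan(int backward)
{
    DFTI_DESCRIPTOR_HANDLE h = NULL;
    MKL_LONG dims[2] = { NX, NX };
    MKL_LONG rstr[3] = { 0, NX, 1 };   /* real-domain layout    */
    MKL_LONG cstr[3] = { 0, NC, 1 };   /* complex-domain layout */
    DftiCreateDescriptor(&h, DFTI_SINGLE, DFTI_REAL, 2, dims);
    DftiSetValue(h, DFTI_PLACEMENT, DFTI_NOT_INPLACE);
    DftiSetValue(h, DFTI_CONJUGATE_EVEN_STORAGE, DFTI_COMPLEX_COMPLEX);
    DftiSetValue(h, DFTI_INPUT_STRIDES,  backward ? cstr : rstr);
    DftiSetValue(h, DFTI_OUTPUT_STRIDES, backward ? rstr : cstr);
    DftiCommitDescriptor(h);
    return h;
}

/* image, stencil, result: NX*NX floats, data zero-padded to NX x NX.
 * Two forward FFTs, a pointwise product, one backward FFT. With a constant
 * stencil, its transform (fs) could be computed once and reused. */
static void fft_convolve(float *image, float *stencil, float *result)
{
    MKL_Complex8 *fi = malloc(sizeof(*fi) * NX * NC);
    MKL_Complex8 *fs = malloc(sizeof(*fs) * NX * NC);
    DFTI_DESCRIPTOR_HANDLE fwd = make_plan(0), bwd = make_plan(1);
    const float scale = 1.0f / ((float)NX * (float)NX);  /* FFT normalization */

    DftiComputeForward(fwd, image, fi);
    DftiComputeForward(fwd, stencil, fs);

    for (long i = 0; i < (long)NX * NC; i++) {           /* fi *= fs, scaled */
        float re = fi[i].real * fs[i].real - fi[i].imag * fs[i].imag;
        float im = fi[i].real * fs[i].imag + fi[i].imag * fs[i].real;
        fi[i].real = re * scale;
        fi[i].imag = im * scale;
    }

    DftiComputeBackward(bwd, fi, result);

    DftiFreeDescriptor(&fwd);
    DftiFreeDescriptor(&bwd);
    free(fi);
    free(fs);
}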

A 1024 by 1024 FT on our four-core Q9550 @ 2.83 GHz takes about 2.6 milliseconds, so I expected the convolution to come in below 7 milliseconds. However, a call to vslsConvExec takes roughly 70 milliseconds. Part of this is due to the fact that, in the same compiler environment, calling the FFT directly uses multiple cores, while the FFT invoked internally by vslsConvExec runs on a single core only. But even assuming I could somehow recover that factor of 4, the resulting ~17.5 milliseconds would still be well above the 7 I expected. Switching to direct computation instead of FFT does not help; the time is no smaller.

Any ideas? We're currently using 10.1. I'm including the main parameters below, followed by a sketch of the call sequence.

Frank

FFT:
  precision = DFTI_SINGLE;
  forward_domain = DFTI_REAL;

Convolution:
  mode = VSL_CONV_MODE_FFT;
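For completeness, this is roughly how the task is set up (an untested sketch; error checks and real data omitted, the calloc calls are only there to make it self-contained):

#include <stdlib.h>
#include "mkl_vsl.h"

int main(void)
{
    /* 1000x1000 image, 10x10 stencil, full 1009x1009 result, row-major */
    MKL_INT xshape[2] = { 1000, 1000 }, xstride[2] = { 1000, 1 };
    MKL_INT yshape[2] = {   10,   10 }, ystride[2] = {   10, 1 };
    MKL_INT zshape[2] = { 1009, 1009 }, zstride[2] = { 1009, 1 };

    float *x = calloc(1000 * 1000, sizeof(float));   /* image   */
    float *y = calloc(  10 *   10, sizeof(float));   /* stencil */
    float *z = calloc(1009 * 1009, sizeof(float));   /* result  */

    VSLConvTaskPtr task;
    vslsConvNewTask(&task, VSL_CONV_MODE_FFT, 2, xshape, yshape, zshape);
    vslsConvExec(task, x, xstride, y, ystride, z, zstride);  /* ~70 ms here */
    vslConvDeleteTask(&task);

    free(x); free(y); free(z);
    return 0;
}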

Vladimir_Petrov__Int
New Contributor III

Frank,

Thank you for your interest in MKL.

Your analysis has one weak point: the convolution implemented in MKL is not circular (cyclic). To apply the technique you described in your post, it therefore needs FFTs of size 2048x2048, not 1024x1024. Keep in mind that data of this size no longer fit in cache, so the performance in Gflop/s is worse than for a 1024x1024 FFT. This means the computation time for one FFT of 2048x2048 data points will be noticeably more than four times that of 1024x1024 data points (four times as many points, a larger log factor, and the cache effect on top).

As to the threading, my expectation is that the convolution in MKL is threaded.

Best regards,

-Vladimir

wuebbel
Beginner

Dear Vladimir,

thanks for your quick answer! I'm sorry this is getting a little technical now, but I think you're not right: the image is 1000 by 1000, but the stencil is only 10 by 10, so the original image needs padding of only 9 pixels per dimension (1000 + 10 - 1 = 1009) to guarantee that the cyclic convolution equals the "normal" one, and the next power of 2 is 1024.

The reason I'm so picky here: we're trying to make an honest comparison of MKL implementations of basic imaging techniques against CUDA-based implementations for hardware selection. The 1000 by 1000 image convolution with a 10 by 10 stencil is one of the most convincing demos in CUDA (and, by the way, it boils down to exactly a 1024 by 1024 FFT). Currently, MKL is behind by a factor of much more than 10, which is totally unacceptable given that we found the raw FFT speed to be fairly comparable.

> As to the threading, my expectation is that the convolution in MKL is threaded.

Hmm. In my test, within the same program, the FFT without additional parameters is threaded, while the convolution using FFT is not (that's for 10.1, though; I'll be switching to 10.2 next week). Do you have an idea whether there is a switch to turn threading on? I couldn't find one.
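For reference, this is the minimal check I run to see how many threads MKL will use around both calls (mkl_get_max_threads and mkl_set_num_threads are from mkl_service.h; I'm assuming they behave the same way in 10.1):

#include <stdio.h>
#include "mkl_service.h"

int main(void)
{
    printf("MKL max threads: %d\n", mkl_get_max_threads());
    mkl_set_num_threads(4);                 /* ask for all four cores */
    printf("after request:   %d\n", mkl_get_max_threads());
    return 0;
}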

I'll try to come up with a competing implementation to better illustrate the problem.

Best wishes, Frank

Vladimir_Petrov__Int
New Contributor III

Frank,

You are right - I am not right. I apologize for misleading you.

Of course 1024 should be enough to avoid overlap.

MKL does seem to have a problem choosing the optimal algorithm in this situation: it erroneously favors the overlap-add method and ends up performing a series of small 2D FFTs instead of one large one.

This issue will be fixed in one of our future releases.

Best regards,

-Vladimir
