Fastest way to do 3D FFT for correlation?

jheyd · ‎05-24-2007

Hi everybody,

I'm contemplating switching from gcc/fftw to icc/mkl. The central loop in my biophysics code is a FFT-accelerated 3D translational search. So I need to forward FFT, multiply, backward FFT for every possible orientation of the probe structure (to perform an exhaustive 6D search). Usually, we work with 2000-20000 angles so the FFT performance is crutial. Matrix sizes are between 50^3 and 200^3.

Searching through the archives here, I already saw that radixes 2, 3, 5, 7, and 11 are supported. I'll modify the code to not allow a single factor of 13 (which FFTW supports). Looking through the documentation, though, I couldn't find any real hints about what the fastest options for my application would be.

I definitely need non-power-of-two transforms (but padding to the nearest best size in each dimension given the supported radixes).

I guess I want DFTI_BACKWARD_SCRAMBLED?

How about the transposition setting (DFTI_ALLOW)? I could deal with a transposed solution since I have to digest it, anyway.

What's the fastest data layout? Coming from FFTW, DFTI_CONJUGATE_EVEN_STORAGE and DFTI_COMPLEX_COMPLEX
is what the wrapper suggests and that would make MKL a drop in replacement. However, is the the fastest storage layout?

Thanks a lot in advance for any pointers about this. I'd appreciate any other ideas you have, as well.

Best,

JJ

Intel_C_Intel · ‎05-29-2007

Thanks of considering changing to the use of MKL. Your questions are interesting, but I am not certain I can answer them off the top of my head.

Certainly keeping data in the scrambled order and avoiding a transposition of the data back to its original orientation will improve performance. However, the reordering of data will by small gains while tranposing data will give better returns.

I will ask the folk who actually develop the code about optimal data storage schemes.

Bruce

jheyd · ‎05-30-2007

Dear Bruce,

thanks a lot for following up on this.

I just looked through the MKL reference manual again and the info on the transposition option is a bit sparse. It says "an arbitrary permutation of the three dimensions". So this is a straight transpose (along some dimensions), right?

My code does

real 3D matrix A -FFT-> complex A*
|--multiply-> complex C* -FFT-> real C
real 3D matrix B -FFT-> complex B*

to calculate the correlation as a function of translation of some object in A. I don't care about the order of A* or B*, as long as both are the same. I also don't really care about the order of C, as long as it's consistent. I guess the only variable there is the size of the dimensions that A and B (same dimensions, of course) have. I process C anyway and can take any transposition into account. However, if it's not always the same, I need to know how it got transposed...

Almost forgot: What's faster: in place or out of place?

Thanks again,

JJ

nmustaev · ‎05-31-2007

Hi JJ

I'm glad to help you.

As I understand you call Complex-to-Real 3D FFT.

Threre are some FFT examples which you can see and run.

You can see examples of out-of-place Complex-to-Real 3D FFT in examplesdftcsource eal_3d_cce_single_ex2.c, examplesfftw3xsource eal_3d_cce_single_ex2.c, examplesfftw2xsource eal_3d_ex2.c for MKL DFTI, FFTW3.x and FFTW2.x correspondingly.

Real FFT is computed with ordered output. 3D FFT is computed by 1D FFTfor each dimension.

Sorry I didn't find "an arbitrary permutation of the three dimensions" in MKL manual. Could you say the chapterand page which you saw one in?

As regards the performance in general MKL bits FFTW on3D FFT except on small matrices (less 25x25x25). Especially on threaded mode.

As usualin-place 3D FFT is equal and better out-of-place.

Please ask meyour questions more.

Regards,

Nadezhda

jheyd · ‎05-31-2007

Dear Nadezhda,

Thanks for helping me with this. I've looked at the examples and I don't have any problems with using MKL as a drop-in replacement for FFTW. My questions mainly revolve around the optimal way of doing the above correlation calculation with the MKL.

If you could please confirm/answer the following statements/questions with regard to the optimal FFT setup.

Dimensions of 3D array should be multiples of 2, 3, 5, 7, and 11.

DFTI_BACKWARD_SCRAMBLED will make things faster? In your previous reply, you seem to imply that the output is aways ordered. So this option doesn't have any effect in my case?

Which is the fastest data layout: CCE, CCS, Pack or Perm?

In-place transforms are faster than out-of-place? (sorry, I didn't quite understand your previous reply) How big is the speed difference?

DFTI_ALLOW will make things faster?

About the transposition/DFTI_ALLOW: On page 11-46 of the MKL Reference Manual it says:

Note also that in dimension 3 and above, transposition means an
arbitrary permutation of the dimension.

That's all the information it gives. Is there a way to tell which dimensions are transposed? How arbitrary is it? I mean, does any 3D FFT, regardless of length of the individual dimensions, produces the same transposition? If so, it can easiely be dealt with. OTOH, if different dimension lengths produce different permutations, it would be difficult. The "arbitrary" above suggests the latter case... If so, it would be crucial to know which transpositions took place.

Thanks again for your insight into these issues.

Best,

JJ

jheyd · ‎06-06-2007

I just read in the MKL 9.1 release notes and they say the following about TRANSPOSE:

Mode DFTI_TRANSPOSE is implemented only for the default case.

Does that mean the question about this feature is not relevant?

I'm still interested in the answers to the other questions, though.

TIA.

Best,

JJ

nmustaev · ‎06-19-2007

Hi JJ,

Please see my comments below.

Dimensions of 3D array should be multiples of 2, 3, 5, 7, and 11.

Yes, it should. We have these optimized radixes.

DFTI_BACKWARD_SCRAMBLED will make things faster? In your previous reply, you seem to imply that the output is aways ordered. So this option doesn't have any effect in my case?

Yes, output is ordered for Real FFT.

Which is the fastest data layout: CCE, CCS, Pack or Perm?

CCE format is more optimized for multidimensional FFT.

In-place transforms are faster than out-of-place? (sorry, I didn't quite understand your previous reply) How big is the speed difference?

It's difficult to say exactly without processor information. Please let me know what processor you use.

DFTI_ALLOW will make things faster?

About the transposition/DFTI_ALLOW: On page 11-46 of the MKL Reference Manual it says:

Note also that in dimension 3 and above, transposition means an
arbitrary permutation of the dimension.

DFTI_TRANSPOSE has DFTI_NONE value only now. But DFTI is being developed. You can open feature request in QUAD. As usual we meet customer's wishes as far as possible. DFTI_ALLOW may optimize FFT because of avoiding of extra copies. But I have some comments. At first this approach demands on a lot of additional memory. One additional array in in-place 2D FFT and out-of-place 3D FFT, two additional arraysin in-place 3D FFT. Otherwise small additional arrays with extra copiesareinevitable and there is no any benefit. Second we will have T_OUT(j,i) = OUT(i,j)for 2D FFT and T_OUT(k,i,j) = OUT(i,j,k) for 3D FFT. Sonot T_OUT(k,,j, i) and this fact means arbitrary permutation.

Regards,

Nadezhda

jheyd · ‎06-29-2007

Hi Nadezhda,

Thanks a lot for your reply. That cleared up some things.

In-place transforms are faster than out-of-place? (sorry, I didn't quite understand your previous reply) How big is the speed difference?

It's difficult to say exactly without processor information. Please let me know what processor you use.

I'm mainly targeting the newer stuff: Core 2 and Woodcrest Xeons. However, we have a number of Athon64 X2, as well. I'm willing to spend the extra memory if I can get a 10% or bigger speed improvement by using out-of-place.

Otherwise small additional arrays with extra copiesareinevitable and there is no any benefit. Second we will have T_OUT(j,i) = OUT(i,j)for 2D FFT and T_OUT(k,i,j) = OUT(i,j,k) for 3D FFT. Sonot T_OUT(k,,j, i) and this fact means arbitrary permutation.

"Arbitrary" somehow sounds unpredictable while this isn't (as expected). So that would be a viable option. I think I'll open the request. Again, the extra memory might be worth it. Any clue what kind of speed improvement this might bring?

I'll start the implementation with MKL next week. So I might be able come up with some more questions... ;-)

Best,

JJ