If I set DFTI_NUMBER_OF_TRANSFORMS to 4 on a AVX computer, or 8 on a AVX-512 KNL, will MKL's DftiComputeForward/Backward compute the FFT's of similar but independant, non-overlapping arrays simultaneously in SIMD or sequentially one after the other?
There's no direct relationship between value of DFTI_NUMBER_OF_TRANSFORMS and SIMD. The DFTI_NUMBER_OF_TRANSFORMS is actually for performing a number of FFT transforms with a single call. It is similar to writing a for loop to perform FFT backward/forward N times.
MKL FFT supports configuration setting variables to control parallel processing. You could use DFTI_THREAD_LIMIT to set parallel or sequential for each transform of single call methods (DFTI_NUMBER_OF_TRANSFORMS>1) when MKL is parallel mode.
By default, the FFT processing is parallel for large size, but sequential for small transform. If you are using a bunch of small transforms, each FFT transform would be sequential. But if you are using a bunch of large transform and DFTI_THREAD_LIMIT!=1, each transform would be parallel.
Thanks for the explanaition! I'm wondering if, for a number of small single precision FT's e.g. 24x24, it would be more efficient or less to run them in parallel in SIMD (in each thread), in particular with AVX-512. Has this been investigated by Intel?
If you are using a bunch of small transforms, where function call overhead comprises a noticeable part of the transform time, doing the bunch within a single call by DFTI_NUMBER_OF_TRANSFORMS probably would be more efficient. Thanks.