Intel® oneAPI Math Kernel Library

MKL DFT descriptor generation question

hello_world — Tue, 06 Aug 2013 17:05:52 GMT

Hi there,

I have a question about the DFTI descriptor.

The problem is a 1K x 1K array of complex numbers in row-major order. For each row of 1K elements, I would like to compute size-16 FFTs with stride 64. That is, I do not want to compute a size-1024 FFT, only size-16 FFTs.

For example, the 16 elements of one FFT are elements 0, 64, 128, 192, ..., 960; the elements of the next size-16 FFT are 1, 65, 129, ..., 961; and so on.

The same computation is applied to all 1K rows.
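To make the layout concrete, here is a minimal sketch of the index mapping described above: point k of FFT t in row r sits at flat index r*1024 + t + 64*k. The constant and function names are illustrative assumptions, not part of MKL.

```cpp
#include <cassert>
#include <cstddef>

// Layout from the question: 1024 rows of 1024 complex elements, row major.
// Each row holds 64 interleaved size-16 FFTs. All names here are illustrative.
constexpr std::size_t kRows       = 1024;                 // number of rows
constexpr std::size_t kRowSize    = 1024;                 // elements per row
constexpr std::size_t kFftSize    = 16;                   // points per FFT
constexpr std::size_t kStride     = 64;                   // gap between points of one FFT
constexpr std::size_t kFftsPerRow = kRowSize / kFftSize;  // 64 FFTs per row

// Flat index of point k (0..15) of FFT t (0..63) in row r (0..1023).
inline std::size_t element_index(std::size_t r, std::size_t t, std::size_t k) {
    assert(r < kRows && t < kFftsPerRow && k < kFftSize);
    return r * kRowSize + t + k * kStride;
}
```

Within a row, consecutive FFTs start one element apart and consecutive points of one FFT are 64 elements apart, which is what the stride and distance settings of a descriptor would need to express.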

I looked at the reference manual, but I am not sure whether a descriptor can express this.

Specifically, I don't know how to set arguments such as:

1) num_of_transforms, 2) stride, 3) dist.

Thanks!

Jing

SergeyKostrov — Wed, 07 Aug 2013 01:10:00 GMT

Please take a look at the MKL examples for the DftiComputeForward and DftiComputeBackward functions. There is also a thread about normalization issues with these functions: http://software.intel.com/en-us/forums/topic/402439.

Dmitry_B_Intel — Wed, 07 Aug 2013 01:52:00 GMT

Hi Jing,

The following lines should guide you to the desired computation:

[cpp]

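DFTI_DESCRIPTOR_HANDLE h;
MKL_LONG size = 16;              // length of each FFT
MKL_LONG strides[] = { 0, 64 };  // offset of the first element, then the element stride
MKL_LONG ntransforms = 64;       // 64 interleaved FFTs per row
MKL_LONG dist = 1;               // consecutive FFTs start one element apart

DftiCreateDescriptor(&h, ..., 1, size); // = I would like to compute size-16 FFT ("..." stands for precision and domain)
DftiSetValue(h, DFTI_INPUT_STRIDES, strides); // = with stride 64
DftiSetValue(h, DFTI_NUMBER_OF_TRANSFORMS, ntransforms); // compute 64 ffts of one row
DftiSetValue(h, DFTI_INPUT_DISTANCE, dist); // must be set when ntransforms > 1 (default is 0)
DftiCommitDescriptor(h);

// rowsize is the number of elements per row (1024 here); data points to the complex array
for (rowno = 0; rowno < 1024; ++rowno) DftiComputeForward(h, &data[rowno*rowsize]);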

[/cpp]

Thanks
Dima
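When checking that a descriptor is configured as intended, it can help to compare its output on one row against a slow reference. The following is a minimal sketch of such a reference in plain C++, not MKL code: a naive O(n^2) forward DFT whose parameters mirror the element stride, the distance between transforms, and the number of transforms. The function name and signature are assumptions made for illustration; the sketch assumes an unscaled forward transform with a negative exponent.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Naive in-place forward DFT over `howmany` interleaved sequences.
// Sequence t starts at data[t * dist]; its n points are `stride` elements apart.
// Hypothetical helper for verification only; it is far slower than an FFT.
void reference_strided_dft(cplx* data, std::size_t n, std::size_t stride,
                           std::size_t dist, std::size_t howmany) {
    assert(data != nullptr && n > 0 && stride > 0);
    const double two_pi = 6.283185307179586476925286766559;
    std::vector<cplx> out(n);
    for (std::size_t t = 0; t < howmany; ++t) {
        cplx* x = data + t * dist;
        for (std::size_t k = 0; k < n; ++k) {
            cplx sum(0.0, 0.0);
            for (std::size_t j = 0; j < n; ++j) {
                // Twiddle factor exp(-2*pi*i*j*k/n); reduce j*k mod n for accuracy.
                const double angle = -two_pi * static_cast<double>((j * k) % n)
                                     / static_cast<double>(n);
                sum += x[j * stride] * cplx(std::cos(angle), std::sin(angle));
            }
            out[k] = sum;
        }
        // Write results back to the same strided positions (in-place layout).
        for (std::size_t k = 0; k < n; ++k) x[k * stride] = out[k];
    }
}
```

For one row of the layout in this thread, the call would be `reference_strided_dft(row, 16, 64, 1, 64)`, and the result can be compared element by element with what DftiComputeForward produced for the same row.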

hello_world — Wed, 07 Aug 2013 02:20:28 GMT

Hi Dima,

Thanks for your reply. I had thought of that, but I expected the performance of the for loop to be really bad. I just ran the code following your guideline, and the performance is far worse than computing 1024*64 size-16 FFTs on contiguous memory (unit stride). Since the FLOP count is relatively small, I had hoped that batched execution could exploit memory and cache as well with strides (0, 64) as it does with strides (0, 1).

Do you have any suggestions to tune the performance?

Thanks!!

Jing
