topic Re: FFTW3 wrapper gains no speedup from multi-threaded linking, convert to native MKL? in Intel® oneAPI Math Kernel Library

FFTW3 wrapper gains no speedup from multi-threaded linking, convert to native MKL?

klillevold — Thu, 02 Nov 2023 09:24:40 GMT

I have been using the FFTW3 wrapper code to implement DCT and DFT transforms in my code and it works great. Until recently I linked with the sequential library, so mkl_get_max_threads() naturally returns 1.

Now I have tried to link with the threaded library (TBB), and mkl_get_max_threads() returns the correct number of cores on my test systems. I have tried an AMD Ryzen 5 3600 (6 cores), an AWS instance (16 cores), and an M2 MacBook Pro (8 cores).

However, there is no improvement in speed, and looking at the system load, it appears my program is utilizing only one thread.

So I surmise the FFTW3 MKL wrapper is not able to take advantage of multi-threading?

If I convert my code to use native Intel MKL DCT and DFT functions instead of the FFTW3 wrappers, will there be any advantage to be gained from multi-threaded linking?

Re: FFTW3 wrapper gains no speedup from multi-threaded linking, convert to native MKL?

klillevold — Sat, 04 Nov 2023 09:00:44 GMT

Further information: I am using 1-D transforms of size up to 3840, specifically fftwf_plan_r2r_1d() and fftwf_plan_dft_r2c_1d(). Test systems now also include an Intel processor.

Since the transforms are 1-dimensional and relatively small, I understand it might not be possible to run those transforms multi-threaded. I will have to implement threading in my own program and call the transforms in parallel. I will read up on thread safety for MKL and see if this is possible. Since these transforms are independent, this approach seems doable.
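A minimal sketch of that application-level approach, under stated assumptions: the rows are stored back to back, each row is transformed in place, and the per-row call is safe to invoke from several threads at once (this still has to be checked against the MKL thread-safety documentation). The names `row_transform`, `wht_in_place`, and `transform_rows` are hypothetical, and the Walsh-Hadamard kernel is only a stand-in for the real library call so that the example builds without MKL or FFTW.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical signature for one in-place row transform. In the real
 * program this would wrap the library's execute call for one row. */
typedef void (*row_transform)(float *row, size_t n);

/* Stand-in kernel (NOT an FFT): unnormalized fast Walsh-Hadamard
 * transform. It only needs additions, so no math library is required.
 * n must be a power of two. */
void wht_in_place(float *row, size_t n)
{
    assert(n != 0 && (n & (n - 1)) == 0);
    for (size_t h = 1; h < n; h *= 2) {
        for (size_t i = 0; i < n; i += 2 * h) {
            for (size_t j = i; j < i + h; ++j) {
                float a = row[j];
                float b = row[j + h];
                row[j] = a + b;
                row[j + h] = a - b;
            }
        }
    }
}

/* Apply fn to `count` independent rows of length n stored contiguously.
 * With OpenMP enabled (e.g. -fopenmp) the rows are split across threads;
 * without it the pragma is skipped and the loop runs sequentially. */
void transform_rows(row_transform fn, float *data, size_t n, size_t count)
{
    long rows = (long)count;
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (long r = 0; r < rows; ++r) {
        fn(data + (size_t)r * n, n);
    }
}
```

A call such as `transform_rows(wht_in_place, data, 4, 3)` transforms three rows of length 4. Because each row touches only its own memory, no locking is needed between rows. When the program itself creates the threads like this, linking against the sequential library is the usual choice, so that the library does not start threads of its own inside each call.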

Re: FFTW3 wrapper gains no speedup from multi-threaded linking, convert to native MKL?

JilaniS_Intel — Tue, 07 Nov 2023 07:25:39 GMT

Hi,

Thanks for posting in Intel Communities.

We're glad to hear that the issue was resolved. If you have any further queries or concerns in the future, please raise a new thread. We will be happy to help you. Thank you.

Regards,

Jilani

Re: FFTW3 wrapper gains no speedup from multi-threaded linking, convert to native MKL?

klillevold — Tue, 07 Nov 2023 15:30:45 GMT

[deleted]

Re: FFTW3 wrapper gains no speedup from multi-threaded linking, convert to native MKL?

klillevold — Tue, 07 Nov 2023 19:04:03 GMT

I apologize for deleting and then re-entering. I wanted to add more details. The issue has not been resolved.

I switched to native MKL calls and created a DFTI descriptor handle that performs a batch of transforms, for example 100 transforms of length 1440 each.

I called DftiCreateDescriptor with single precision (float), the complex domain, and one dimension. I set the parameters appropriately, including DFTI_NUMBER_OF_TRANSFORMS to 100. Calling the forward transform once on this descriptor now gives exactly the same numeric output as calling a single transform 100 times sequentially.

Those transforms are independent and could potentially be run in parallel, yet I see that the process does not use any more threads than when linked with the sequential library, and the execution speed on a multi-core system is exactly the same.

Re: FFTW3 wrapper gains no speedup from multi-threaded linking, convert to native MKL?

klillevold — Wed, 08 Nov 2023 18:16:10 GMT

I figured out the problem after I finally found the right documentation.

https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-linux/2023-1/openmp-threaded-functions-and-problems.html#FFT

Multi-threading for FFT is only available under very limited conditions.

For example, the transform length has to be 2^N with N > 9, and one has to use double instead of single precision.
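As a rough illustration, the example condition quoted above can be written as a small check. This is only a sketch of the condition as stated in this post. The linked documentation lists the full set of conditions, and the function names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True when n == 2^N for some N > 9, i.e. n is a power of two
 * and at least 1024. */
bool is_pow2_with_log2_above_9(size_t n)
{
    bool power_of_two = (n != 0) && ((n & (n - 1)) == 0);
    return power_of_two && n > 512;
}

/* The example condition from the post: double precision and a
 * transform length of 2^N with N > 9. */
bool meets_posted_example(size_t n, bool double_precision)
{
    return double_precision && is_pow2_with_log2_above_9(n);
}

/* Sizes mentioned in this thread. */
void check_posted_sizes(void)
{
    assert(!meets_posted_example(1440, true));   /* not a power of two */
    assert(!meets_posted_example(3840, true));   /* not a power of two */
    assert(!meets_posted_example(2048, false));  /* single precision */
    assert(meets_posted_example(2048, true));    /* 2^11, double */
}
```

By this check, the sizes 1440 and 3840 used earlier in the thread do not qualify in either precision, which matches the lack of threading observed with them.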

I created a test video with a resolution of 2048x2048, linked with OpenMP instead of TBB, and switched from float to double. This means that I run 512 complex-to-complex transforms of length 2048 per image of the video.

I can now see that threads are created, and on my 6 and 8-core test systems, I can see that all cores are fully utilized when I run my program.

However, it runs slightly slower than when using a single thread only. It is, therefore, more effective to let it run single-threaded, and leave the under-utilized cores available for other tasks. It will also use less memory, and I don't have to worry about extending the transform lengths from normal video sizes.

Re: FFTW3 wrapper gains no speedup from multi-threaded linking, convert to native MKL?

JilaniS_Intel — Mon, 13 Nov 2023 16:33:13 GMT

Hi,

Thank you for your response.

Based on your prior response, we understand that your issue has been resolved. Could you please confirm this? Thank you.

Regards,

Jilani

Re: FFTW3 wrapper gains no speedup from multi-threaded linking, convert to native MKL?

JilaniS_Intel — Mon, 20 Nov 2023 11:11:20 GMT

Hi,

A gentle reminder:

We haven't received any updates from you. Based on your previous response, it appears that your issue has been resolved. Could you kindly confirm this for us?

Regards,

Jilani

Re: FFTW3 wrapper gains no speedup from multi-threaded linking, convert to native MKL?

klillevold — Mon, 20 Nov 2023 16:09:39 GMT

Thanks - consider it resolved.

Re: FFTW3 wrapper gains no speedup from multi-threaded linking, convert to native MKL?

JilaniS_Intel — Wed, 22 Nov 2023 05:13:04 GMT

Hi,

Thanks for the confirmation.

It’s great to know that the issue has been resolved. If you run into any other issues, please feel free to create a new thread.

Regards,

Jilani