MKL's FFT runs slower on a modern AWS Xeon instance than on a 12-year-old i5-2500k

klillevold · ‎11-10-2023

I have a simple program that performs DCT-II and real-to-complex FFT transforms on video inputs.

It is currently linked single-threaded: I found it ran slower with OpenMP FFT multi-threading than without, and the severe restrictions on multi-threading for FFT added unforeseen complications. It also proved impossible to multi-thread the DCT-II transforms (via the FFTW3 wrapper).

The transform lengths correspond to typical video widths and heights, for example 1920 and 1080.
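Neither of these lengths is a power of two, though both factor into small primes only, which is the kind of length that mixed-radix FFT implementations generally handle well. A quick check in Python (the helper `prime_factors` is written here purely for illustration and is not part of MKL or FFTW3):

```python
def prime_factors(n):
    """Return the prime factorization of n as a {prime: exponent} dict."""
    factors = {}
    p = 2
    while p * p <= n:
        # Divide out each prime completely before moving on.
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        # Whatever remains is itself prime.
        factors[n] = factors.get(n, 0) + 1
    return factors


if __name__ == "__main__":
    for length in (1920, 1080):
        print(length, prime_factors(length))
```

This gives 1920 = 2^7 · 3 · 5 and 1080 = 2^3 · 3^3 · 5.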

Currently the program runs twice as slow on these modern AWS instances with AVX-512 as on my MacBook M2 Pro in Rosetta emulation mode with SSE4.2, and 20% slower than on my venerable 12-year-old Intel i5-2500k processor with AVX.

Does anyone have advice they can offer on this situation?

This must be partly caused by the clock penalty in these multi-core Xeon processors, and I guess the next step would be to add multi-threading in my own program, in the layer above the transforms.

JilaniS_Intel · ‎11-13-2023

Hi,

Thanks for posting in Intel Communities.

Could you please share the performance figures you measured on each of the systems mentioned? We would also like to request a sample reproducer so that we can check the behavior on our end. Thank you.

Regards,

Jilani

klillevold · ‎11-15-2023

Thank you for your reply. I am working on multi-threading my application in the layer above Intel MKL functions.

After further consideration, I think the performance numbers are what can be expected, although they were surprising at first. MKL functions run super fast and appear incredibly well-optimized.

My old but still-going-strong i5-2500k runs at 4.5 GHz, while the AWS instances run at a much lower clock rate, so single-threaded performance suffers a significant penalty there.

I have one question:

When MKL reports "Intel(R) architecture processors" for AMD processors, which instruction set is being used? The message is the same on AWS (AMD Epyc) and my personal AMD Ryzen 5.

The application will be running on c7i.8xlarge (Intel Xeon). I will measure more carefully when multi-threading is completed.

klillevold · ‎11-17-2023

I finished multi-threading my app. It now spawns threads in the layer above the FFT (via MKL) and the DCT-II (via the FFTW3 wrapper), which enables it to work for non-power-of-2 transform sizes as well as float precision. I am seeing great threading performance up to around 8 threads, with marginal gains up to 16.
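For readers following along, the general shape of threading above the transform layer could look like the minimal Python sketch below. It is not the application's code: `dct2_row` is a naive stand-in for the real MKL/FFTW3 call, and `transform_rows`, the row-wise split, and the worker counts are assumptions made for illustration.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def dct2_row(row):
    """Naive O(n^2) unnormalized DCT-II; a stand-in for the library call."""
    n = len(row)
    return [
        sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(row))
        for k in range(n)
    ]


def transform_rows(frame, workers=8):
    """Transform every row of `frame`, spreading rows across worker threads.

    Each worker runs independent single-threaded transforms, so the
    transform routine itself needs no internal threading.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input rows.
        return list(pool.map(dct2_row, frame))


if __name__ == "__main__":
    # Small hypothetical "frame": 8 rows of 16 samples each.
    frame = [[float((r + c) % 7) for c in range(16)] for r in range(8)]
    out = transform_rows(frame, workers=4)
    print(len(out), len(out[0]))
```

Because the stand-in transform is pure Python, this sketch shows the structure rather than the speedup; the same layout with a native single-threaded transform call in each worker is the arrangement described above.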

Overall, the performance is great on Xeon AWS instances (as well as on my personal computers, both older Intel and newer AMD processors). I am still curious which instruction set MKL decides to use under the hood for AMD ("Intel(R) architecture processors"), but it really doesn't matter. Performance is great regardless.

This thread can be considered resolved.

JilaniS_Intel · ‎11-23-2023

Hi,

We're glad to hear that the issue was resolved. If you have any further queries or concerns in the future, please start a new thread. We will be happy to help you. Thank you.

Have a great day.

Regards,

Jilani