Intel® oneAPI Math Kernel Library

Performance of offloaded MKL FFTs on the MIC, anyone?


My initial experiments offloading MKL FFTs to the MIC (using C on Linux) give me approximately 9.3 GFLOPS, judging by the [MIC Time] numbers reported when I set the environment variable OFFLOAD_REPORT to 1 (or 2). This is about 0.46% of the advertised peak performance of 2 TFLOPS. In fact, the effective rate is much lower still if I take into account the time for data movement inside the offload section ([CPU Time] in the report).

Am I missing something?

I am curious to know if my numbers are way off or consistent with other benchmarks (I could not find any).

I would appreciate it if someone could point me to related information or to know if someone had a different (or similar) experience.

The bottom line is that I hope there is something I can do to drastically improve the performance, but I have run out of ideas. Any help will be appreciated.




1 Reply

What type of FFT are you doing, and what are the data sizes? Are you offloading individual FFTs one by one, or batched FFTs?

See this knowledge base article for getting good FFT performance on MIC:

And where did you get the information that the peak performance of the Phi is 2 TFLOPS? As far as I know, the high-end Intel Xeon Phi 7120 with 61 cores has a peak performance of only about 1.2 TFLOPS. FFT is typically a memory-bound computation, so it is not reasonable to expect FFT performance to be close to the theoretical peak. If you run something like the Linpack benchmark or matrix-matrix multiplication, you will be able to get much closer to the peak.