My initial experiments offloading MKL FFTs to the MIC (using C on Linux) give me approximately 9.3 GFLOPS, judging by the [MIC Time] numbers reported when I set the environment variable OFFLOAD_REPORT to 1 (or 2). This is about 0.46% of the advertised peak performance of 2 TFLOPS. In fact, it is much less than that if I take into account the time for data movement in the offload section ([CPU Time] in the report).
Am I missing something?
I am curious to know if my numbers are way off or consistent with other benchmarks (I could not find any).
I would appreciate it if someone could point me to related information or share a different (or similar) experience.
The bottom line is that I hope there is something I can do to drastically improve the performance, but I have run out of ideas. Any help will be appreciated.
What type of FFT are you doing, and with what data sizes? Are you offloading individual FFTs one by one, or batched FFTs?
See this knowledge base article for getting good FFT performance on MIC: https://software.intel.com/en-us/articles/tuning-the-intel-mkl-dft-functions-performance-on-intel-xe...
And where did you get the figure of 2 TFLOPS for the peak performance of the Phi? As far as I know, the high-end Intel Xeon Phi 7120 with 61 cores has a peak performance of only about 1.2 TFLOPS. FFT is typically a memory-bound computation, so it is not reasonable to expect FFT performance to be close to the theoretical peak. If you run something like the Linpack benchmark or matrix-matrix multiplication, you will be able to get much closer to the peak.