I have been using MKL on a Xeon Phi (MIC) coprocessor like this:
#pragma offload target(mic:1) in(in:length(nx*ny)) \
                              out(out:length(nx*ny))
{
    fftwf_plan temp = fftwf_plan_dft_2d(nx, ny, in, out, FFTW_BACKWARD, FFTW_ESTIMATE);
    fftwf_execute(temp);
}
but I find that this code takes almost the same time as the FFTW library on the CPU.
I find this surprising. How can I increase the efficiency of MKL on the MIC?
Thanks!
The FFT algorithm does not have enough computation per data element to justify the time required to transfer the data across the PCIe interface to/from the Xeon Phi.
It is easy to estimate an upper bound on the "effective" GFLOPS rate for an offloaded FFT. Assume that one array is transferred to the Xeon Phi at 6 GB/s, then the computation is done instantaneously, then the array is transferred back to the host at 6 GB/s. For a 2^20 (1 Mi) element doublecomplex FFT, the time required is (2 transfers * 16 bytes/element * 1024*1024 elements)/(6 GB/s) = 0.0056 seconds. The nominal operation count for this FFT is 5*N*log(N,2)=100*2^20=104857600 operations. The corresponding rate is therefore 0.1048 billion operations in 0.0056 seconds, or 18.7 GFLOPS. Any actual execution time on the Xeon Phi would increase this time and decrease the corresponding rate. Transfer of any additional arrays in or out of the Xeon Phi would also increase the execution time and decrease the corresponding rate.
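The estimate above can be reproduced with a few lines of C. This is only a sketch of the arithmetic, not a measurement: the function names are made up for illustration, 6 GB/s is taken as 6e9 bytes/s, the FFT length is assumed to be a power of two, and the compute time on the coprocessor is assumed to be zero, as in the text.

```c
#include <assert.h>

/* Nominal operation count for a complex FFT of N = 2^log2_n points,
   using the conventional 5 * N * log2(N) figure. */
double fft_nominal_flops(unsigned log2_n)
{
    assert(log2_n >= 1 && log2_n < 63);
    double n = (double)(1ULL << log2_n);
    return 5.0 * n * (double)log2_n;
}

/* Seconds needed to copy the array to the coprocessor and back
   (two transfers) at the assumed PCIe rate. */
double round_trip_seconds(unsigned log2_n, double bytes_per_elem,
                          double link_bytes_per_sec)
{
    assert(log2_n >= 1 && log2_n < 63);
    assert(link_bytes_per_sec > 0.0);
    double n = (double)(1ULL << log2_n);
    return 2.0 * bytes_per_elem * n / link_bytes_per_sec;
}

/* Upper bound on the "effective" GFLOPS of one offloaded FFT,
   assuming the computation itself takes no time at all. */
double offload_gflops_bound(unsigned log2_n, double bytes_per_elem,
                            double link_bytes_per_sec)
{
    double t = round_trip_seconds(log2_n, bytes_per_elem, link_bytes_per_sec);
    return fft_nominal_flops(log2_n) / t / 1e9;
}
```

Calling offload_gflops_bound(20, 16.0, 6e9) gives 18.75 for the double-complex case; the 18.7 quoted above comes from rounding the transfer time to 0.0056 seconds first. The same call with 8.0 bytes per element gives 37.5 for single-complex data.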
BUT, if the data can be kept on the Xeon Phi for many consecutive calculations, then the MKL FFTs can deliver high performance for problem sizes that have enough parallelism to keep the many cores busy. I have seen well over 100 GFLOPS for singlecomplex FFTs of length 2^20.
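To see how much keeping the data on the card helps, the same kind of model can be extended to several FFTs per round trip. Again this is a rough sketch under stated assumptions: the function name is hypothetical, the 100 GFLOPS on-card rate is an illustrative round number (slightly below the figure mentioned above), the PCIe rate is 6e9 bytes/s, and transfers are not overlapped with computation.

```c
#include <assert.h>

/* Effective GFLOPS when `ffts_per_round_trip` transforms of N = 2^log2_n
   points run on the coprocessor between one copy-in and one copy-out.
   `device_gflops` is the assumed on-card FFT rate, not a measured value. */
double amortized_gflops(unsigned log2_n, double bytes_per_elem,
                        double link_bytes_per_sec, double device_gflops,
                        unsigned ffts_per_round_trip)
{
    assert(log2_n >= 1 && log2_n < 63);
    assert(link_bytes_per_sec > 0.0 && device_gflops > 0.0);
    assert(ffts_per_round_trip >= 1);

    double n = (double)(1ULL << log2_n);
    double flops_per_fft = 5.0 * n * (double)log2_n;      /* nominal count */
    double total_flops = (double)ffts_per_round_trip * flops_per_fft;

    double transfer_s = 2.0 * bytes_per_elem * n / link_bytes_per_sec;
    double compute_s = total_flops / (device_gflops * 1e9);

    return total_flops / (transfer_s + compute_s) / 1e9;
}
```

For a single-complex FFT of length 2^20 (8 bytes per element), amortized_gflops(20, 8.0, 6e9, 100.0, 1) comes out near 27.3, while the same call with 100 FFTs per round trip comes out near 97.4, so the transfer cost is almost fully hidden once the data stays resident.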
In reply to John McCalpin's post above:
Many thanks for your reply; I have learned a lot. But I have a question about the parallelism of the MKL FFT.
I cannot explicitly assign a number of cores to the FFT, because an OpenMP directive has to be attached to a for loop.
When I use offload code like the snippet in my first post, does the FFT use just one core on the MIC, or does it use all the free cores?
Thanks again and kind regards.
You can experiment with the MIC_OMP_NUM_THREADS environment variable to use anywhere from 1 to all 240 of the KNC's threads.
