Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

How can I increase the efficiency of MKL on MIC?

王_子_
Beginner

I have been using MKL on a Xeon Phi (MIC) coprocessor like this:

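	// Offload the 2-D backward FFT to coprocessor 1: `in` is copied to the card, `out` is copied back.
	#pragma offload target(mic:1) in(in:length(nx*ny))\
				      out(out:length(nx*ny))
	{
		// Build the plan, execute it once, then release it.
		fftwf_plan temp= fftwf_plan_dft_2d(nx,ny,in, out, FFTW_BACKWARD, FFTW_ESTIMATE);
		fftwf_execute(temp);
		fftwf_destroy_plan(temp);
	}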

But I find that this code takes almost the same time as the FFTW library running on the CPU.

I find this surprising. How can I increase the efficiency of MKL on MIC?

Thanks!

3 Replies
McCalpinJohn
Honored Contributor III

The FFT algorithm does not have enough computation per data element to justify the time required to transfer the data across the PCIe interface to/from the Xeon Phi. 

It is easy to estimate an upper bound on the "effective" GFLOPS rate for an offloaded FFT. Assume that one array is transferred to the Xeon Phi at 6 GB/s, then the computation is done instantaneously, then the array is transferred back to the host at 6 GB/s. For a 2^20 (1 Mi) element double-complex FFT, the time required is (2 transfers * 16 bytes/element * 1024*1024 elements)/(6 GB/s) = 0.0056 seconds. The nominal operation count for this FFT is 5*N*log2(N) = 100*2^20 = 104857600 operations. The corresponding rate is therefore about 0.105 billion operations in 0.0056 seconds, or about 18.7 GFLOPS. Any actual execution time on the Xeon Phi would increase this time and decrease the corresponding rate. Transfer of any additional arrays in or out of the Xeon Phi would also increase the execution time and decrease the corresponding rate.
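The estimate above can be reproduced with a few lines of C. This is only a sketch of the arithmetic in the post: the function names are made up for illustration, the FFT length is passed as its base-2 logarithm, and the 6 GB/s PCIe rate is the assumed figure from the post, not a measured one.

```c
/* Seconds spent moving one array to the coprocessor and back,
   assuming the FFT itself takes zero time. */
double transfer_seconds(int log2_n, double bytes_per_element,
                        double pcie_bytes_per_s)
{
    double n = (double)(1UL << log2_n);          /* N = 2^log2_n elements */
    return 2.0 * bytes_per_element * n / pcie_bytes_per_s;
}

/* Nominal FFT operation count, 5 * N * log2(N). */
double fft_ops(int log2_n)
{
    double n = (double)(1UL << log2_n);
    return 5.0 * n * (double)log2_n;
}

/* Upper bound on the effective GFLOPS of an offloaded FFT:
   nominal operations divided by the transfer time alone. */
double offload_gflops_bound(int log2_n, double bytes_per_element,
                            double pcie_bytes_per_s)
{
    return fft_ops(log2_n)
         / transfer_seconds(log2_n, bytes_per_element, pcie_bytes_per_s)
         / 1e9;
}
```

Calling `offload_gflops_bound(20, 16.0, 6e9)` gives the double-complex case discussed above. Note that N cancels out of the ratio, so under this model the bound depends only on log2(N), the element size, and the PCIe rate.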

BUT, if the data can be kept on the Xeon Phi for many consecutive calculations, then the MKL FFTs can deliver high performance for problem sizes that have enough parallelism to keep the many cores busy.  I have seen well over 100 GFLOPS for single-complex FFTs of length 2^20.
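One way to keep data resident on the coprocessor is the `alloc_if`/`free_if`/`nocopy` clauses of the Intel compiler's offload pragmas. The following is a minimal sketch, not code from this thread: `run_resident` is a hypothetical name, and a simple scaling loop stands in for the FFT so that the data movement pattern is the only thing being shown. The array is copied to the card once, reused across several offloads without any transfer, and copied back once at the end. It assumes an Intel compiler with offload support; other compilers ignore the pragmas and run everything on the host.

```c
/* Run `iterations` passes over `data` while it stays on the coprocessor. */
void run_resident(float *data, int n, int iterations)
{
    /* Allocate the buffer on the card, copy it in once, and keep it alive. */
    #pragma offload_transfer target(mic:1) \
            in(data : length(n) alloc_if(1) free_if(0))

    for (int it = 0; it < iterations; ++it) {
        /* Reuse the buffer already on the card: no PCIe transfer here. */
        #pragma offload target(mic:1) \
                nocopy(data : length(n) alloc_if(0) free_if(0))
        {
            /* Stand-in for the real work (e.g. executing an FFT plan). */
            for (int i = 0; i < n; ++i)
                data[i] *= 0.5f;
        }
    }

    /* Copy the result back once and free the buffer on the card. */
    #pragma offload_transfer target(mic:1) \
            out(data : length(n) alloc_if(0) free_if(1))
}
```

With this pattern the two PCIe transfers are paid once rather than once per computation, which is the situation the upper-bound estimate does not cover.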

王_子_
Beginner

Many thanks for your reply; I have learned a lot. But I have a question about the parallelism of the MKL FFT.

I cannot explicitly assign a number of cores to the FFT, because an OpenMP directive has to be attached to a for loop.

When I use offload code like the example above, does the FFT run on just one core of the MIC, or does it use all the free cores?

Thanks again and kind regards.

Gennady_F_Intel
Moderator

You can experiment with the MIC_OMP_NUM_THREADS environment variable to use anywhere from 1 to all 240 of the KNC's threads.
