Does MKL FFT perform badly on Xeon Phi (MIC) compared to a multicore Xeon server for sizes that are not powers of 2?

shiva_rama_krishna_b · 09-23-2014

Hi,

I have written an application in which the same code is run on a multi-core Xeon server CPU and on a Xeon Phi 5110P (in offload mode). With a 512*512 input array, I get a 2X speedup on the Xeon Phi compared to the CPU. But with a 448*448 input array, the CPU performs better than the Xeon Phi.

The FFT routine I am using takes two input arrays (a real array and an imaginary array) rather than a single complex array. Using a single complex array always performs better than using separate arrays (I confirmed this previously in this forum). So here is the question: is the Phi not optimized when sizes are not powers of 2?

What can be done to improve FFT performance on the Xeon Phi? I applied the optimizations suggested in the article (https://software.intel.com/en-us/articles/tuning-the-intel-mkl-dft-functions-performance-on-intel-xeon-phi-coprocessors), except padding, as I did not understand how to do it.

The skeleton of the code looks like this.

#pragma offload target(mic)
{
      #pragma omp parallel num_threads(236)
       {
            int threadNum = omp_get_thread_num();
            float* realarray = RealArray + threadNum*offsetSize;
            float* imaginaryarray = RealArray + threadNum*offsetSize;
            MKl_FFT(realarray,imaginaryarray);
       }
}

Here 236 FFT routines run simultaneously; each thread makes one FFT call.

The same code gets a 2X speedup on the Phi when the size is 512*512, but it performs worse than the CPU when the size is 448*448.

Could someone please give your valuable input on this?

Thanks

sivaramakrishna

TimP · 09-24-2014

Did you ensure 64-byte alignment of the arrays, e.g. by keeping offsetSize at an optimum value as well as allocating with alignment? Do you intend realarray and imaginaryarray to be the same?

I suppose the optimum number of threads may vary with size. KMP_PLACE_THREADS=59c,4t could take the place of num_threads(236), but you may need to tinker with both values.
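
The alignment check asked about above can be sketched in plain C. This is only an illustration under assumptions: the helper names (round_up_to_cacheline_floats, alloc_aligned_floats, is_cacheline_aligned) are hypothetical and not part of MKL or the original code, and posix_memalign stands in for whatever aligned allocator the application actually uses. The idea is that the base pointer is allocated on a 64-byte boundary and the per-thread offset is a multiple of 16 floats, so every thread's slice also starts on a 64-byte boundary.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHELINE_BYTES 64

/* Round a per-thread element count up to a whole number of cache lines,
 * so that base + threadNum * offset stays 64-byte aligned (hypothetical helper). */
static size_t round_up_to_cacheline_floats(size_t n_floats)
{
    const size_t per_line = CACHELINE_BYTES / sizeof(float); /* 16 floats */
    return (n_floats + per_line - 1) / per_line * per_line;
}

/* Allocate n_floats floats on a 64-byte boundary; returns NULL on failure.
 * The caller releases the memory with free(). */
static float *alloc_aligned_floats(size_t n_floats)
{
    void *p = NULL;
    assert(n_floats > 0);
    if (posix_memalign(&p, CACHELINE_BYTES, n_floats * sizeof(float)) != 0)
        return NULL;
    return (float *)p;
}

/* Non-zero if the pointer sits on a 64-byte boundary. */
static int is_cacheline_aligned(const void *p)
{
    return ((uintptr_t)p % CACHELINE_BYTES) == 0;
}
```

For a 448*448 float array the element count (200704) is already a multiple of 16, so in that case offsetSize itself does not break alignment as long as the base allocation is aligned.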

McCalpinJohn · 09-24-2014

I have not tested non-power-of-2 array sizes in the MKL FFT routines on Xeon Phi, but I typically see better throughput for memory-bandwidth-limited codes when using one thread per core rather than four threads per core.

Each of your 448x448 arrays takes about 1.53 MiB of memory, so you will spill out of the private L2 cache even when using one thread per core. If you do get better throughput with four threads per core than one thread, that would suggest that the MKL routine is not doing a good job of anticipating and hiding memory latency.

I would also worry about spilling out of L1 cache when running four threads. The row & column transforms on 448 elements require 3.5 KiB per vector. With four threads and two vectors per thread you would be using 28 KiB, which does not leave much room in the L1 Data Cache for the coefficients needed by the FFT algorithm.
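
The footprint figures above can be reproduced with a few lines of C. This is a sketch of the arithmetic only, assuming single-precision complex data (8 bytes per element); the function names are hypothetical. For comparison, the 5110P's caches are taken here to be 512 KiB of L2 and 32 KiB of L1 data cache per core, which is an assumption about the hardware rather than something stated in the thread.

```c
#include <assert.h>
#include <stddef.h>

enum { BYTES_PER_COMPLEX_FLOAT = 8 }; /* two 4-byte floats per element */

/* Bytes occupied by one n x n single-precision complex matrix. */
static size_t matrix_bytes(size_t n)
{
    return n * n * BYTES_PER_COMPLEX_FLOAT;
}

/* Bytes in one row or column vector of n complex elements. */
static size_t vector_bytes(size_t n)
{
    return n * BYTES_PER_COMPLEX_FLOAT;
}

/* L1 working set when each hardware thread on a core holds
 * vectors_per_thread vectors of length n. */
static size_t l1_working_set_bytes(size_t n, size_t threads_per_core,
                                   size_t vectors_per_thread)
{
    assert(threads_per_core >= 1 && threads_per_core <= 4);
    return threads_per_core * vectors_per_thread * vector_bytes(n);
}
```

With n = 448, matrix_bytes gives 1605632 bytes (about 1.53 MiB), vector_bytes gives 3584 bytes (3.5 KiB), and four threads with two vectors each come to 28672 bytes (28 KiB).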

If your data is stored contiguously, then both the N=448 and N=512 array sizes are "bad" for the padding rules (and 512 is a very bad leading dimension for most of the memory hierarchy -- 512 elements * 8 Bytes/element = 4096 Bytes, which maximizes conflicts in the L1 cache).
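
To make the conflict argument concrete, here is a small C sketch that counts how many distinct L1 sets a column walk touches for a given leading dimension. It assumes an L1 data cache of 32 KiB, 8-way set associative, with 64-byte lines (so 64 sets), 8-byte elements, and a line-aligned base address; the function name and the padded value 456 are illustrative assumptions, not values taken from the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed L1D geometry: 32 KiB, 8-way, 64-byte lines -> 64 sets. */
enum { LINE_BYTES = 64, L1_SETS = 64, ELEM_BYTES = 8 };

/* Count the distinct L1 sets touched when stepping down one column,
 * i.e. moving by ld_elems elements per step, for `rows` steps.
 * The base address is assumed to be cache-line aligned. */
static int distinct_sets_in_column(size_t ld_elems, size_t rows)
{
    int seen[L1_SETS] = {0};
    int count = 0;
    assert(ld_elems > 0);
    for (size_t j = 0; j < rows; ++j) {
        size_t byte_offset = j * ld_elems * ELEM_BYTES;
        size_t set = (byte_offset / LINE_BYTES) % L1_SETS;
        if (!seen[set]) {
            seen[set] = 1;
            ++count;
        }
    }
    return count;
}
```

Under these assumptions a leading dimension of 512 maps every column element to a single set, 448 spreads them over only 8 sets, and a padded value such as 456 uses all 64 sets.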

C is not the best language to use to understand padding, since it does not support 2D arrays in the same "native" way as Fortran, but the example code on the Intel web page you mentioned shows how to allocate the total size for a padded array. Then you just need to index your data inside this padded array using the correct leading dimension. For example, with N=448 and LEADING_DIM=56, a 2D reference to element i,j would be subarray(LEADING_DIM*j+i) rather than subarray(N*j+i). MKL uses the difference between the "size" parameter and the "DFTI_INPUT_STRIDES" and "DFTI_OUTPUT_STRIDES" parameters to handle this transformation internally.
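
The indexing rule described above can be sketched in C as follows. The helper names are hypothetical, the element type is a plain float for brevity, and the sketch only covers the application-side indexing; it does not call MKL. The leading dimension must be at least N so that rows do not overlap, and the value 456 mentioned in the usage note is an illustrative assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element (i, j), with i the fastest-varying index, in storage
 * whose consecutive j-slices are ld elements apart. */
static size_t padded_index(size_t i, size_t j, size_t ld)
{
    return ld * j + i;
}

/* Number of elements to allocate for n slices with leading dimension ld. */
static size_t padded_elems(size_t n, size_t ld)
{
    assert(ld >= n);
    return ld * n;
}

/* Copy a dense n x n array (leading dimension n) into padded storage
 * (leading dimension ld). The padding elements are left untouched. */
static void copy_to_padded(const float *dense, float *padded,
                           size_t n, size_t ld)
{
    assert(ld >= n);
    for (size_t j = 0; j < n; ++j)
        for (size_t i = 0; i < n; ++i)
            padded[padded_index(i, j, ld)] = dense[n * j + i];
}
```

For example, with N = 448 and a hypothetical leading dimension of 456, padded_elems(448, 456) asks for 204288 elements instead of 200704, and element (i, j) lives at offset 456*j + i.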

Evgueni_P_Intel · 09-26-2014

Dear shiva rama krishna bharadwaj I.,

The MKL implementation of Fourier transforms is optimized for lengths that are products of powers of 2, 3, 5, 7, 11, 13.

MKL Fourier transforms are less optimized for complex data stored as real array + imaginary array than for complex data stored in a single array.

For a single-precision complex 448x448 matrix stored in a single array, MKL Fourier transforms perform at the same level on Xeon and Xeon Phi.
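
If the application can afford a copy, one way to move from the split layout to a single array is to interleave the two inputs before the transform. The following is a minimal sketch with hypothetical helper names; it assumes the interleaved layout is real part followed by imaginary part for each element, which is the layout of MKL's MKL_Complex8 type. Whether the extra copy pays off depends on the workload and would need to be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Pack split real/imaginary arrays of n elements into one interleaved
 * array of 2*n floats: out[2k] = re[k], out[2k+1] = im[k]. */
static void interleave_complex(const float *re, const float *im,
                               float *out, size_t n)
{
    assert(re && im && out);
    for (size_t k = 0; k < n; ++k) {
        out[2 * k]     = re[k];
        out[2 * k + 1] = im[k];
    }
}

/* Inverse operation: unpack an interleaved array back into split arrays. */
static void deinterleave_complex(const float *in, float *re, float *im,
                                 size_t n)
{
    assert(in && re && im);
    for (size_t k = 0; k < n; ++k) {
        re[k] = in[2 * k];
        im[k] = in[2 * k + 1];
    }
}
```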

Please submit a request through Premier Support if you need optimizations for real array + imaginary array in MKL Fourier transforms on Xeon Phi.

BTW, there seems to be a typing error in your code snippet.

float* realarray = RealArray + threadNum*offsetSize;
float* imaginaryarray = RealArray + threadNum*offsetSize; // ImaginaryArray?

Thank you.

Evgueni.

shiva_rama_krishna_b · 11-05-2014

Thank you Evgueni.