Why does MPI impact the speed of MKL's DFT?

杨_栋_ · ‎04-19-2017

My code:

// -*- C++ -*-

# include <cmath>
# include <ctime>
# include <cstring>
# include <cstdio>

# include "mkl.h"

int main (int argc, char * argv[])
{
  MKL_LONG D[2] = {SIZE, SIZE};
  MKL_LONG C = COUNT;
  MKL_LONG ST[3] = {0, (D[1] * sizeof(double) + 63) / 64 * (64 / sizeof(double)), 1};
  MKL_LONG DI = D[0] * ST[1];
  MKL_LONG SI = D[0] * ST[1];
  double SC = 1.0 / std::sqrt((double)SI);
  struct timespec BE, EN;
  
  double*const Efft_r = (double*)_mm_malloc(sizeof(double) * SI * C * 2, 64);
  memset(Efft_r, 0, sizeof(double) * SI  * C * 2);
  double*const Efft_i = Efft_r + SI * C;

  Efft_r[0] = 1.0;

  clock_gettime (CLOCK_REALTIME, &BE);
  for (int i=0; i<LOOP; ++i)
    {
      MKL_LONG status;
      DFTI_DESCRIPTOR_HANDLE hand;
      DftiCreateDescriptor(&hand, DFTI_DOUBLE, DFTI_COMPLEX, 2, D);
      DftiSetValue(hand, DFTI_INPUT_STRIDES, ST);
      DftiSetValue(hand, DFTI_OUTPUT_STRIDES, ST);
      DftiSetValue(hand, DFTI_NUMBER_OF_TRANSFORMS, C);
      DftiSetValue(hand, DFTI_INPUT_DISTANCE, DI);
      DftiSetValue(hand, DFTI_COMPLEX_STORAGE, DFTI_REAL_REAL);
      DftiSetValue(hand, DFTI_FORWARD_SCALE, SC);
      DftiSetValue(hand, DFTI_BACKWARD_SCALE, SC);
      DftiSetValue(hand, DFTI_THREAD_LIMIT, 1);
      DftiSetValue(hand, DFTI_NUMBER_OF_USER_THREADS, 1);
      DftiCommitDescriptor(hand);
      __assume_aligned(Efft_r, 64);
      __assume_aligned(Efft_i, 64);
      DftiComputeForward(hand, Efft_r, Efft_i);
      DftiFreeDescriptor(&hand);
    }
  clock_gettime (CLOCK_REALTIME, &EN);
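  // MKL_LONG is not int, so cast to long long and print with %lld to match the format specifiers.
  printf("DFTI_COMPLEX_STORAGE: DFTI_REAL_REAL\nLOOP:   \t%d\nSIZE:   \t%lld X %lld\nSTRIDES:\t%lld %lld %lld\nNUMBER: \t%lld\nDISTANCE:\t%lld\n\t\t\t\t%.9fs\n",
	 LOOP,
	 (long long)D[0], (long long)D[1],
	 (long long)ST[0], (long long)ST[1], (long long)ST[2],
	 (long long)C,
	 (long long)DI,
	 double(EN.tv_sec-BE.tv_sec)+double(EN.tv_nsec-BE.tv_nsec)/1e9);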
  _mm_free(Efft_r);

  return 0;
}

This code was compiled with icpc using the flags "-mkl -DSIZE=4096 -DLOOP=1 -DCOUNT=3".
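A side note on the layout used in the code: the second entry of ST rounds each row of D[1] doubles up to a whole number of 64-byte lines, expressed in elements. A minimal sketch of that arithmetic follows. The helper name padded_row_stride is hypothetical and is not part of MKL.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not an MKL function): row stride in elements after
// padding each row up to a multiple of `alignment` bytes. It mirrors
// (D[1] * sizeof(double) + 63) / 64 * (64 / sizeof(double)) from the post.
inline long long padded_row_stride(long long cols, std::size_t elem_size, std::size_t alignment)
{
  // The formula only makes sense when whole elements fit into one aligned line.
  assert(cols > 0 && elem_size > 0 && alignment % elem_size == 0);
  const long long bytes = cols * static_cast<long long>(elem_size);
  const long long align = static_cast<long long>(alignment);
  const long long lines = (bytes + align - 1) / align;  // round up to whole lines
  return lines * static_cast<long long>(alignment / elem_size);
}
```

With SIZE=4096 a row is already a multiple of 64 bytes, so padded_row_stride(4096, sizeof(double), 64) stays 4096, which matches the reported STRIDES value. A row of 1001 doubles would be padded to 1008.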

When I run this program without MPI, the output is below:

$ ./a.out
DFTI_COMPLEX_STORAGE: DFTI_REAL_REAL
LOOP:   	1
SIZE:   	4096 X 4096
STRIDES:	0 4096 1
NUMBER: 	3
DISTANCE:	16777216
				0.322017125s

When I run the same program with MPI, the output is below:

$ mpirun -n 1 ./a.out
DFTI_COMPLEX_STORAGE: DFTI_REAL_REAL
LOOP:   	1
SIZE:   	4096 X 4096
STRIDES:	0 4096 1
NUMBER: 	3
DISTANCE:	16777216
				1.606980538s

The program runs much faster without MPI than with MPI. I have tried different values of SIZE, but the results are similar.

I do not know why. If I must use MPI, is there any way to keep the speed of MKL?

Jing_Xu · ‎04-20-2017

We are investigating. We will get back to you.

Jing_Xu · ‎04-20-2017

Did you use https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor to get the compiling and linking switches?

Zhen_Z_Intel · ‎04-20-2017

Hi YangDong,

I am afraid you are not using the cluster FFT compute and descriptor configuration functions, so the data would not be distributed correctly for the calculation. For your implementation you probably need to use 'DftiComputeForwardDM' and 'DftiSetValueDM'.

Another point: I am not sure whether your code is thread safe. If every node can modify the timing, the time you print is not the time for the main node alone but the calculation time for all nodes. I recommend using the MPI interface (MPI_Wtime) to measure elapsed time.

Best regards,
Fiona
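The timing advice above can be sketched without MPI. In an MPI build, MPI_Wtime() would be called at the same two points. The version below uses std::chrono::steady_clock as a stand-in so that it compiles without an MPI installation; that substitution is an assumption of this sketch, not something suggested in the thread, and the helper name time_seconds is hypothetical.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: run `work` `repeats` times and return the elapsed
// wall time in seconds. steady_clock is monotonic, whereas CLOCK_REALTIME
// (used in the original code) can be adjusted while the program runs.
template <typename Work>
double time_seconds(Work&& work, int repeats = 1)
{
  assert(repeats > 0);
  const auto begin = std::chrono::steady_clock::now();
  for (int i = 0; i < repeats; ++i)
    work();
  const auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(end - begin).count();
}
```

The body of the LOOP iteration in the original program could be passed as the callable, for example time_seconds([&] { /* create, commit, compute, free */ }, LOOP). Each process then reports only its own elapsed time.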

杨_栋_ · ‎04-28-2017

Jing X. (Intel) wrote:

Did you use https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor to get the compiling and linking switches?

Thank you!

I have solved this problem. I added MPI_Init() and MPI_Finalize() to the code and compiled it with mpiicpc. Now the program runs as fast with MPI as it does without MPI.

杨_栋_ · ‎04-28-2017

Fiona Z. (Intel) wrote:

Hi YangDong,

I am afraid you are not using the cluster FFT compute and descriptor configuration functions, so the data would not be distributed correctly for the calculation. For your implementation you probably need to use 'DftiComputeForwardDM' and 'DftiSetValueDM'.

Another point: I am not sure whether your code is thread safe. If every node can modify the timing, the time you print is not the time for the main node alone but the calculation time for all nodes. I recommend using the MPI interface (MPI_Wtime) to measure elapsed time.

Best regards,
Fiona

Hi Fiona,

Thanks for the warning.

Each node has its own private descriptor handle (hand) and memory space. If every node runs independently at the same time, is that thread safe?

Ying_H_Intel · ‎05-15-2017

Hi YangDong,

Your FFT code looks fine; it performs a 2D complex-to-complex FFT on a single machine.

The program runs much faster without MPI because the MKL FFT is multi-threaded with OpenMP, and when you launch it through mpirun, the OpenMP threads are ignored by default. To get the same performance as the run without MPI, you may try:

> export OMP_NUM_THREADS=xx (your number of physical cores)

> mpirun -n 1 ./a.out

Here is what I ran:

[yhu5_new@hsw-ep01 FFT]$ export OMP_NUM_THREADS=36
[yhu5_new@hsw-ep01 FFT]$ ./a.out
DFTI_COMPLEX_STORAGE: DFTI_REAL_REAL
LOOP:           1
SIZE:           4096 X 4096
STRIDES:        0 4096 1
NUMBER:         3
DISTANCE:       16777216
                                0.141292946s
[yhu5_new@hsw-ep01 FFT]$ mpirun -n 1 ./a.out
DFTI_COMPLEX_STORAGE: DFTI_REAL_REAL
LOOP:           1
SIZE:           4096 X 4096
STRIDES:        0 4096 1
NUMBER:         3
DISTANCE:       16777216
                                0.145391448s
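To check whether the exported value actually reaches the process when it is started through mpirun, the program can read the variable itself at startup. The following is a minimal sketch using only the C++ standard library; the helper name env_thread_count is hypothetical. MKL also provides mkl_get_max_threads() to report the thread count it intends to use, but that call is left out here so the sketch builds without MKL.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: read a positive integer from the environment variable
// `name`; return `fallback` when it is unset or does not start with a
// positive number. For a list such as "4,2" only the first entry is read.
inline int env_thread_count(const char* name, int fallback)
{
  assert(name != nullptr);
  const char* text = std::getenv(name);
  if (text == nullptr)
    return fallback;
  char* end = nullptr;
  const long value = std::strtol(text, &end, 10);
  if (end == text || value <= 0)
    return fallback;
  return static_cast<int>(value);
}
```

Printing env_thread_count("OMP_NUM_THREADS", 1) at the top of main, once under ./a.out and once under mpirun -n 1 ./a.out, would show whether both launches see the same setting.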

As I understand it, your code is not an MPI program, so it does not actually need to be run through mpirun. For performance tips, you may refer to the MKL user guide or the article https://software.intel.com/en-us/articles/tuning-the-intel-mkl-dft-functions-performance-on-intel-xeon-phi-coprocessors (which is written for Xeon Phi, but the concepts are the same for other processors).

If you would like to use MPI, then you need to write an MPI program and call the Cluster MKL FFT with a huge FFT size across multiple nodes. You can find the cluster FFT sample code under the MKL install folder: examples_cluster_c.tgz
Unzip it and see cdftc/source/dm_complex_2d_double_ex1.c

Please refer to the MKL user guide for more details: https://software.intel.com/en-us/mkl-macos-developer-guide-linking-with-intel-mkl-cluster-software

Best Regards,

Ying

SergeyKostrov · 05-17-2017

>> ...If I must use MPI, is there any way to keep the speed of MKL?...

To improve processing performance, you can consider:
- Use the scatter attribute for the KMP_AFFINITY environment variable
- Place data sets into MCDRAM memory instead of DDR4 if a KNL system is used (for Flat or Hybrid MCDRAM modes)

The speed-up could be significant; here are two examples:

SergeyKostrov · ‎05-17-2017

///////////////////////////////////////////////////////////////////////////////
// 16384 x 16384 - Processing using DDR4

 Strassen HBI
 Matrix Size           : 16384 x 16384
 Matrix Size Threshold :  8192 x  8192
 Matrix Partitions     :     8
 Degree of Recursion   :     1
 Result Sets Reflection: N/A
 Calculating...
 Strassen HBI - Pass 01 - Completed:     6.97700 secs
 Strassen HBI - Pass 02 - Completed:     6.71200 secs
 Strassen HBI - Pass 03 - Completed:     6.30400 secs
 Strassen HBI - Pass 04 - Completed:     6.28600 secs
 Strassen HBI - Pass 05 - Completed:     6.35500 secs
 ALGORITHM_STRASSENHBI - Passed

///////////////////////////////////////////////////////////////////////////////
// 16384 x 16384 - Processing using MCDRAM

 Strassen HBI
 Matrix Size           : 16384 x 16384
 Matrix Size Threshold :  8192 x  8192
 Matrix Partitions     :     8
 Degree of Recursion   :     1
 Result Sets Reflection: N/A
 Calculating...
 Strassen HBI - Pass 01 - Completed:     4.88600 secs
 Strassen HBI - Pass 02 - Completed:     4.27700 secs
 Strassen HBI - Pass 05 - Completed:     4.24900 secs
 Strassen HBI - Pass 03 - Completed:     4.24000 secs
 Strassen HBI - Pass 04 - Completed:     4.24800 secs
 ALGORITHM_STRASSENHBI - Passed

SergeyKostrov · ‎05-17-2017

///////////////////////////////////////////////////////////////////////////////
// 16384 x 16384 - Processing using DDR4

 Strassen HBC
 Matrix Size           : 16384 x 16384
 Matrix Size Threshold :  8192 x  8192
 Matrix Partitions     :     8
 Degree of Recursion   :     1
 Result Sets Reflection: Disabled
 Calculating...
 Strassen HBC - Pass 01 - Completed:     6.92900 secs
 Strassen HBC - Pass 02 - Completed:     6.80300 secs
 Strassen HBC - Pass 03 - Completed:     6.76300 secs
 Strassen HBC - Pass 04 - Completed:     6.84800 secs
 Strassen HBC - Pass 05 - Completed:     6.78500 secs
 ALGORITHM_STRASSENHBC - 1 - Passed

///////////////////////////////////////////////////////////////////////////////
// 16384 x 16384 - Processing using MCDRAM

 Strassen HBC
 Matrix Size           : 16384 x 16384
 Matrix Size Threshold :  8192 x  8192
 Matrix Partitions     :     8
 Degree of Recursion   :     1
 Result Sets Reflection: Disabled
 Calculating...
 Strassen HBC - Pass 01 - Completed:     5.03100 secs
 Strassen HBC - Pass 03 - Completed:     4.96100 secs
 Strassen HBC - Pass 05 - Completed:     4.94200 secs
 Strassen HBC - Pass 03 - Completed:     4.96200 secs
 Strassen HBC - Pass 04 - Completed:     4.95400 secs
 ALGORITHM_STRASSENHBC - 1 - Passed