Intel® oneAPI Math Kernel Library

Comparing FFT Performance MKL11 with 1 thread and 4 threads

Marian_L_
Beginner

Hi All,

I'm evaluating the performance (this time not MKL6 vs MKL11) of MKL11 with 1 thread versus 4 threads.

The 4-thread version seems to be slower. Furthermore, the 4-thread run shows a huge number of outliers. Does anyone have an explanation for this?

Below is the source (the float and double versions are similar); I shortened it for a better overview.

Main function:

int _tmain(int argc, _TCHAR* argv[])
{
   int threads = 4; //or 1
   mkl_set_num_threads(threads);

   SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS ); // Set a process priority to 'High'

   TEST FUNCTION HERE

   SetPriorityClass( GetCurrentProcess(), NORMAL_PRIORITY_CLASS ); // Restore the process priority to 'Normal'
}

TEST FUNCTION

  DFTI_DESCRIPTOR_HANDLE hand;
   cxdTimeLoops.alloc(loops);

   //    FLOAT
   k=0;
   for (exp=exp_start;exp<=exp_stop;exp++)
   {
      Nfft = (unsigned int) pow(2.0,exp);
     
      myRndNumber = 1; //seed
     
      for (i=0;i<Nfft;i++) //get pseudo random signal
      {
         myRndNumber    = NextRand32(myRndNumber);
         cxfTimesig[i]  = ((float) myRndNumber / UINT_MAX)*2-1;
         cxfTimeaxis[i] = ((float) i + 1.0) / fs;
      }
      hand   = 0;
      status = DftiCreateDescriptor(&hand, DFTI_SINGLE, DFTI_REAL, 1, Nfft);
      status = DftiSetValue(hand, DFTI_PLACEMENT, DFTI_NOT_INPLACE);
      status = DftiCommitDescriptor(hand);
      
      for (i=0;i<loops;i++)
      {
         hpfcTimer.Start(); //start timer for single execution
         status = DftiComputeForward(hand, cxfTimesig.ptr(), cxfFreqsig.ptr());
         cxdTimeLoops[i] = hpfcTimer.Time(); //store the time of this single execution
      }
      DftiFreeDescriptor(&hand);
      
      dTimeMax = 0;
      dTimeMin = cxdTimeLoops[0];
      dTimeAvg = 0;
      for (i=0;i<loops;i++)
      {
         dTimeAvg += cxdTimeLoops[i];
         dTimeMax = max(cxdTimeLoops[i],dTimeMax);
         dTimeMin = min(cxdTimeLoops[i],dTimeMin);
      }
      dTimeAvg /= (double) loops;
      k++;
   }

   //    DOUBLE
   k=0;
   for (exp=exp_start;exp<=exp_stop;exp++)
   {
      Nfft = (unsigned int) pow(2.0,exp);
      cxdFreqsig.alloc(Nfft);
      cxdTimesig.alloc(Nfft);
      cxdTimeaxis.alloc(Nfft);
      
      myRndNumber = 1; //seed
      for (i=0;i<Nfft;i++)      //get pseudo random signal
      {
         myRndNumber    = NextRand32(myRndNumber);
         cxdTimesig[i]  = ((double) myRndNumber / UINT_MAX)*2-1;
         cxdTimeaxis[i] = ((double) i + 1.0) / fs;
      }

      hand   = 0;
      status = DftiCreateDescriptor(&hand, DFTI_DOUBLE, DFTI_REAL, 1, Nfft);
      status = DftiSetValue(hand, DFTI_PLACEMENT, DFTI_NOT_INPLACE);
      status = DftiCommitDescriptor(hand);
      
      for (i=0;i<loops;i++)
      {
         hpfcTimer.Start(); //start timer for single execution
         status = DftiComputeForward(hand, cxdTimesig.ptr(), cxdFreqsig.ptr());
         cxdTimeLoops[i] = hpfcTimer.Time(); //store the time of this single execution
      }
      DftiFreeDescriptor(&hand);
         
      dTimeMax = 0;
      dTimeMin = cxdTimeLoops[0];
      dTimeAvg = 0;
      for (i=0;i<loops;i++)
      {
         dTimeAvg += cxdTimeLoops[i];
         dTimeMax = max(cxdTimeLoops[i],dTimeMax);
         dTimeMin = min(cxdTimeLoops[i],dTimeMin);
      }
      dTimeAvg /= (double) loops;
      k++;
   }
}

dTimeAvg is plotted versus Nfft for float and double. I'm attaching the individual plots with min/max to visualize the outliers.

Thanks, Marian

24 Replies
SergeyKostrov
Valued Contributor II

Hi Marian,

If this is a real problem for your project, please try contacting Intel Premier Support. I don't think anything can be done on our side, since this is an internal issue with the latest version of MKL. I hope Intel's software engineers will look into it.

Best regards, Sergey

Evgueni_P_Intel
Employee

Marian,

The best time observed by the benchmark scales (decreases) with the number of threads, but the average time is dominated by measurement instability.

Here are some tips to stabilize measurements.

  1. Pin threads to CPU cores using the KMP_AFFINITY environment variable or the Windows API for thread affinity
  2. Ensure the benchmark is single-threaded; if your use case is multi-threaded, you may want to look through http://software.intel.com/en-us/articles/different-parallelization-techniques-and-intel-mkl-fft
  3. Prevent the cache warm-up time from dominating your performance measurement -- either increase the value of the loops variable in your code, or exclude from measurement the first call to DftiComputeForward for each Nfft.
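
As a rough illustration of tip 3, here is a minimal sketch of how the per-call timings could be summarized once the warm-up calls are dropped. The names `TimingStats` and `summarize` are hypothetical helpers, not part of MKL, and the sketch assumes the timings have already been copied into a `std::vector<double>`. It reports the median next to the mean, because a median is much less sensitive to a few outliers than an average.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <stdexcept>
#include <vector>

// Summary of per-call timings, in the same unit as the input samples.
struct TimingStats {
    double min;
    double median;
    double max;
    double mean;
};

// Hypothetical helper (not an MKL function): discards the first `warmup`
// samples, which typically include cache warm-up and first-call overhead,
// then summarizes the remaining ones.
TimingStats summarize(std::vector<double> samples, std::size_t warmup) {
    if (samples.size() <= warmup)
        throw std::invalid_argument("need more samples than warm-up runs");

    // Drop the warm-up calls before computing any statistics.
    samples.erase(samples.begin(),
                  samples.begin() + static_cast<std::ptrdiff_t>(warmup));
    std::sort(samples.begin(), samples.end());

    const std::size_t n = samples.size();
    TimingStats s;
    s.min = samples.front();
    s.max = samples.back();
    // Median: middle element, or the mean of the two middle elements.
    s.median = (n % 2 == 1) ? samples[n / 2]
                            : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
    s.mean = std::accumulate(samples.begin(), samples.end(), 0.0) /
             static_cast<double>(n);
    return s;
}
```

In a benchmark like the one above, the vector would hold one timing per call to DftiComputeForward for a given Nfft, and `summarize(samples, 1)` would leave the first call out of the statistics. Plotting the minimum or the median versus Nfft, rather than the average, should make the thread scaling easier to see when a few slow calls are present.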

Please let us know if the above tips help.

Thanks,

Evgueni.

SergeyKostrov
Valued Contributor II

Marian, could you try compiling the test case with the /Qopenmp-report{0|1|2} option? It would be nice to see these reports. Thanks in advance.

Marian_L_
Beginner

Thank you all for the comments. I will try to do it before the evaluation period runs out.
