nicpac22 · ‎04-22-2015

Hi,

I recently noticed that when using a threaded 1-dimensional DFT, a DFTI_COMPLEX domain DFT does not appear to respect the DFTI_THREAD_LIMIT and instead always uses the threading value set by mkl_set_num_threads(). Furthermore, it appears that while a REAL domain DFT does obey DFTI_THREAD_LIMIT, its behavior has changed between MKL v11.1 and 11.2.

In MKL v11.1, calling mkl_set_num_threads(4) followed by DftiSetValue(dft_handle, DFTI_THREAD_LIMIT, 2) caused a complex-valued DFT to spawn and use 4 threads, while a real-valued DFT spawned and used only 2 threads. In MKL v11.2, the complex-valued DFT still spawned and used 4 threads, but the real-valued DFT now also spawned 4 threads while fully utilizing only 2 of them (2 cores ran at 90%+ and 2 ran at ~10%). Below is the example code I've been using to replicate this behavior. I was wondering if I'm doing something wrong or possibly misunderstanding the expected behavior of DFTI_THREAD_LIMIT.

// mkl_thread_test.cpp - Computes large threaded DFTs
//  arg1 = 'C' for complex domain or 'R' for real (optional, default REAL)
//  arg2 = scale factor for DFT (optional, default 1)

#include <iostream>
#include <cstdlib>
#include <vector>
#include <string.h>
#include <mkl.h>
#include <omp.h>

using namespace std;

int main(int argc, char* argv[])
{
  int mklThreads = 4;
  int dftThreads = 2;
  int dftSize = 10000000;
  int loops = 100;
  
  DFTI_CONFIG_VALUE domain = DFTI_REAL;
  int sizeMultiplier = 1;
  if(argc > 1 && argv[1][0] == 'C')
  {
    domain = DFTI_COMPLEX;
    sizeMultiplier = 2;
  }
  float scale = 1.0f;
  if(argc > 2) scale = atof(argv[2]);
  
  // print version number for reference
  char version[DFTI_VERSION_LENGTH];
  DftiGetValue(0, DFTI_VERSION, version);
  cerr<<"MKL Version: "<<version<<endl;
  
  vector<float> vin((dftSize+loops)*sizeMultiplier);
  vector<float> vout(dftSize*sizeMultiplier);
  vector<float> vtmp(loops); // for saving output to avoid compiler optimizing out computation
  for(size_t i=0; i<vin.size(); ++i)
  {
    vin[i] = float(rand())/float(RAND_MAX)-.5f;
  }
  
  MKL_LONG status;
  DFTI_DESCRIPTOR_HANDLE dft;
  mkl_set_num_threads(mklThreads);
  status = DftiCreateDescriptor(&dft, DFTI_SINGLE, domain, 1, dftSize);
  if(status != DFTI_NO_ERROR) cerr<<DftiErrorMessage(status)<<endl;
  
  status = DftiSetValue(dft, DFTI_PLACEMENT, DFTI_INPLACE);
  if(status != DFTI_NO_ERROR) cerr<<DftiErrorMessage(status)<<endl;
  
  status = DftiSetValue(dft, DFTI_PACKED_FORMAT, DFTI_PERM_FORMAT);
  if(status != DFTI_NO_ERROR) cerr<<DftiErrorMessage(status)<<endl;
  
  status = DftiSetValue(dft, DFTI_ORDERING, DFTI_ORDERED);
  if(status != DFTI_NO_ERROR) cerr<<DftiErrorMessage(status)<<endl;
  
  status = DftiSetValue(dft, DFTI_FORWARD_SCALE, scale);
  if(status != DFTI_NO_ERROR) cerr<<DftiErrorMessage(status)<<endl;
  
  status = DftiSetValue(dft, DFTI_THREAD_LIMIT, dftThreads);
  if(status != DFTI_NO_ERROR) cerr<<DftiErrorMessage(status)<<endl;
  
  status = DftiCommitDescriptor(dft);
  if (status != DFTI_NO_ERROR) cerr<<DftiErrorMessage(status)<<endl;
  
  cerr<<"Computing "<<loops;
  if(domain == DFTI_COMPLEX) cerr<<" COMPLEX";
  else cerr<<" REAL";
  cerr<<" DFTs of size "<<dftSize<<" using "<<mklThreads<<" MKL threads but limiting DFT to "<<dftThreads<<" threads"<<endl;
  
  MKL_LONG threadLimit;
  status = DftiGetValue(dft, DFTI_THREAD_LIMIT, &threadLimit);
  if(status != DFTI_NO_ERROR) cerr<<DftiErrorMessage(status)<<endl;
  cerr<<"Thread limit: "<<threadLimit<<endl;
  
  for(int i=0; i<loops; ++i)
  {
    memcpy(&vout[0], &vin[i], dftSize*sizeMultiplier*sizeof(float));
    status = DftiComputeForward(dft, &vout[0]);
    if(status != DFTI_NO_ERROR) cerr<<DftiErrorMessage(status)<<endl;
    vtmp[i] = vout[0]; // save a value to avoid optimizing out the DFT computation
  }
  cerr<<"Finished execution"<<endl;
  status = DftiFreeDescriptor(&dft);
  if(status != DFTI_NO_ERROR) cerr<<DftiErrorMessage(status)<<endl;
  
  return 0;
}

For reference, I'm running on a quad-core processor under 64-bit Linux (Ubuntu 14.04) and compiling with icpc v14.0.4 (for MKL v11.1) and v15.0.2 (for MKL v11.2). My compile line looks like:

icpc -O3 -xHost -openmp -I${MKLROOT}/include -o mkl_thread_test mkl_thread_test.cpp -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -lpthread -lm -liomp5

When I call ./mkl_thread_test C, I notice 4 cores spinning at 95%. When I call ./mkl_thread_test R, I notice just 2 cores spinning at 95% for MKL v11.1, and 2 cores spinning at 95% plus 2 more cores spinning at 10% for MKL v11.2. My exact versions of MKL are 11.1.4 Product Build 20140806 and 11.2.2 Product Build 20150120. In both cases, the value returned by DftiGetValue(DFTI_THREAD_LIMIT) is 2, so my expectation is that the DFT should use only 2 threads regardless of MKL version or real vs. complex domain.
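The expectation stated above can be written down as a small helper. This is only a sketch of the semantics as I understand them from the MKL documentation, and it is an assumption rather than something confirmed in this thread: a DFTI_THREAD_LIMIT of 0 is taken to mean "no descriptor-specific limit", and a positive value is taken to cap the thread count at the smaller of the limit and the number of threads MKL is allowed to use. The name expectedDftThreads is hypothetical and not part of MKL.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper (not an MKL function): the number of threads one would
// expect a descriptor to use, ASSUMING that DFTI_THREAD_LIMIT == 0 means
// "no descriptor-specific limit" and that a positive limit acts as a cap.
inline int expectedDftThreads(int mklMaxThreads, int threadLimit)
{
  assert(mklMaxThreads >= 1 && threadLimit >= 0);
  if(threadLimit == 0) return mklMaxThreads;     // no per-descriptor cap
  return std::min(mklMaxThreads, threadLimit);   // cap at the smaller value
}
```

For the test program in this post (mklThreads = 4, dftThreads = 2), the helper returns 2 for both the real and the complex case, which is the behavior the report expects from MKL.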

Am I doing something wrong or should MKL be respecting the value of DFTI_THREAD_LIMIT?

--Nick

Ying_H_Intel · ‎04-23-2015

Hi Nick,

Thanks for the sample. I had assumed that threading of 1D FFTs was limited in this case, but according to the documentation:

1D transforms are threaded in many cases.

(N) > 16, and input/output strides equal 1.
1D complex-to-complex transforms using split-complex layout are not threaded.
Multidimensional transforms

We will look into this and get back to you.

Regards,

Ying

Evgueni_P_Intel · ‎04-23-2015

Hi Nick,

Thank you for reporting this issue. We will fix it in a future release of Intel MKL.

You may want to use mkl_set_num_threads_local to set the number of threads for _all_ Intel MKL functions on the _current_ execution thread, or mkl_set_num_threads to set the number of threads for _all_ Intel MKL functions on _all_ execution threads, or mkl_domain_set_num_threads to set the number of threads for _specific_ Intel MKL functions on _all_ execution threads.
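The thread-local setter mentioned above lends itself to a scoped guard that restores the previous value automatically. The sketch below is one possible way to do that, not an Intel-provided utility. ScopedThreadCount and demo::setThreadsLocal are hypothetical names, and the stand-in setter only mimics the contract that mkl_set_num_threads_local is assumed to have here (it takes the new thread count and returns the previous thread-local value, where 0 means the global setting applies), so the example compiles without MKL.

```cpp
#include <cassert>

// Function type matching the assumed contract of mkl_set_num_threads_local:
// takes the new thread count, returns the previous thread-local value.
typedef int (*SetThreadsFn)(int);

// Hypothetical scoped guard: applies a thread count on construction and
// restores the previous value when the guard goes out of scope.
class ScopedThreadCount
{
public:
  ScopedThreadCount(SetThreadsFn setFn, int threads)
    : setFn_(setFn), previous_(0)
  {
    assert(setFn != 0 && threads >= 0);
    previous_ = setFn_(threads);
  }
  ~ScopedThreadCount() { setFn_(previous_); } // restore on scope exit
  int previous() const { return previous_; }

private:
  ScopedThreadCount(const ScopedThreadCount&);            // non-copyable
  ScopedThreadCount& operator=(const ScopedThreadCount&); // non-assignable
  SetThreadsFn setFn_;
  int previous_;
};

// Stand-in setter with the same contract, used so the sketch builds without MKL.
namespace demo
{
  int currentThreads = 0; // 0 mirrors "no thread-local setting"
  int setThreadsLocal(int n)
  {
    int prev = currentThreads;
    currentThreads = n;
    return prev;
  }
}
```

With MKL linked, passing mkl_set_num_threads_local in place of demo::setThreadsLocal should confine the setting to MKL calls made on the current thread while the guard is alive, for example around a single DftiComputeForward call.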

Thank you!

Evgueni.

nicpac22 · ‎04-23-2015

No problem, happy to help. I had a few follow-up questions. When using a complex DFT in the example provided, I noticed that if I change DFTI_ORDERED to DFTI_BACKWARD_SCRAMBLED, the thread limit automatically gets set to 1 and only a single thread is used. Is this behavior expected? I couldn't find anything in the documentation about ordering affecting the thread limit.

Also, if I use two different DFT descriptor handles within the same programming scope and I want each to use a different number of threads (let's say 4 and 2 respectively), is it safe to call mkl_set_num_threads(4), commit the first DFT descriptor, then call mkl_set_num_threads(2), commit the second descriptor, and then use both within a loop? Will this effectively let the first handle use 4 threads and the second use 2 (assuming I have 6+ CPUs to work with on the machine)?

Cheers,

Nick

Evgueni_P_Intel · ‎04-24-2015

If DFTI_BACKWARD_SCRAMBLED is set, 1D FFTs are not threaded, though they could be and they would possibly be faster than in the DFTI_ORDERED case. If performance of large 1D transforms is critical for your application, please request tuning the scrambled case at premier.intel.com.

The workaround that you propose (calling mkl_set_num_threads) should work with the existing versions of Intel MKL but may stop working in the future.

Thank you!

Evgueni.

Complex 1-D DFT not respecting DFTI_THREAD_LIMIT