Hi,
I recently noticed that when using a threaded 1-dimensional DFT, a DFTI_COMPLEX domain DFT does not appear to respect the DFTI_THREAD_LIMIT and instead always uses the threading value set by mkl_set_num_threads(). Furthermore, it appears that while a REAL domain DFT does obey DFTI_THREAD_LIMIT, its behavior has changed between MKL v11.1 and 11.2.
In MKL v11.1, setting mkl_set_num_threads(4) followed by DftiSetValue(dft_handle, DFTI_THREAD_LIMIT, 2) caused a complex-valued DFT to spawn and use 4 threads while a real-valued DFT only spawned and used 2 threads. In MKL 11.2, the complex-valued DFT still spawned and used 4 cores, but now the real-valued DFT also spawned 4 cores but only utilized 2 of them (2 cores were utilized at 90%+ and 2 were utilized at ~10%). Below is the example code I've been using to replicate this behavior. I was wondering if I'm doing something wrong or possibly misunderstanding the expected behavior of DFTI_THREAD_LIMIT.
// mkl_thread_test.cpp - Computes large threaded DFTs
// arg1 = 'C' for complex domain or 'R' for real (optional, default REAL)
// arg2 = scale factor for DFT (optional, default 1)
#include <iostream>
#include <cstdlib>
#include <vector>
#include <string.h>
#include <mkl.h>
#include <omp.h>
using namespace std;
int main(int argc, char* argv[])
{
int mklThreads = 4;
int dftThreads = 2;
int dftSize = 10000000;
int loops = 100;
DFTI_CONFIG_VALUE domain = DFTI_REAL;
int sizeMultiplier = 1;
if(argc > 1 && argv[1][0] == 'C') // guard against a missing first argument
{
domain = DFTI_COMPLEX;
sizeMultiplier = 2;
}
float scale = 1.0f;
if(argc > 2) scale = atof(argv[2]);
// print version number for reference
char version[DFTI_VERSION_LENGTH];
DftiGetValue(0, DFTI_VERSION, version);
cerr<<"MKL Version: "<<version<<endl;
vector<float> vin((dftSize+loops)*sizeMultiplier);
vector<float> vout(dftSize*sizeMultiplier);
vector<float> vtmp(loops); // for saving output to avoid compiler optimizing out computation
for(size_t i=0; i<vin.size(); ++i)
{
vin[i] = float(rand())/float(RAND_MAX)-.5f; // random input in [-0.5, 0.5]
}
MKL_LONG status;
DFTI_DESCRIPTOR_HANDLE dft;
mkl_set_num_threads(mklThreads);
status = DftiCreateDescriptor(&dft, DFTI_SINGLE, domain, 1, dftSize);
if(status != DFTI_NO_ERROR) cerr<<DftiErrorMessage(status)<<endl;
status = DftiSetValue(dft, DFTI_PLACEMENT, DFTI_INPLACE);
if(status != DFTI_NO_ERROR) cerr<<DftiErrorMessage(status)<<endl;
status = DftiSetValue(dft, DFTI_PACKED_FORMAT, DFTI_PERM_FORMAT);
if(status != DFTI_NO_ERROR) cerr<<DftiErrorMessage(status)<<endl;
status = DftiSetValue(dft, DFTI_ORDERING, DFTI_ORDERED);
if(status != DFTI_NO_ERROR) cerr<<DftiErrorMessage(status)<<endl;
status = DftiSetValue(dft, DFTI_FORWARD_SCALE, scale);
if(status != DFTI_NO_ERROR) cerr<<DftiErrorMessage(status)<<endl;
status = DftiSetValue(dft, DFTI_THREAD_LIMIT, dftThreads);
if(status != DFTI_NO_ERROR) cerr<<DftiErrorMessage(status)<<endl;
status = DftiCommitDescriptor(dft);
if (status != DFTI_NO_ERROR) cerr<<DftiErrorMessage(status)<<endl;
cerr<<"Computing "<<loops;
if(domain == DFTI_COMPLEX) cerr<<" COMPLEX";
else cerr<<" REAL";
cerr<<" DFTs of size "<<dftSize<<" using "<<mklThreads<<" MKL threads but limiting DFT to "<<dftThreads<<" threads"<<endl;
MKL_LONG threadLimit;
status = DftiGetValue(dft, DFTI_THREAD_LIMIT, &threadLimit);
if(status != DFTI_NO_ERROR) cerr<<DftiErrorMessage(status)<<endl;
cerr<<"Thread limit: "<<threadLimit<<endl;
for(int i=0; i<loops; ++i)
{
memcpy(&vout[0], &vin[i], dftSize*sizeMultiplier*sizeof(float)); // shift the input window each loop
status = DftiComputeForward(dft, &vout[0]);
if(status != DFTI_NO_ERROR) cerr<<DftiErrorMessage(status)<<endl;
vtmp[i] = vout[0]; // save a value to avoid optimizing out the DFT computation
}
cerr<<"Finished execution"<<endl;
status = DftiFreeDescriptor(&dft);
if(status != DFTI_NO_ERROR) cerr<<DftiErrorMessage(status)<<endl;
return 0;
}
For reference, I'm running on a quad-core processor with 64-bit Linux (Ubuntu 14.04) and compiling with icpc v14.0.4 (for MKL v11.1) and v15.0.2 (for MKL v11.2). My compile line looks like:
icpc -O3 -xHost -openmp -I${MKLROOT}/include -o mkl_thread_test mkl_thread_test.cpp -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -lpthread -lm -liomp5
When I call ./mkl_thread_test C, I see 4 cores spinning at 95%. When I call ./mkl_thread_test R, I see just 2 cores spinning at 95% for MKL v11.1, and 2 cores spinning at 95% plus 2 more cores spinning at 10% for MKL v11.2. My exact versions of MKL are 11.1.4 Product Build 20140806 and 11.2.2 Product Build 20150120. In both cases, the value returned by DftiGetValue(DFTI_THREAD_LIMIT) is 2, so my expectation is that the DFT should only be using 2 threads regardless of MKL version or real vs. complex DFT.
Am I doing something wrong or should MKL be respecting the value of DFTI_THREAD_LIMIT?
--Nick
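The expectation stated above can be written down as a small helper. This is only a sketch of the behavior the post expects, not a description of what MKL does internally: expected_dft_threads is a hypothetical name, and treating a limit of 0 as "no descriptor-specific limit" is an assumption based on a reading of the DFTI_THREAD_LIMIT documentation.

```cpp
#include <algorithm>

// Hypothetical helper (not part of MKL): the largest number of threads a DFT
// should use if the descriptor's DFTI_THREAD_LIMIT were honored.
// Assumption: a limit of 0 (or less) means the descriptor imposes no limit,
// so the MKL-wide thread count applies.
long expected_dft_threads(long mkl_threads, long thread_limit)
{
    if (thread_limit <= 0) return mkl_threads;
    return std::min(mkl_threads, thread_limit);
}
```

With the values from the test program (4 MKL threads, a limit of 2), the helper gives 2, which is what the real-domain DFT showed on MKL v11.1 and what the complex-domain DFT did not.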
Hi Nick,
Thanks for the sample. I had assumed that threading of 1D FFTs was limited here, but according to the documentation:
1D transforms are threaded in many cases.
(N) > 16, and input/output strides equal 1.
1D complex-to-complex transforms using split-complex layout are not threaded.
Multidimensional transforms
We will look into this and get back to you.
Regards,
Ying
Hi Nick,
Thank you for reporting this issue. We will fix it in a future release of Intel MKL.
You may want to use mkl_set_num_threads_local to set the number of threads for _all_ Intel MKL functions on the _current_ execution thread, or mkl_set_num_threads to set the number of threads for _all_ Intel MKL functions on _all_ execution threads, or mkl_domain_set_num_threads to set the number of threads for _specific_ Intel MKL functions on _all_ execution threads.
Thank you!
Evgueni.
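One way to apply the thread-local setter mentioned above is a small scope guard that sets a thread count on entry and restores the previous one on exit. The sketch below is illustrative only: ScopedThreadCount is a hypothetical class, and it takes the setter as a parameter so it compiles without MKL. It assumes the setter behaves like mkl_set_num_threads_local, that is, it takes the new count and returns the previous one.

```cpp
#include <functional>
#include <utility>

// Hypothetical RAII guard (not part of MKL). The setter is assumed to take
// the new thread count and return the previous one, in the style of
// mkl_set_num_threads_local.
class ScopedThreadCount
{
public:
    using Setter = std::function<int(int)>;

    ScopedThreadCount(Setter setter, int threads)
        : setter_(std::move(setter)), previous_(setter_(threads)) {}

    // Restore whatever count was active before the guard was created.
    ~ScopedThreadCount() { setter_(previous_); }

    ScopedThreadCount(const ScopedThreadCount&) = delete;
    ScopedThreadCount& operator=(const ScopedThreadCount&) = delete;

    int previous() const { return previous_; }

private:
    Setter setter_;   // declared first so it is initialized before previous_
    int previous_;
};
```

In a program linked against MKL, a guard such as ScopedThreadCount guard(mkl_set_num_threads_local, 2); placed in the scope that calls DftiComputeForward would request 2 threads for MKL calls on that execution thread and undo the change when the scope ends. Whether this sidesteps the DFTI_THREAD_LIMIT issue on a given MKL version is not confirmed by this thread and would need to be checked.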
No problem, happy to help. I had a few follow-up questions. I noticed that when using a complex DFT in the example provided, if I change DFTI_ORDERED to DFTI_BACKWARD_SCRAMBLED, the thread limit automatically gets set to 1 and only a single thread is used. Is this behavior expected? I couldn't find anything in the documentation about ordering impacting the thread limit.
Also, if I use two different DFT descriptor handles within the same programming scope and I want each to use a different number of threads (let's say 4 and 2 respectively), is it safe to call mkl_set_num_threads(4), commit the first DFT descriptor, then call mkl_set_num_threads(2), commit the second descriptor, and then use both within a loop? Will this effectively let the first handle use 4 threads and the second use 2 (assuming I have 6+ CPUs to work with on the machine)?
Cheers,
Nick
If DFTI_BACKWARD_SCRAMBLED is set, 1D FFTs are not threaded, though they could be and they would possibly be faster than in the DFTI_ORDERED case. If performance of large 1D transforms is critical for your application, please request tuning the scrambled case at premier.intel.com.
The workaround that you propose (calling mkl_set_num_threads) should work with the existing versions of Intel MKL but may stop working in the future.
Thank you!
Evgueni.