I started a new thread because it seemed I could not continue the discussion at http://software.intel.com/en-us/forums/showthread.php?t=71035
Problem: when I use the MKL FFT, I only see one processor in use.
Processor: Intel Core 2 Duo CPU E8400 @ 3.00GHz 2.99 GHz
RAM: 4.0 GB
OS: Windows 7 Pro 64 bit
env: OMP_NUM_THREADS = 2 (this is set in User variables. Does it need to be in System variables?)
MKL version : 9.1.026
linking against em64t/lib/mkl_3m64t.lib
I am creating an x64 executable using Visual Studio 2005
I am aligning the arrays to 128-byte boundaries.
Any help is appreciated. Thanks in advance.
numElements is 1 << 23
Code:
[bash]//start_o, finish_o (clock_t) and overhead (double) are timing variables declared outside this function
static INT32 IntelDoubleFFT(INT8 transformType, //type of transform (1: normal or -1: inverse)
                            double * realDataArray, //data array (input and output)
                            double * imagDataArray, //imaginary data array
                            UINT32 numElements) //size of each data array
{
    UINT32 i;
    void *compDataRaw; //pointers exactly as returned by calloc, kept so they can be freed
    void *outRaw;
    _MKL_Complex16 *compDataArray; //128-byte aligned views into the raw buffers
    _MKL_Complex16 *out;
    UINT64 temp;
    DFTI_DESCRIPTOR_HANDLE complexDescriptor;
    long status = DFTI_NO_ERROR;
    start_o = clock();
    //allocate 128 spare bytes per buffer so the start address can be rounded up
    compDataRaw = calloc((size_t)numElements * sizeof(*compDataArray) + 128, 1);
    if (compDataRaw == NULL) {
        return -1;
    }
    outRaw = calloc((size_t)numElements * sizeof(*out) + 128, 1);
    if (outRaw == NULL) {
        free(compDataRaw);
        return -1;
    }
    //round each address up to the next multiple of 128
    temp = (UINT64)outRaw;
    out = (_MKL_Complex16*)((char*)outRaw + (128 - temp % 128) % 128);
    temp = (UINT64)compDataRaw;
    compDataArray = (_MKL_Complex16*)((char*)compDataRaw + (128 - temp % 128) % 128);
    printf("aligned to %p and %p\n", (void*)compDataArray, (void*)out);
    //combine real and imag arrays into single complex array for DFT call
    for (i = 0; i < numElements; i++) {
        compDataArray[i].real = realDataArray[i];
        compDataArray[i].imag = -1.0 * imagDataArray[i];
    }
    finish_o = clock();
    overhead += (double)(finish_o - start_o) / CLOCKS_PER_SEC;
    //set up descriptor handle - handle, precision, forward_domain, dimension, length
    status = DftiCreateDescriptor(&complexDescriptor, DFTI_DOUBLE, DFTI_COMPLEX, 1, numElements);
    if (status == DFTI_NO_ERROR) {
        //set the scale factor for the backward transform to be 1/n to make it the inverse of the forward transform
        status = DftiSetValue(complexDescriptor, DFTI_BACKWARD_SCALE, 1.0 / (double)numElements);
        if (status == DFTI_NO_ERROR) {
            //input and output are separate buffers
            status = DftiSetValue(complexDescriptor, DFTI_PLACEMENT, DFTI_NOT_INPLACE);
        }
        if (status == DFTI_NO_ERROR) {
            //commit descriptor for initial calculations
            status = DftiCommitDescriptor(complexDescriptor);
        }
        if (status == DFTI_NO_ERROR) {
            //compute DFT
            if (transformType == 1) { //forward (normal) DFT
                status = DftiComputeForward(complexDescriptor, compDataArray, out);
            } else { //backward (inverse) DFT
                status = DftiComputeBackward(complexDescriptor, compDataArray, out);
            }
        }
        DftiFreeDescriptor(&complexDescriptor); //free memory
    }
    start_o = clock();
    //split complex array for output, but only if the transform succeeded
    if (status == DFTI_NO_ERROR) {
        for (i = 0; i < numElements; i++) {
            realDataArray[i] = out[i].real;
            imagDataArray[i] = -1.0 * out[i].imag;
        }
    }
    //free the pointers calloc returned, not the shifted aligned ones
    free(compDataRaw);
    free(outRaw);
    finish_o = clock();
    overhead += (double)(finish_o - start_o) / CLOCKS_PER_SEC;
    if (status == DFTI_NO_ERROR) {
        return 0;
    } else {
        return -1;
    }
}[/bash]
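For reference, aligning a buffer to 128 bytes amounts to rounding its start address up to the next multiple of 128 while keeping the pointer that the allocator returned, since only that one can be passed to free(). A minimal sketch in plain C; `align_up` and `alloc_aligned` are hypothetical helper names for illustration and are not part of MKL:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Round addr up to the next multiple of alignment.
   alignment must be a power of two (128 in this thread). */
static uintptr_t align_up(uintptr_t addr, uintptr_t alignment)
{
    return (addr + alignment - 1) & ~(alignment - 1);
}

/* Hypothetical helper: allocate `bytes` zeroed bytes whose start address
   is a multiple of `alignment`. The pointer to pass to free() later is
   stored in *raw_out; the aligned pointer is returned. */
static void *alloc_aligned(size_t bytes, size_t alignment, void **raw_out)
{
    void *raw;

    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

    /* Over-allocate by `alignment` bytes so rounding up never runs
       past the end of the block. */
    raw = calloc(bytes + alignment, 1);
    *raw_out = raw;
    if (raw == NULL) {
        return NULL;
    }
    return (void *)align_up((uintptr_t)raw, (uintptr_t)alignment);
}
```

A caller would use the returned pointer for the data and call free() on the pointer stored through `raw_out` once the buffer is no longer needed.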
Hi,
You may set the environment variable KMP_AFFINITY=verbose,compact to see how many threads the FFT starts internally. In your case the transform should be done in parallel, with ~40% improvement on 2 threads.
Thanks
Dima
env: OMP_NUM_THREADS = 2 (this is set in User variables. does it need to be in system variables?)
MKL version : 9.1.026
>>>>>>
Yes. Only in that case will you get the performance improvement on 2 threads that Dima mentioned above.
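Variables set through the Windows environment dialog are only inherited by processes started afterwards, so one way to confirm that the setting actually reached the program is to print what the process itself sees. A minimal sketch in standard C; `parse_thread_count`, `threads_from_env` and `report_thread_env` are hypothetical helper names, and this only reads the environment, it does not query MKL or the OpenMP runtime:

```c
#include <stdio.h>
#include <stdlib.h>

/* Parse a thread-count string such as "2".
   Returns 0 if the text is missing or not a positive integer. */
static long parse_thread_count(const char *text)
{
    char *end = NULL;
    long value;

    if (text == NULL || *text == '\0') {
        return 0;
    }
    value = strtol(text, &end, 10);
    if (end == text || value <= 0) {
        return 0;
    }
    return value;
}

/* Thread count requested through the environment variable `name`,
   as seen by the current process (0 means unset or invalid). */
static long threads_from_env(const char *name)
{
    return parse_thread_count(getenv(name));
}

/* Print the threading-related variables mentioned in this thread. */
static void report_thread_env(void)
{
    const char *affinity = getenv("KMP_AFFINITY");

    printf("OMP_NUM_THREADS seen by this process: %ld (0 means unset)\n",
           threads_from_env("OMP_NUM_THREADS"));
    printf("KMP_AFFINITY seen by this process: %s\n",
           affinity != NULL ? affinity : "(unset)");
}
```

Calling `report_thread_env()` once at program start, before the first FFT call, shows whether the variable was picked up. If it reports 0, restarting the command prompt or Visual Studio after changing the variable is worth trying.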
