Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Parallelizing FFT not seeing 100% CPU

Marshall__Michael_B

I started a new thread because it seemed I could not continue the discussion at http://software.intel.com/en-us/forums/showthread.php?t=71035

Problem: when I use the MKL FFT, I see only one processor in use.

Processor: Intel Core 2 Duo CPU E8400 @ 3.00GHz 2.99 GHz

RAM: 4.0 GB

OS: Windows 7 Pro 64 bit

env: OMP_NUM_THREADS = 2 (this is set under User variables; does it need to be under System variables?)

MKL version : 9.1.026

linking against em64t/lib/mkl_3m64t.lib

I am creating an x64 executable using Visual Studio 2005

I am aligning the arrays to addresses divisible by 128 bytes.

Any help is appreciated. Thanks in advance

numElements is 1 << 23

code :

[bash]static INT32 IntelDoubleFFT(INT8     transformType,  //type of transform (1: normal or -1: inverse)
                            double * realDataArray,  //real data array (input and output)
                            double * imagDataArray,  //imaginary data array (input and output)
                            UINT32   numElements)    //size of each data array
{
    UINT32 i;
    UINT64 addr;
    void *compDataRaw;              //blocks as returned by calloc, kept so they can be freed
    void *outRaw;
    _MKL_Complex16 *compDataArray;  //128-byte-aligned views into the blocks above
    _MKL_Complex16 *out;

    DFTI_DESCRIPTOR_HANDLE complexDescriptor;
    long status = DFTI_NO_ERROR;

    //start_o, finish_o and overhead are timing variables declared outside this function
    start_o = clock();

    //room for numElements complex values plus 128 spare bytes for the alignment shift
    compDataRaw = calloc(1, (size_t)numElements * sizeof(*compDataArray) + 128);
    if (compDataRaw == NULL) {
        return -1;
    }

    outRaw = calloc(1, (size_t)numElements * sizeof(*out) + 128);
    if (outRaw == NULL) {
        free(compDataRaw);
        return -1;
    }

    //round each pointer up to the next address divisible by 128
    addr = (UINT64)outRaw;
    out = (_MKL_Complex16*)((char*)outRaw + (128 - addr % 128) % 128);

    addr = (UINT64)compDataRaw;
    compDataArray = (_MKL_Complex16*)((char*)compDataRaw + (128 - addr % 128) % 128);

    printf("aligned to %p and %p\n", (void*)compDataArray, (void*)out);

    //combine real and imag arrays into single complex array for DFT call
    for (i = 0; i < numElements; i++) {
        compDataArray[i].real = realDataArray[i];
        compDataArray[i].imag = -1.0 * imagDataArray[i];
    }

    finish_o = clock();
    overhead += (double)(finish_o - start_o) / CLOCKS_PER_SEC;

    //set up descriptor handle - handle, precision, forward_domain, dimension, length
    status = DftiCreateDescriptor(&complexDescriptor, DFTI_DOUBLE, DFTI_COMPLEX, 1, numElements);

    if (status == DFTI_NO_ERROR) {
        //set the scale factor for the backward transform to be 1/n to make it the inverse of the forward transform
        status = DftiSetValue(complexDescriptor, DFTI_BACKWARD_SCALE, 1.0 / numElements);

        if (status == DFTI_NO_ERROR) {
            //write the result to a separate output array instead of overwriting the input
            status = DftiSetValue(complexDescriptor, DFTI_PLACEMENT, DFTI_NOT_INPLACE);
        }

        if (status == DFTI_NO_ERROR) {
            //commit descriptor for initial calculations
            status = DftiCommitDescriptor(complexDescriptor);

            if (status == DFTI_NO_ERROR) {
                //compute DFT
                if (transformType == 1) { //forward (normal) DFT
                    status = DftiComputeForward(complexDescriptor, compDataArray, out);
                } else { //backward (inverse) DFT
                    status = DftiComputeBackward(complexDescriptor, compDataArray, out);
                }
            }
        }
        DftiFreeDescriptor(&complexDescriptor); //free memory
    }

    start_o = clock();

    //split complex array for output
    for (i = 0; i < numElements; i++) {
        realDataArray[i] = out[i].real;
        imagDataArray[i] = -1.0 * out[i].imag;
    }

    //free the original (unaligned) pointers, not the aligned views
    free(compDataRaw);
    free(outRaw);

    finish_o = clock();
    overhead += (double)(finish_o - start_o) / CLOCKS_PER_SEC;

    if (status == DFTI_NO_ERROR) {
        return 0;
    } else {
        return -1;
    }
}[/bash]
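For readers following the alignment step, here is a minimal sketch of rounding a pointer up to a 128-byte boundary while keeping the original pointer for `free()`. The helper names `align_up` and `calloc_aligned` are hypothetical and are not part of MKL; the sketch assumes a C99 compiler with `<stdint.h>`. Recent MKL releases also provide `mkl_malloc`/`mkl_free` for aligned allocation, which may be simpler where available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not an MKL function): round p up to the next
 * multiple of alignment. alignment must be a power of two. A pointer
 * that is already aligned is returned unchanged. */
static void *align_up(void *p, size_t alignment)
{
    uintptr_t addr = (uintptr_t)p;
    uintptr_t mask = (uintptr_t)alignment - 1;

    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (void *)((addr + mask) & ~mask);
}

/* Hypothetical helper: allocate zeroed storage for count elements of
 * elem_size bytes, aligned to alignment. *raw_out receives the pointer
 * that must later be passed to free(); the return value is the aligned
 * pointer to use for the data, or NULL on failure. */
static void *calloc_aligned(size_t count, size_t elem_size,
                            size_t alignment, void **raw_out)
{
    void *raw;

    assert(raw_out != NULL);
    *raw_out = NULL;

    /* Reject sizes whose byte count would overflow size_t. */
    if (elem_size != 0 && count > (SIZE_MAX - alignment) / elem_size) {
        return NULL;
    }

    /* The extra alignment bytes leave room for the upward shift. */
    raw = calloc(1, count * elem_size + alignment);
    if (raw == NULL) {
        return NULL;
    }

    *raw_out = raw;
    return align_up(raw, alignment);
}
```

A caller would keep both pointers, use the aligned one for the transform, and pass the raw one to `free()` when done. Note that the padding is added once per block, not once per element.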

2 Replies
Dmitry_B_Intel
Employee

Hi,

You may set the environment variable KMP_AFFINITY=verbose,compact to see how many threads the FFT starts internally. In your case the transform should be done in parallel, with ~40% improvement on 2 threads.

Thanks
Dima

Gennady_F_Intel
Moderator
Quoting mimarsh2
OS: Windows 7 Pro 64 bit

env: OMP_NUM_THREADS = 2 (this is set in User variables. does it need to be in system variables?)

MKL version : 9.1.026

>>>>>>

Yes, only in this case will you get the performance improvement on 2 threads
that Dima mentioned above.
--Gennady
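
One way to check whether the setting actually reaches the program is to read the variable from inside the process before the first FFT call. The sketch below is only an illustration: `parse_thread_count` and `env_thread_count` are hypothetical helper names, not MKL or OpenMP functions, and the parser accepts only the plain single-integer form such as `2`.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: parse a positive thread count from text such as
 * "2". Returns 0 when the text is NULL, empty, not a plain integer, or
 * outside the range 1..1024. */
static int parse_thread_count(const char *text)
{
    char *end;
    long value;

    if (text == NULL || *text == '\0') {
        return 0;
    }

    value = strtol(text, &end, 10);
    if (*end != '\0' || value <= 0 || value > 1024) {
        return 0;
    }
    return (int)value;
}

/* Hypothetical helper: return the thread count this process sees in the
 * named environment variable (for example "OMP_NUM_THREADS"), or 0 when
 * the variable is not visible to the process or is not a plain integer. */
static int env_thread_count(const char *name)
{
    assert(name != NULL);
    return parse_thread_count(getenv(name));
}
```

Printing `env_thread_count("OMP_NUM_THREADS")` at startup shows what the process inherited. On Windows a process receives its environment when it is launched, so a Visual Studio instance or console window that was already open when the variable was edited would typically need to be restarted before a result of 2 appears.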