Parallelizing FFT not seeing 100% CPU

Marshall__Michael_B · ‎01-15-2010

I started a new thread because I could not continue the discussion at http://software.intel.com/en-us/forums/showthread.php?t=71035

Problem: when I use the MKL FFT, I see only one processor in use.

Processor: Intel Core 2 Duo CPU E8400 @ 3.00GHz 2.99 GHz

RAM: 4.0 GB

OS: Windows 7 Pro 64 bit

env: OMP_NUM_THREADS = 2 (this is set in User variables. does it need to be in system variables?)

MKL version : 9.1.026

linking against em64t/lib/mkl_3m64t.lib

I am creating an x64 executable using Visual Studio 2005

I am aligning the arrays to 128-byte boundaries.

Any help is appreciated. Thanks in advance

numElements is 1 << 23

code :

[bash]static INT32 IntelDoubleFFT(INT8     transformType,  //type of transform (1: normal or -1: inverse)
                            double * realDataArray,  //data array (input and output)
                            double * imagDataArray,  //imaginary data array
                            UINT32   numElements)    //size of each data array
{
    UINT32 i;
    _MKL_Complex16 *compDataArray;
    _MKL_Complex16 *out;

    DFTI_DESCRIPTOR_HANDLE complexDescriptor;
    long status = DFTI_NO_ERROR;

    start_o = clock();
	
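    //one block per array: numElements complex values plus 256 spare bytes for alignment
    compDataArray = (_MKL_Complex16*)calloc((size_t)numElements * sizeof(*compDataArray) + 256, 1);
    if (compDataArray == NULL) {
        return -1;
    }

    out = (_MKL_Complex16*)calloc((size_t)numElements * sizeof(*out) + 256, 1);
    if (out == NULL) {
        free(compDataArray);
        return -1;
    }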

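    //round both pointers up to the next 128-byte boundary
    //(after this they no longer match what calloc returned, so they cannot be passed to free)
    UINT64 temp;
    char *align;
    align = (char*)out;
    temp = (UINT64)align;
    align = align + (128 - temp % 128) % 128;
    out = (_MKL_Complex16*)align;

    align = (char*)compDataArray;
    temp = (UINT64)align;
    align = align + (128 - temp % 128) % 128;
    compDataArray = (_MKL_Complex16*)align;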
	
	printf("aligned to %p and %p\n", compDataArray, out);

    //combine real and imag arrays into single complex array for DFT call
    for (i = 0; i < numElements; i++) {
        compDataArray[i].real = realDataArray[i];
        compDataArray[i].imag = -1.0 * imagDataArray[i];
    }

    finish_o = clock();
	overhead += (double)(finish_o - start_o) / CLOCKS_PER_SEC;

    //set up descriptor handle - handle, precision, forward_domain, dimension, length
    status = DftiCreateDescriptor(&complexDescriptor, DFTI_DOUBLE, DFTI_COMPLEX, 1, numElements);

    if (status == DFTI_NO_ERROR) {
        //set the scale factor for the backward transform to be 1/n to make it the inverse of the forward transform
        status = DftiSetValue(complexDescriptor, DFTI_BACKWARD_SCALE, 1.0 / numElements);
        if (status == DFTI_NO_ERROR) {
            status = DftiSetValue(complexDescriptor, DFTI_PLACEMENT, DFTI_NOT_INPLACE);
        }

        if (status == DFTI_NO_ERROR) {
            //commit descriptor for initial calculations
            status = DftiCommitDescriptor(complexDescriptor);

            if (status == DFTI_NO_ERROR) {
                //compute DFT
                if (transformType == 1) { //forward (normal) DFT
                    status = DftiComputeForward(complexDescriptor, compDataArray, out);
                } else { //backward (inverse) DFT
                    status = DftiComputeBackward(complexDescriptor, compDataArray, out);
                }
            }
        }
        DftiFreeDescriptor(&complexDescriptor);          //free memory
    }

    start_o = clock();

    //split complex array for output
    for (i = 0; i < numElements; i++) {
        realDataArray[i] = out[i].real;
        imagDataArray[i] = -1.0 * out[i].imag;
    }

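    //note: both pointers were shifted for alignment above, so they are not the addresses
    //calloc returned and must not be passed to free(); as written, both buffers leak
    //free(compDataArray);
    //free(out);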

    finish_o = clock();
	overhead += (double)(finish_o - start_o) / CLOCKS_PER_SEC;

    if (status == DFTI_NO_ERROR) {
        return 0;
    } else {
        return -1;
    }
}[/bash]
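
For the 128-byte alignment described above, here is a minimal sketch of one way to round a pointer up to a boundary while keeping the pointer that calloc returned, so the buffer can still be freed. The names `AlignedBuffer`, `aligned_calloc` and `aligned_release` are hypothetical helpers made up for this illustration; they are not part of MKL or the C runtime. The sketch assumes a compiler that ships `<stdint.h>`; older Visual C++ versions may need a different header for `uintptr_t`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper type: keeps the pointer calloc returned
   next to the aligned pointer derived from it. */
typedef struct {
    void *raw;      /* pass this one to free() */
    void *aligned;  /* pass this one to the FFT */
} AlignedBuffer;

/* Allocate `bytes` zeroed bytes whose start address is a multiple of
   `alignment` (which must be a power of two).
   On failure both fields are NULL. */
static AlignedBuffer aligned_calloc(size_t bytes, size_t alignment)
{
    AlignedBuffer buf = { NULL, NULL };
    uintptr_t addr;

    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    if (bytes > SIZE_MAX - alignment) {
        return buf;                       /* size would overflow */
    }

    /* over-allocate by `alignment` bytes so rounding up always fits */
    buf.raw = calloc(bytes + alignment, 1);
    if (buf.raw == NULL) {
        return buf;
    }

    /* round the address up to the next multiple of `alignment` */
    addr = (uintptr_t)buf.raw;
    addr = (addr + alignment - 1) & ~(uintptr_t)(alignment - 1);
    buf.aligned = (void *)addr;
    return buf;
}

/* Free the original allocation and clear both pointers. */
static void aligned_release(AlignedBuffer *buf)
{
    free(buf->raw);
    buf->raw = NULL;
    buf->aligned = NULL;
}
```

In the function above this would be called once per array, for example `aligned_calloc(numElements * sizeof(_MKL_Complex16), 128)`, with `aligned_release` taking the place of the commented-out `free` calls.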

Dmitry_B_Intel · ‎01-17-2010

Hi,

You may set the environment variable KMP_AFFINITY=verbose,compact to see how many threads the FFT starts internally. In your case the transform should be done in parallel, with ~40% improvement on 2 threads.

Thanks
Dima

Gennady_F_Intel · ‎01-17-2010

Quoting mimarsh2

OS: Windows 7 Pro 64 bit
env: OMP_NUM_THREADS = 2 (this is set in User variables. does it need to be in system variables?)

MKL version : 9.1.026

>>>>>>
Yes. Only in that case will you get the performance improvement on 2 threads
that Dima mentioned above.

--Gennady
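
Whichever place the variable is set in, it only matters if the running process actually sees it. On Windows, a changed environment variable reaches only processes started after the change, so Visual Studio or the console has to be restarted first. A minimal sketch of a check that could be called at program start follows; `parse_thread_count`, `requested_omp_threads` and `report_omp_threads` are hypothetical names for this illustration and use only the standard C library, not MKL or OpenMP calls.

```c
#include <stdio.h>
#include <stdlib.h>

/* Parse a thread-count string such as "2".
   Returns 0 when the string is missing, empty, non-numeric, or below 1. */
static long parse_thread_count(const char *value)
{
    char *end;
    long n;

    if (value == NULL || *value == '\0') {
        return 0;
    }
    n = strtol(value, &end, 10);
    if (end == value || n < 1) {
        return 0;
    }
    return n;
}

/* Thread count requested through OMP_NUM_THREADS as seen by this process,
   or 0 when the variable is not visible or not usable. */
static long requested_omp_threads(void)
{
    return parse_thread_count(getenv("OMP_NUM_THREADS"));
}

/* Print what this process sees, e.g. before the first FFT call. */
static void report_omp_threads(void)
{
    long n = requested_omp_threads();

    if (n == 0) {
        printf("OMP_NUM_THREADS is not visible to this process\n");
    } else {
        printf("OMP_NUM_THREADS = %ld\n", n);
    }
}
```

If the report says the variable is not visible, the setting never reached the executable, and the threading library would fall back to its own default regardless of what is configured in the system dialog.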