Solved: How to get best performance with MKL FFT ?

afd_lml · ‎12-13-2009

Hi, all
My workstation has 24-core. The source code is something like the following:

void test()
{
// define some data
const intsize = 10000;
double x[size], y[size], z[size];
doule sum[size];

// nowwecompute x,y,z
work(x, size);
work(y, size);
work(z, size);

// sum x, y, z
for(int i=0; i sum = x + y + z;
}

void work(double array[], const int size)
{
// write data to array
for (int i=0; i array = .....;

// do FFT with MKL
DFTI_DESCRIPTOR_HANDLE my_handle;

MKL_LONG status;

status = DftiCreateDescriptor(&my_handle, DFTI_SINGLE, DFTI_REAL, 1, size);

status = DftiCommitDescriptor(my_handle);

status = DftiComputeForward(my_handle, array);

status = DftiFreeDescriptor(&my_handle);
}

Then, I use openMP to rewrite the above code as follows:

void test()
{
// define some data
const int size = 10000;
double x[size], y[size], z[size];
doule sum[size];

// now we compute x,y,z
#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section
work(x, size);

#pragma omp section
work(y, size);

#pragma omp section
work(z, size);
}

// sum x, y, z
#pragma ompfor
for(int i=0; i sum = x + y + z;
}

void work(double array[], const int size)
{
// write data to array
#pragma omp parallel for
for (int i=0; i array = .....;

int nThread = 8; // the computer has 24 cores, so I set 24/3=8

set_MKL_num_threads(nThread); // the computer has 24 cores, so I set 24/3=8

// do FFT with MKL
DFTI_DESCRIPTOR_HANDLE my_handle;

MKL_LONG status;

status = DftiCreateDescriptor(&my_handle, DFTI_SINGLE, DFTI_REAL, 1, size);

status = DftiSetValue (desc_handle, DFTI_NUMBER_OF_USER_THREADS, nThread);

status = DftiCommitDescriptor(my_handle);

status = DftiComputeForward(my_handle, array);

status = DftiFreeDescriptor(&my_handle);
}

However, from Windows Task Manager, I find the CPU usage is only 30%, not the expected 100%, why ?

Would anyone give me some help ? Many thanks.

Dmitry_B_Intel · ‎12-13-2009

Hi,

Complex-to-complex 1D FFT of size > 2^16 (or smaller if it is 2-power) are threaded.

If you are going to share descriptor between K threads (K=3 in your example), then it should be done like this:

MKL_set_dynamic(0);
omp_set_nested(1);
...
DftiSetValue(hand, DFTI_NUMBER_OF_USER_THREADS, K);
DftiCommitDescriptor(hand);
#pragma omp parallel
{
...
work(..., hand); // K calls in the parallel team
}

Thanks
Dima

View solution in original post

Dmitry_B_Intel · ‎12-13-2009

Hi,

Thank you for providing alldetails in your question.There are several things to comment on.

1D real-to-complex transforms are not threaded in MKL, so you shouldonly observemaximum 3/24CPU usage when in DftiComputeForward. What you see as 30% is probably the parallel initialization of array[] in work().

In work() you create a descriptor and use it on the same thread, that is the descriptor is not shared between multiple user threads,so it makes no sense to set DFTI_NUMBER_OF_USER_THREADS there. It would make sense if the descriptor were created in test() and then shared by threeinvocations of work() that would use the descriptor concurrently.

By default, MKL would not go parallel if it is called from omp parallel region. You should explicitly let it go parallel by setting environment variable MKL_DYNAMIC=false and allowing nested parallelism by setting OMP_NESTED=true (the lattermight be the default setting). Of course, you may set this by calling respective functions instead of setting environment variables.

And finally please notice that real-to-complex transform of size N requires sligthly more space than double, because the result consists of N/2+1 complex numbers, that is size of arrays should be at least 2*(size/2+1) in your case.

Thanks
Dima

afd_lml · ‎12-13-2009

Thank you for your reply.

Quoting - Dmitry Baksheev (Intel)

Hi,

Thank you for providing alldetails in your question.There are several things to comment on.

1D real-to-complex transforms are not threaded in MKL, so you shouldonly observemaximum 3/24CPU usage when in DftiComputeForward. What you see as 30% is probably the parallel initialization of array[] in work().
How about 1D complex-to-complex FFT ?

In work() you create a descriptor and use it on the same thread, that is the descriptor is not shared between multiple user threads,so it makes no sense to set DFTI_NUMBER_OF_USER_THREADS there. It would make sense if the descriptor were created in test() and then shared by threeinvocations of work() that would use the descriptor concurrently.

Do you mean the following three functions should be invoked in test() ?

status = DftiCreateDescriptor(&my_handle, DFTI_SINGLE, DFTI_REAL, 1, size);
status = DftiSetValue (desc_handle, DFTI_NUMBER_OF_USER_THREADS, nThread); // nThread = ?
status = DftiCommitDescriptor(my_handle);

By default, MKL would not go parallel if it is called from omp parallel region. You should explicitly let it go parallel by setting environment variable MKL_DYNAMIC=false and allowing nested parallelism by setting OMP_NESTED=true (the lattermight be the default setting). Of course, you may set this by calling respective functions instead of setting environment variables.

Should I call MKL_set_dynamic(false) andomp_set_nested(1) in test()?
How to call status = DftiSetValue (desc_handle, DFTI_NUMBER_OF_USER_THREADS, nThread) ?
Shoudl nThread be be 1 or without calling DftiSetValue() at all ?

And finally please notice that real-to-complex transform of size N requires sligthly more space than double, because the result consists of N/2+1 complex numbers, that is size of arrays should be at least 2*(size/2+1) in your case.

Thanks
Dima

Dmitry_B_Intel · ‎12-13-2009

Hi,

Complex-to-complex 1D FFT of size > 2^16 (or smaller if it is 2-power) are threaded.

If you are going to share descriptor between K threads (K=3 in your example), then it should be done like this:

MKL_set_dynamic(0);
omp_set_nested(1);
...
DftiSetValue(hand, DFTI_NUMBER_OF_USER_THREADS, K);
DftiCommitDescriptor(hand);
#pragma omp parallel
{
...
work(..., hand); // K calls in the parallel team
}

Thanks
Dima