parallelization case N°4

MooN_K_ · ‎10-08-2013

Hello MKL professionals

I'm working on parallelizing a 2048-point 1D MKL FFT across 4 threads, so that each thread computes a 512-point FFT. The good news is that the call runs inside the parallel region, but DftiComputeForward produces wrong results. How can I recombine the output data of the 4 x 512 independent FFTs?

#pragma omp parallel num_threads(nThread)
{
    MKL_LONG status;
    int myID = omp_get_thread_num ();
    printf("Thread's ID = %d\n", myID);
    status = DftiComputeForward( my_desc1_handle, &array11[myID*len]); // but the results in array11 are wrong and sometimes I get zeros (loss of FFT synchronisation)
}

//Memory check, output values are not as expected
status1 = DftiFreeDescriptor(&my_desc1_handle);
}

Thank you for answering

Ying_H_Intel · ‎10-08-2013

Hello,

My first thought is that the problem may be my_desc1_handle. It shouldn't be a shared variable for the 4 external OpenMP threads.

On the other hand, what is your data type? You may already know that most FFT functions in MKL are threaded (please check the MKL User's Guide), so it may not be necessary to parallelize them yourself.

In particular, computation of multiple transforms in one call (number of transforms > 1) is threaded, so you can do the 4 x 512 FFTs in one call.

Best Regards,

Ying

MooN_K_ · ‎10-08-2013

Hello Ying

I am trying to run example 4 of the different parallelization techniques. The call "status = DftiComputeForward( my_desc1_handle, &array11[myID*len])" works on shared memory, but the result is still wrong.

The input data is a sinusoidal signal with N = 2048 samples, [0.0004882813, -0.0004882813].

So the Fourier transform of this sinusoid should be a Dirac pulse located at N/2. But all I get are zeros (-0.0000000000) or the input signal unchanged.

Here's the code:

#include "mkl_dfti.h"
#include "omp.h"

void main ()
{
    MKL_Complex16 x[2048];
    MKL_LONG status;
    DFTI_DESCRIPTOR_HANDLE desc_handle;
    int nThread = omp_get_max_threads ();
    MKL_LONG len = 512;

    init(x, 2048, 1028); // sinusoid input init

    status = DftiCreateDescriptor (&desc_handle, DFTI_SINGLE, DFTI_COMPLEX, 1, len);
    status = DftiSetValue (desc_handle, DFTI_NUMBER_OF_USER_THREADS, nThread);
    status = DftiCommitDescriptor (desc_handle);

    // each thread calculates an FFT of 512
    #pragma omp parallel num_threads(nThread)
    {
        MKL_LONG myStatus;
        int myID = omp_get_thread_num ();
        myStatus = DftiComputeForward (desc_handle, &x [myID * len]); // myID is a number from 0 to 3 related to the thread ID
        // x output is the same as the input (no conversion) and there is no Dirac pulse at N/2
    }

    status = DftiFreeDescriptor (&desc_handle);
}

Following this example provided by Intel, I tested it with a sinusoid as input, and I need to verify the Dirac pulse at the N/2 point of the output.

Thanks for the help

Ying_H_Intel · ‎10-09-2013

Hi MooN,

Could you please attach the code as a C file, including the init() code?

I'm not sure which example you are looking at. There seem to be some errors: for instance, your data type is MKL_Complex16, but DftiCreateDescriptor is called with DFTI_SINGLE (it should be DFTI_DOUBLE).

Here is another similar discussion about the output: http://software.intel.com/en-us/forums/topic/402439.

Another issue is that, mathematically, a 2048-point FFT is not equal to 4 x 512-point FFTs, so this approach may not work. Also, the 2048-point c2c FFT is threaded internally, so the manual parallelization may not be needed either.

Best Regards,

Ying
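To make the "not equal" remark concrete, here is a short derivation. It is only a sketch: it assumes the unscaled forward DFT, and the symbols X, Y_b, W_N and M are introduced here for illustration and do not come from the posts. If the length-N input is cut into four contiguous blocks of length M = N/4 and each block is transformed separately, the block transforms determine only every fourth bin of the full transform:

```latex
\begin{align*}
X[k]   &= \sum_{n=0}^{N-1} x[n]\, W_N^{nk}, \qquad W_N = e^{-2\pi i/N}, \qquad N = 4M \\
Y_b[m] &= \sum_{n=0}^{M-1} x[n + bM]\, W_M^{nm}, \qquad b = 0, 1, 2, 3 \\
X[4m]  &= \sum_{b=0}^{3} \sum_{n=0}^{M-1} x[n + bM]\, W_N^{4m(n + bM)}
        = \sum_{b=0}^{3} Y_b[m]
\end{align*}
```

The last step uses W_N^{4mn} = W_M^{mn} and W_N^{4mbM} = 1. The remaining bins X[4m+1], X[4m+2] and X[4m+3] involve twiddle factors inside the inner sum, so they cannot be read off the four block transforms by a simple addition.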

Ying_H_Intel · ‎10-09-2013

Hi MooN,

The OpenMP part of your code is fine. The main problem appears to be that you commit the descriptor before setting its values. After moving that code, your program will work. I also fixed some minor problems (fixed code attached). There are four 512-point FFTs, so the Dirac pulses are at position len/2 = 256 of each block: you can see the 0.25 at points 256, 768, 1280 and 1792.

Regarding MKL FFT being threaded internally while also supporting user-defined threads: as you can see, developers who need to parallelize their applications have varied requirements. For example, suppose you have a bunch of arrays and each array needs one FFT. Given multi-core resources, you may want to do these FFTs simultaneously, i.e. 4 FFTs at a time. The 4 techniques are for that.

I need to correct one of my comments. The 2048-point c2c FFT is threaded internally only under some conditions, e.g. on 64-bit but not on 32-bit (please see the related section of the MKL User's Guide). So if you have a bunch of arrays of length 2048 in a 32-bit application, you have good reason to use custom threads.

Best Regards,

Ying

init(array11,2048,1024);
status1 = DftiCreateDescriptor( &my_desc1_handle, DFTI_DOUBLE, DFTI_COMPLEX, 1, len);
int nThread = omp_get_max_threads ();

status1 = DftiSetValue(my_desc1_handle , DFTI_PLACEMENT, DFTI_NOT_INPLACE);
status1 = DftiSetValue (my_desc1_handle, DFTI_NUMBER_OF_USER_THREADS, nThread);

status1 = DftiCommitDescriptor( my_desc1_handle );

Intel® Math Kernel Library 11.1 User's Guide

Threaded Functions and Problems

The following Intel MKL function domains are threaded:

Direct sparse solver.
LAPACK. For the list of threaded routines, see Threaded LAPACK Routines.
Level1 and Level2 BLAS. For the list of threaded routines, see Threaded BLAS Level1 and Level2 Routines.
All Level 3 BLAS and all Sparse BLAS routines except Level 2 Sparse Triangular solvers.
All mathematical VML functions.
FFT. For the list of FFT transforms that can be threaded, see Threaded FFT Problems.

One-dimensional (1D) transforms

1D transforms are threaded in many cases.

1D complex-to-complex (c2c) transforms of size N using interleaved complex data layout are threaded under the following conditions depending on the architecture:

Architecture: Conditions

Intel® 64: N is a power of 2, log₂(N) > 9, the transform is double-precision out-of-place, and input/output strides equal 1.

IA-32: N is a power of 2, log₂(N) > 13, and the transform is single-precision; or N is a power of 2, log₂(N) > 14, and the transform is double-precision.

Any: N is composite, log₂(N) > 16, and input/output strides equal 1.

1D complex-to-complex transforms using split-complex layout are not threaded.

MooN_K_ · ‎10-09-2013

Thank you for your help, Ying, it works.

I'm still having trouble working in the parallel region with OpenMP: the thread IDs are printed in a random order. For example, with 4 threads I get Thread's ID = 1, Thread's ID = 3, Thread's ID = 2, Thread's ID = 0 in sequence, and another run gives a different order. How can I get the IDs in the order 0, 1, 2, 3?

Here is the code:

int nThread = omp_get_max_threads ();

#pragma omp parallel num_threads(nThread)
{
int myID=omp_get_thread_num ();

printf("Thread's ID %d \n", myID);

}

Thanks

Ying_H_Intel · ‎10-10-2013

Ha ha, that is exactly the "trouble" discussed in http://software.intel.com/en-us/forums/topic/475357.

In most cases, out-of-order thread execution is a natural feature, or even an "advantage", of a multi-threaded application. The execution order of the threads is scheduled by the OS based on current system resources. What we can do is assign the correct task to each thread, for example:

int myID = omp_get_thread_num ();
status = DftiComputeForward( my_desc1_handle, &array11[myID*len],&array12[myID*len]);

Thus, whatever the order, you will get the wanted result in the result array.

Best Regards,

Ying
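The point about assigning the right task to each thread can be illustrated without MKL. The sketch below is only an assumption-laden stand-in: process_block is a hypothetical placeholder for the per-block FFT call, NUM_BLOCKS and BLOCK_LEN are made-up sizes, and it loops over a block index with "omp parallel for" instead of using omp_get_thread_num(), so it also behaves correctly if fewer threads are granted or OpenMP is disabled.

```c
#define NUM_BLOCKS 4
#define BLOCK_LEN  8

/* Hypothetical stand-in for the per-block work (the FFT call in the thread).
   It tags every output sample with the block it belongs to. */
static void process_block(const double *in, double *out, int len, int block_id)
{
    for (int i = 0; i < len; ++i)
        out[i] = in[i] + 100.0 * block_id;
}

/* Blocks may be processed in any order and by any thread. The block index,
   not the execution order, decides where each result is stored, so the
   contents of out are the same on every run. */
void process_all(const double *in, double *out)
{
    #pragma omp parallel for
    for (int b = 0; b < NUM_BLOCKS; ++b)
        process_block(&in[b * BLOCK_LEN], &out[b * BLOCK_LEN], BLOCK_LEN, b);
}
```

Only the order of side effects such as printf varies between runs. If ordered printing is really needed, one option is to print from the main thread after the parallel region has finished.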

MooN_K_ · ‎10-10-2013

Hello Ying

If the MKL FFT is already threaded, then why did Intel propose the 4 techniques to parallelize it? For my project, the 4th case will work properly with my algorithm, but I still need a way to verify the output of this method.

MooN_K_ · ‎12-05-2013

Hello Ying, thank you for your help.

After dividing the initial data into 4 parts and performing the 4 parallel FFTs on them, is there a way to recombine these 4 FFTs to get the same result as a single FFT of the initial data? That's what is missing before I jump into the implementation.

Any help would be appreciated :)

Thanks

Moon

Ying_H_Intel · ‎12-08-2013

Hello MooN,

Do you want to do a batch of 2048-point 1D FFTs, or only one 2048-point 1D FFT?

As we discussed last time, parallel case 4 mainly focuses on doing multiple 2048-point 1D FFTs in parallel. If you really need to do a single 2048-point 1D FFT using 4 threads (though it may not bring a performance benefit, which is why MKL hasn't threaded it internally), then, since a 2048-point FFT is not simply equal to 4 x 512-point FFTs, you would need to work out the whole process mathematically (for example with the butterfly algorithm) and then use the corresponding MKL functions to complete it (e.g. reorganize the results, then do FFTs again). MKL does not appear to provide a function that recombines them directly.

Best Regards,

Ying