Solved: OPENMP and MKL FFT: strange behavior

clodxp · ‎05-10-2010

Hi all!
I've put together an example that shows a VERY STRANGE BEHAVIOR: an FFT inside an OpenMP loop, run with OMP_NUM_THREADS=1, appears to run about 10 times faster than the serial version!!

The code is essentially a loop in which an FFT is performed. I would like each thread to compute its share of the M FFTs.

CYCLE SERIAL VERSION -------------------------

do xx=1,M

!initialize data to be transformed
data_fft=xx+imag*xx*2.

! Perform FFT
Status=DftiComputeForward(Desc_Handle,data_fft)

!perform sum of the elements
sum_vect(xx)=sum(data_fft)

end do
!----------------------------------------------------------

I've parallelized this loop in two versions: a correct one and an incorrect one.

In the first one, the correct one (see FFT_2D_openmp.f90 attached), the loop is parallelized as follows:

CYCLE PARALLEL VERSION -----------------------------
!$omp parallel
!$omp do private(data_fft) schedule(static,num_it_schedule)
do xx=1,M

!initialize data to be transformed
data_fft=xx+imag*xx*2.

! Perform FFT
Status=DftiComputeForward(Desc_Handle,data_fft)

!perform sum of the elements
sum_vect(xx)=sum(data_fft)

end do
!$omp end do
!$omp end parallel
--------------------------------------------------------------

For this to work correctly, DFTI_NUMBER_OF_USER_THREADS must be set to the number of running threads.

OK! I arrived at the correct version after the wrong one (see FFT_2D_openmp_wrong.f90), which shows the strange behavior.
In the wrong version I made a mistake: I also declared the status and the handle of the FFT as private in the loop:

CYCLE PARALLEL VERSION (WRONG!!) -----------------------------------------------------
!$omp parallel
!$omp do private(data_fft,Status,Desc_Handle) schedule(static,num_it_schedule)
do xx=1,M

!initialize data to be transformed
data_fft=xx+imag*xx*2.

! Perform FFT
Status=DftiComputeForward(Desc_Handle,data_fft)

!perform sum of the elements
sum_vect(xx)=sum(data_fft)

end do
!$omp end do
!$omp end parallel
---------------------------------------------------------------------------------------------------

Unfortunately, the result (the sum of the elements of sum_vect) is correct when compared with the result of the serial version, and the run time is about 10 times lower!!!

This is the execution of FFT_2D_openmp_wrong.f90 on my machine (Mac Pro 8-core).

---------- S t a r t
+ Matrix size Nx,Ny = 300.0000 300.0000
+ Cycle over M = 100.0000
--> SERIAL
+ Serial execution time = 0.120011000006343
+ Serial result (sum) = (4.5450000E+08,9.0900000E+08)
--> PARALLEL
+ Number of threads = 1
+ Parallel execution time = 9.684999997261912E-003
+ Parallel result (sum) = (4.5450000E+08,9.0900000E+08)
--> SPEEDUP (ideal = nthread) = 12.3914300506218
--> EFFICIENCY (ideal =1) = 12.3914300506218
---------- S t o p

The parallel execution time is about 12 times lower than the serial one, whereas FFT_2D_openmp.f90 behaves as expected.

Can someone explain this??!
And can someone please confirm the correct use of the MKL FFT for my needs??

Thanks

Clodxp

Evgueni_P_Intel · ‎05-11-2010

Hi Clodxp,

The initial value of a private variable in an OpenMP section is undefined (FORTRAN OpenMP 2.0 http://www.openmp.org/mp-documents/fspec20.pdf, p. 35; OpenMP 3.0 http://www.openmp.org/mp-documents/spec30.pdf, p. 90).
Hence each call to DftiComputeForward in the parallel part of FFT_2D_openmp_wrong.f90 returns DFTI_BAD_DESCRIPTOR and doesn't change the input data.
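
A minimal sketch of how such a failure could be made visible, assuming a layout similar to the attached files (which are not reproduced here): the descriptor is created and committed once, left shared, and every Status is checked. The program name, the n_failed counter, the 1-D storage of the 300x300 data and the non-constant test signal are illustrative assumptions, not taken from FFT_2D_openmp.f90.

```fortran
! Sketch only: names, sizes and the test signal are assumptions for illustration.
program fft_omp_checked
  use MKL_DFTI
  use omp_lib
  implicit none
  integer, parameter :: Nx = 300, Ny = 300, M = 100
  type(DFTI_DESCRIPTOR), pointer :: Desc_Handle
  complex(8) :: data_fft(Nx*Ny), sum_vect(M)
  integer :: Status, xx, k, n_failed

  ! Create and commit ONE descriptor before the parallel region.
  Status = DftiCreateDescriptor(Desc_Handle, DFTI_DOUBLE, DFTI_COMPLEX, 2, (/Nx, Ny/))
  if (Status /= DFTI_NO_ERROR) stop 'DftiCreateDescriptor failed'
  Status = DftiSetValue(Desc_Handle, DFTI_NUMBER_OF_USER_THREADS, omp_get_max_threads())
  if (Status /= DFTI_NO_ERROR) stop 'DftiSetValue failed'
  Status = DftiCommitDescriptor(Desc_Handle)
  if (Status /= DFTI_NO_ERROR) stop 'DftiCommitDescriptor failed'

  n_failed = 0
  ! Desc_Handle is NOT in the private list, so every thread sees the committed
  ! descriptor. Status may be private because it is assigned before it is read.
  !$omp parallel do private(data_fft, Status, k) reduction(+:n_failed)
  do xx = 1, M
     ! Non-constant input, so an untransformed array gives a different sum.
     do k = 1, Nx*Ny
        data_fft(k) = cmplx(xx*k, 2*xx, kind=8)
     end do

     Status = DftiComputeForward(Desc_Handle, data_fft)
     if (Status /= DFTI_NO_ERROR) n_failed = n_failed + 1

     sum_vect(xx) = sum(data_fft)
  end do
  !$omp end parallel do

  if (n_failed > 0) print *, 'DftiComputeForward failed in', n_failed, 'iterations'
  print *, 'sum of sum_vect =', sum(sum_vect)

  Status = DftiFreeDescriptor(Desc_Handle)
end program fft_omp_checked
```

With Desc_Handle added to the private list, the failure count reported at the end would be expected to equal M, rather than the error passing silently. Each thread gets its own copy of data_fft (about 1.4 MB here), so the OpenMP thread stack size may need to be raised for larger transforms.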

Ironically, we can't catch this error by checking, as in FFT_2D_openmp_wrong.f90, sums for a constant signal v = (v[1], v[2], ..., v[N]) where v[i] = v[j] = c for all i and j, because FFT(v) = (c*N, 0, 0, ..., 0) in this case.
If DftiComputeForward succeeds, then v is replaced with FFT(v) and we get sum(FFT(v)) = c*N.
If DftiComputeForward fails, then v isn't changed and we get sum(v) = c*N.
Hence you see the same sum in the sequential and "parallel" case...
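
The coincidence can be reproduced without MKL. The following is a small sketch that uses a naive O(N^2) DFT written only for this illustration (the function dft, the length N = 8 and the test signals are assumptions, not part of the attached files):

```fortran
! Sketch only: a naive DFT stands in for DftiComputeForward.
program sum_check_demo
  implicit none
  integer, parameter :: N = 8
  complex(8) :: v(N), w(N)
  integer :: j

  ! Constant signal, as in data_fft = xx + imag*xx*2.
  v = (1.0d0, 2.0d0)
  print *, 'constant:     sum(v)      =', sum(v)
  print *, 'constant:     sum(DFT(v)) =', sum(dft(v))

  ! Non-constant signal: the two sums differ, so a skipped FFT would show up.
  do j = 1, N
     w(j) = cmplx(j, 2*j, kind=8)
  end do
  print *, 'non-constant: sum(w)      =', sum(w)
  print *, 'non-constant: sum(DFT(w)) =', sum(dft(w))

contains

  ! Naive forward DFT: y(k) = sum over j of x(j)*exp(-2*pi*i*(j-1)*(k-1)/len_x).
  function dft(x) result(y)
    complex(8), intent(in) :: x(:)
    complex(8) :: y(size(x))
    real(8), parameter :: pi = 3.141592653589793d0
    integer :: jj, kk, len_x

    len_x = size(x)
    do kk = 1, len_x
       y(kk) = (0.0d0, 0.0d0)
       do jj = 1, len_x
          y(kk) = y(kk) + x(jj) * exp(cmplx(0.0d0, -2.0d0*pi*(jj-1)*(kk-1)/len_x, kind=8))
       end do
    end do
  end function dft

end program sum_check_demo
```

In general the sum of the transformed values equals N times the first input element, while the sum of the untransformed values equals the first output element; the two agree for a constant signal, which is why the sum comparison in the test program cannot tell a failed FFT from a successful one.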

You may find the following Knowledge Base article about parallelization of (2D) FFTs useful: http://software.intel.com/en-us/articles/different-parallelization-techniques-and-intel-mkl-fft/
Given only FFT_2D_openmp_wrong.f90 and FFT_2D_openmp.f90, it's hard to tell what your needs are and what the correct use of MKL FFT would be for you.

clodxp · ‎05-11-2010

Thank you very much!!!
Private variables are not initialized!! I always forget about it!

Clodxp