I've put together an example that shows VERY STRANGE BEHAVIOR: an FFT inside an OpenMP loop, with OMP_NUM_THREADS=1, seems to run about 10 times faster than the serial version!!

**The code essentially consists of a loop in which an FFT is performed. I would like each thread to compute a share of the M FFTs.**

*CYCLE SERIAL VERSION -------------------------*

do xx=1,M

!initialize data to be transformed

data_fft=xx+imag*xx*2.

! Perform FFT

Status=DftiComputeForward(Desc_Handle,data_fft)

!perform sum of the elements

sum_vect(xx)=sum(data_fft)

end do

!----------------------------------------------------------

I've parallelized this loop in two versions: a correct one and an incorrect one.

In the first, correct one (see FFT_2D_openmp.f90 attached), the loop is parallelized as follows:

*CYCLE PARALLEL VERSION -----------------------------*

!$omp parallel

!$omp do private(data_fft) schedule(static,num_it_schedule)

do xx=1,M

!initialize data to be transformed

data_fft=xx+imag*xx*2.

! Perform FFT

Status=DftiComputeForward(Desc_Handle,data_fft)

!perform sum of the elements

sum_vect(xx)=sum(data_fft)

end do

!$omp end do

!$omp end parallel

--------------------------------------------------------------

For this to work correctly, DFTI_NUMBER_OF_USER_THREADS must be set to the number of running threads.

OK! The correct version was obtained after the wrong one (see FFT_2D_openmp_wrong.f90), which shows the strange behavior.

In the wrong version I made an error: I also declared the status and the handle of the FFT as private in the loop:

*CYCLE PARALLEL VERSION (WRONG!!) -----------------------------------------------------*

!$omp parallel

!$omp do private(data_fft,Status,Desc_Handle) schedule(static,num_it_schedule)

do xx=1,M

!initialize data to be transformed

data_fft=xx+imag*xx*2.

! Perform FFT

Status=DftiComputeForward(Desc_Handle,data_fft)

!perform sum of the elements

sum_vect(xx)=sum(data_fft)

end do

!$omp end do

!$omp end parallel

---------------------------------------------------------------------------------------------------

Unfortunately, the result (the sum of the elements of sum_vect) is correct (compared with the result of the serial version) and the time is about 10 times lower!!!

This is the execution of FFT_2D_openmp_wrong.f90 on my machine (Mac Pro 8-core).

*---------- S t a r t*

+ Matrix size Nx,Ny = 300.0000 300.0000

+ Cycle over M = 100.0000

--> SERIAL

+ Serial execution time = 0.120011000006343

+ Serial result (sum) = (4.5450000E+08,9.0900000E+08)

--> PARALLEL

+ Number of threads = 1

+ Parallel execution time = 9.684999997261912E-003

+ Parallel result (sum) = (4.5450000E+08,9.0900000E+08)

--> SPEEDUP (ideal = nthread) = 12.3914300506218

--> EFFICIENCY (ideal =1) = 12.3914300506218

---------- S t o p

The parallel execution time is about 12 times lower than the serial one, whereas FFT_2D_openmp.f90 behaves as expected.

Can someone explain this??!

And could someone please confirm that I am using the MKL FFT correctly for my needs??

Thanks

Clodxp

Hi Clodxp,

The initial value of a private variable in an OpenMP section is undefined (FORTRAN OpenMP 2.0 http://www.openmp.org/mp-documents/fspec20.pdf, p. 35; OpenMP 3.0 http://www.openmp.org/mp-documents/spec30.pdf, p. 90).

Hence each call to DftiComputeForward in the parallel part of FFT_2D_openmp_wrong.f90 returns DFTI_BAD_DESCRIPTOR and doesn't change the input data.

Ironically, we can't catch this error by checking sums, as FFT_2D_openmp_wrong.f90 does, for a *constant* signal v = (v[1], v[2], ..., v[N]) with v[i] = v[j] = c for all i and j, because FFT(v) = (c*N, 0, 0, ..., 0) in this case.
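To spell out why the transform of a constant signal collapses to a single nonzero entry, here is the 1D case, indexing from 0 for convenience and assuming the usual unnormalized forward DFT (a forward scale of 1, which is the MKL default). The 2D case factorizes the same way with N = Nx*Ny.

$$
\hat v_m \;=\; \sum_{k=0}^{N-1} c\, e^{-2\pi i\, k m / N}
\;=\;
\begin{cases}
cN, & m = 0,\\
0, & m \neq 0,
\end{cases}
\qquad\text{so}\qquad
\sum_{m=0}^{N-1} \hat v_m \;=\; cN \;=\; \sum_{k=0}^{N-1} v_k .
$$

For a general signal the same geometric-sum argument gives $\sum_m \hat v_m = N v_0$, which usually differs from $\sum_k v_k$, so the coincidence is specific to constant data.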

If DftiComputeForward succeeds, then v is replaced with FFT(v) and we get sum(FFT(v)) = c*N.

If DftiComputeForward fails, then v isn't changed and we get sum(v) = c*N.

Hence you see the same sum in the sequential and "parallel" case, and the "parallel" loop is faster simply because no FFT is actually computed.
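The coincidence is easy to check numerically. The following is only a minimal sketch in pure Python, not MKL code: `dft` and `close` are hypothetical helper names, and the naive O(N^2) transform merely stands in for DftiComputeForward with the default (unscaled) forward transform. It shows that the sum check passes for a constant signal whether or not the transform ran, that it would fail for a non-constant signal, and that the untransformed data alone already reproduce the total printed in the run above.

```python
import cmath

def dft(v):
    """Naive unnormalized forward DFT, O(N^2); fine for a tiny demo."""
    n = len(v)
    return [sum(v[k] * cmath.exp(-2j * cmath.pi * k * m / n) for k in range(n))
            for m in range(n)]

def close(a, b, tol=1e-9):
    """Relative comparison of two complex numbers."""
    return abs(a - b) <= tol * max(1.0, abs(a), abs(b))

n = 8

# Constant signal, like the thread's test data (c = xx + 2i*xx with xx = 3).
const = [3 + 6j] * n
const_same = close(sum(dft(const)), sum(const))   # True: both equal c*n

# Non-constant signal: sum(DFT(v)) = n*v[0], which differs from sum(v).
ramp = [complex(k + 1, 2 * (k + 1)) for k in range(n)]
ramp_same = close(sum(dft(ramp)), sum(ramp))      # False

# Sum of the *untransformed* data for the thread's run:
# Nx*Ny = 300*300 points per matrix, xx = 1..100.
expected_total = sum(300 * 300 * complex(xx, 2 * xx) for xx in range(1, 101))

print(const_same, ramp_same, expected_total)
```

So a more telling self-test would initialize data_fft with non-constant values, or simply check the Status returned by each call.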

You may find the following Knowledge Base article about parallelization of (2D) FFTs useful: http://software.intel.com/en-us/articles/different-parallelization-techniques-and-intel-mkl-fft/

Given only FFT_2D_openmp_wrong.f90 and FFT_2D_openmp.f90, it's hard to tell what your needs are and what the correct use of MKL FFT would be for you.

Private variables are not initialized!! I always forget about it!

Clodxp
