issue of 2d fft with openmp

Wang__Xin · ‎05-01-2018

Hi, I'm having trouble with 2d fft using intel 2018 compiler. I tracked it down and reproduce the behavior in this simple code. The sample code applies forward and backward 2d fft, and is supposed to bring back the original value. But if I do

setenv MKL_DYNAMIC false

setenv OMP_NUM_THREADS 10

and run the code, the result is wrong.

Here is my sample code. Any insight is appreciated. Thank you.

ifort version: 18.0.2 20180210

MKL version: ics-2018.update.2/compilers_and_libraries_2018.2.199

Xin

Program fft2dcc

! to compile the code, do the following:

! ifort -c -DMKLI8 -qopenmp -I${MKLROOT}/include/intel64/ilp64 -O3 -o fft2dcc.o fft2dcc.F90

! ifort -DMKLI8 -static-intel -qopenmp -O3 -o fft2dcc fft2dcc.o -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a -Wl,--end-group

use MKL_DFTI

use omp_lib

implicit none

integer :: nx, ny, nw, nfreq, ier, iw, iy, ix

complex, allocatable, dimension(:,:,:) :: wave

integer :: nthreads

type(DFTI_DESCRIPTOR), pointer :: plan

real :: fscale, bscale

nx = 640

ny = 640

nw = 224

nthreads = omp_get_num_threads()

fscale = 1.0

bscale = 1.0/float(nx*ny)

ier = ccfft2d_plan(nx,ny,nx,fscale,bscale,plan,nthreads)

allocate(wave(nx,ny,nw))

!$omp parallel

!$omp master

print *, omp_get_num_threads()

!$omp end master

!$omp end parallel

! first touch memory

!$omp parallel do default(shared), private(ix,iy,iw)

do iw = 1, nw

do iy = 1, ny

do ix = 1, nx

wave(ix,iy,iw) = cmplx(0.0,0.0)

enddo

! inject source

wave(nx/2,ny/2,iw) = cmplx(1.0,0.0)

enddo

print *

print *, "Before", wave(318:322,320,1)

print *

!$OMP PARALLEL DO DEFAULT(SHARED), PRIVATE(iw)

do iw=1,nw

call ccfft2d_exe(1,wave(1,1,iw),plan)

call ccfft2d_exe(-1,wave(1,1,iw),plan)

enddo

print *, "After", wave(318:322,320,1)

deallocate(wave)

contains

integer function ccfft2d_plan &

( n1, n2, ldn1, fscale, bscale, plan, nthreads )

use MKL_DFTI

implicit none

integer, intent(in), optional :: nthreads

integer, intent(in) :: n1, n2, ldn1

real, intent(in) :: fscale, bscale

type(DFTI_DESCRIPTOR), pointer :: plan

integer :: ier

#ifdef MKLI8

integer(kind=8) :: dim, nsize(2), strides(3), nthreads_in, status

#else

integer(kind=4) :: dim, nsize(2), strides(3), nthreads_in, status

#endif

ier = 0

!--- add some error checking for sizes

dim = 2

nsize(1) = n1

nsize(2) = n2

strides(1) = 0

strides(2) = 1

strides(3) = ldn1

status = DftiCreateDescriptor(plan,DFTI_SINGLE,DFTI_COMPLEX,dim,nsize(1:dim))

if (.NOT. DftiErrorClass(Status, DFTI_NO_ERROR)) then

print *, 'error 1'

call DftiStatusPrint(Status)

ier = -1

return

endif

nthreads_in = 1

if (present(nthreads)) nthreads_in = nthreads

status = DftiSetValue(plan,DFTI_NUMBER_OF_USER_THREADS,nthreads_in)

status = DftiSetValue(plan,DFTI_INPUT_STRIDES,strides)

!--- this corresponding to isign < 0 usage at BP

! bscale= 1.0/float(n1*n2)

status = DftiSetValue(plan,DFTI_FORWARD_SCALE,bscale)

!--- this corresponding to isign > 0 usage at BP

! fscale= 1.0

status = DftiSetValue(plan,DFTI_BACKWARD_SCALE,fscale)

status = DftiCommitDescriptor(plan)

if (status /= 0) print *, 'DftiCommitDescriptor returned ', status

ccfft2d_plan = ier

end function ccfft2d_plan

subroutine ccfft2d_exe( isign, cdata, plan )

use MKL_DFTI

implicit none

integer, intent(in) :: isign

complex, intent(inout) :: cdata(*)

type(DFTI_DESCRIPTOR), pointer :: plan

#ifdef MKLI8

integer(kind=8) :: status

#else

integer(kind=4) :: status

#endif

if ( isign < 0 ) then

status = DftiComputeForward( plan, cdata )

else

status = DftiComputeBackward( plan, cdata )

endif

end subroutine ccfft2d_exe

subroutine DftiStatusPrint(status)

use MKL_DFTI

implicit none

#ifdef MKLI8

integer(kind=8) :: status

#else

integer(kind=4) :: status

#endif

character(DFTI_MAX_MESSAGE_LENGTH) :: error_message

error_message = DftiErrorMessage(status)

print *, 'Error message: ', error_message

print *, 'Error status = ', status

end subroutine DftiStatusPrint

end Program

vahid_a_ · ‎05-01-2018

I tried to compile your code but I received compilation errors with the first line of your proposed compilation. Other people may be able to help you. Here is your organized code for better interpretation:

Program fft2dcc

    ! to compile the code, do the following:
    ! ifort -c -DMKLI8 -qopenmp -I${MKLROOT}/include/intel64/ilp64 -O3 -o fft2dcc.o fft2dcc.F90
    ! ifort -DMKLI8 -static-intel -qopenmp -O3 -o fft2dcc fft2dcc.o  -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a -Wl,--end-group

    use MKL_DFTI
    use omp_lib

    implicit none

    integer :: nx, ny, nw, nfreq, ier, iw, iy, ix
    complex, allocatable, dimension(:,:,:) :: wave
    integer :: nthreads
    type(DFTI_DESCRIPTOR), pointer :: plan
    real :: fscale, bscale

    nx = 640
    ny = 640
    nw = 224

    nthreads = omp_get_num_threads()
    fscale = 1.0
    bscale = 1.0/float(nx*ny)
    ier = ccfft2d_plan(nx,ny,nx,fscale,bscale,plan,nthreads)

    allocate(wave(nx,ny,nw))

    !$omp parallel
    !$omp master

    print *, omp_get_num_threads()

    !$omp end master
    !$omp end parallel

    ! first touch memory
    !$omp parallel do default(shared), private(ix,iy,iw)

    do iw = 1, nw
        do iy = 1, ny
            do ix = 1, nx
                wave(ix,iy,iw) = cmplx(0.0,0.0)
            enddo
        enddo
              ! inject source
        wave(nx/2,ny/2,iw) = cmplx(1.0,0.0)
    enddo

 

    print *
    print *, "Before", wave(318:322,320,1)
    print *

    !$OMP PARALLEL DO DEFAULT(SHARED), PRIVATE(iw)

    do iw=1,nw
        call ccfft2d_exe(1,wave(1,1,iw),plan)
        call ccfft2d_exe(-1,wave(1,1,iw),plan)
    enddo

    print *, "After", wave(318:322,320,1)

    deallocate(wave)

contains

    integer function ccfft2d_plan &
    ( n1, n2, ldn1, fscale, bscale, plan, nthreads )

        use MKL_DFTI
        implicit none

        integer, intent(in), optional :: nthreads
        integer, intent(in)  :: n1, n2, ldn1
        real,    intent(in)  :: fscale, bscale
        type(DFTI_DESCRIPTOR), pointer :: plan

        integer :: ier

#ifdef MKLI8

        integer(kind=8) :: dim, nsize(2), strides(3), nthreads_in, status

#else

        integer(kind=4) :: dim, nsize(2), strides(3), nthreads_in, status

#endif

 

        ier = 0
        !--- add some error checking for sizes
        dim = 2

        nsize(1) = n1
        nsize(2) = n2
        strides(1) = 0
        strides(2) = 1
        strides(3) = ldn1

        status = DftiCreateDescriptor(plan,DFTI_SINGLE,DFTI_COMPLEX,dim,nsize(1:dim))

        if (.NOT. DftiErrorClass(Status, DFTI_NO_ERROR)) then

            print *, 'error 1'
            call DftiStatusPrint(Status)
            ier = -1
            return

        endif


        nthreads_in = 1
        if (present(nthreads)) nthreads_in = nthreads
        status = DftiSetValue(plan,DFTI_NUMBER_OF_USER_THREADS,nthreads_in)
        status = DftiSetValue(plan,DFTI_INPUT_STRIDES,strides)

        !--- this corresponding to isign < 0 usage at BP
        ! bscale= 1.0/float(n1*n2)
        status = DftiSetValue(plan,DFTI_FORWARD_SCALE,bscale)
        !--- this corresponding to isign > 0 usage at BP
        ! fscale= 1.0
        status = DftiSetValue(plan,DFTI_BACKWARD_SCALE,fscale)
        status = DftiCommitDescriptor(plan)

        if (status /= 0) print *, 'DftiCommitDescriptor returned ', status

        ccfft2d_plan = ier

    end function ccfft2d_plan
 
    subroutine ccfft2d_exe( isign, cdata, plan )

        use MKL_DFTI
        implicit none
        integer, intent(in)  :: isign
        complex, intent(inout) :: cdata(*)
        type(DFTI_DESCRIPTOR), pointer :: plan


#ifdef MKLI8

        integer(kind=8) :: status

#else

        integer(kind=4) :: status

#endif

        if ( isign < 0 ) then
            status = DftiComputeForward( plan, cdata )
        else
            status = DftiComputeBackward( plan, cdata )
        endif

    end subroutine ccfft2d_exe

    subroutine DftiStatusPrint(status)

        use MKL_DFTI
        implicit none

#ifdef MKLI8

        integer(kind=8) :: status

#else

        integer(kind=4) :: status

#endif

        character(DFTI_MAX_MESSAGE_LENGTH) :: error_message

 

        error_message = DftiErrorMessage(status)
        print *, 'Error message: ', error_message
        print *, 'Error status = ', status

    end subroutine DftiStatusPrint

end Program

Ying_H_Intel · ‎05-02-2018

Hi Xin, Vahid,
Thank you for the reports.

I try to build the code and attach some trick at the end of message. Basically i can reproduce the problem Xin reported. .
It looks MKL_DYNAMIC=false or true problems,

Some Intel MKL routines may use fewer OpenMP threads than suggested by the threading controls if
either the underlying algorithms do not support the suggested number of OpenMP threads or the
routines perform better with fewer OpenMP threads because of lower OpenMP overhead and/or better
data locality. Set the MKL_DYNAMIC environment variable to FALSE or call mkl_set_dynamic(0) to
use the suggested number of OpenMP threads whenever the algorithms permit and regardless of
OpenMP overhead and data locality.

But as we discussed in https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/759099

The plan is not thead safe, it may need private for each thread (include openMP thread).

Best Regards,
Ying

Ying_H_Intel · ‎05-02-2018

FYI, Here is my compile command:

source /opt/intel/compilers_and_libraries_2018.2.199/linux/bin/compilervars.sh intel64

ifort -module ./ -i8 -fpp -warn all -warn errors -qopenmp -c ${MKLROOT}/include/mkl_dfti.f90 -o mkl_dfti.o

ifort -DMKLI8 -static-intel -qopenmp -i8 -O3 -o fft2dcc FFT2dcc2.f90 -module ./ -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a -Wl,--end-group

Wang__Xin · ‎05-02-2018

Ying, thanks for the reply.

The MKL manual illustrates four ways to parallelize FFT application. The sample code I provided mimics one of the techniques. You can find the example in the manual by searching 'Using Parallel Mode with a Common Descriptor'

Ying_H_Intel · ‎05-03-2018

Hi Xin,

Right, the example 'Using Parallel Mode with a Common Descriptor' are supposed for that purpose. originally , i had supposed possible out-sync in forward and Backforward.

do iw=1,nw
        call ccfft2d_exe(1,wave(1,1,iw),plan)
        call ccfft2d_exe(-1,wave(1,1,iw),plan)
    enddo
further test to separate them

!$OMP PARALLEL DO DEFAULT(SHARED), PRIVATE(iw)
    do iw=1,nw
        call ccfft2d_exe(1,wave(1,1,iw),plan)

    enddo

!   !$OMP PARALLEL DO DEFAULT(SHARED), PRIVATE(iw)
    do iw=1,nw
call ccfft2d_exe(-1,wave(1,1,iw),plan)
     enddo
But the issue is still existing., if wit MKL_DYNAMIC=true, everything looks fine, if with MKL_DYNAMIC=false, then wrong result.

So the problem should not here , we will look into details about the implementation.

Best Regards,
Ying

Ying_H_Intel · ‎10-08-2018

Hi Xin,

Our developer team should fix the problem in MKL 2019 version. which is available in IRC or YUM, APT now.

Could you please to try it and let us know if any feedback.

Thanks

Ying

P.S How to get Intel MKL etc:

https://software.intel.com/en-us/articles/how-to-get-intel-mklippdaal