Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Multidimensional DFT and OpenMP

Thomas_D_1
Beginner

I'm working on a program that performs several batches of three 3D (N1 x N2 x N3) DFTs using the MKL DFT routines. I'm running most of the program in parallel using OpenMP, and I'd like to get as much parallel performance as possible from the DFT section as well, since it accounts for a significant portion of the program's runtime. However, when I try to increase the number of threads, I find that the performance improvement plateaus at 3 threads, i.e., the number of transforms in each call. If I instead break up the transform into 3 x N1 2D transforms, the parallel performance continues to scale beyond 3 threads. This seems like a lot of extra work for performance gains I would expect to be handled internally. Is there a way of directing MKL's DFT to do this on its own?

As it may be relevant: I'm already passing the number of available threads to the DFT via the DFTI_NUMBER_OF_USER_THREADS option of DftiSetValue, and the three 3D transforms are done in a single call by setting the DFTI_NUMBER_OF_TRANSFORMS option. I can provide some pseudo code if that would be helpful.

Evgueni_P_Intel
Employee

Hi Thomas,

Please post the pseudo code here.

Evgueni.

Thomas_D_1
Beginner

I've included the basic code layout below. I've tried running this kind of program with the number of OpenMP threads ranging from 1 to 8 and the runtime stops improving after 3 threads.

program main
  ! defining variables and stuff
  real :: array(2*(Nx/2 + 1),Ny,Nz,3)
  ...

  nThreads = OMP_GET_MAX_THREADS()
  call OMP_SET_NESTED(.TRUE.)

  ! create backward descriptor (complex input, real output)
  status = DftiCreateDescriptor(Bdescr, DFTI_SINGLE, DFTI_REAL, 3, [Nx,Ny,Nz])
  status = DftiSetValue(Bdescr, DFTI_NUMBER_OF_USER_THREADS, nThreads)
  status = DftiSetValue(Bdescr, DFTI_NUMBER_OF_TRANSFORMS, 3)
  status = DftiSetValue(Bdescr, DFTI_INPUT_DISTANCE, (Nx/2 + 1)*Ny*Nz)
  status = DftiSetValue(Bdescr, DFTI_OUTPUT_DISTANCE, 2*(Nx/2 + 1)*Ny*Nz)
  status = DftiSetValue(Bdescr, DFTI_INPUT_STRIDES, [0,1,(Nx/2 + 1),(Nx/2 + 1)*Ny])
  status = DftiSetValue(Bdescr, DFTI_OUTPUT_STRIDES, [0,1,2*(Nx/2 + 1),2*(Nx/2 + 1)*Ny])

  ! create forward descriptor (real input, complex output)
  status = DftiCreateDescriptor(Fdescr, DFTI_SINGLE, DFTI_REAL, 3, [Nx,Ny,Nz])
  status = DftiSetValue(Fdescr, DFTI_NUMBER_OF_USER_THREADS, nThreads)
  status = DftiSetValue(Fdescr, DFTI_NUMBER_OF_TRANSFORMS, 3)
  status = DftiSetValue(Fdescr, DFTI_INPUT_DISTANCE, 2*(Nx/2 + 1)*Ny*Nz)
  status = DftiSetValue(Fdescr, DFTI_OUTPUT_DISTANCE, (Nx/2 + 1)*Ny*Nz)
  status = DftiSetValue(Fdescr, DFTI_INPUT_STRIDES, [0,1,2*(Nx/2 + 1),2*(Nx/2 + 1)*Ny])
  status = DftiSetValue(Fdescr, DFTI_OUTPUT_STRIDES, [0,1,(Nx/2 + 1),(Nx/2 + 1)*Ny])

  ! initialize array
  array = 1.0
  
  ! do many loops of forward and backward DFT
  ! start timer
  do i = 1, nLoops
    status = DftiCommitDescriptor(Fdescr)
    status = DftiComputeForward(Fdescr, array(:,1,1,1))
    status = DftiCommitDescriptor(Bdescr)
    status = DftiComputeBackward(Bdescr, array(:,1,1,1))
  end do
  ! end timer

end program main
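
A side note on querying the thread count, as the pseudo code does when setting nThreads: omp_get_num_threads() returns the size of the current team, which is 1 when it is called outside a parallel region, while omp_get_max_threads() returns the team size the next parallel region would get. The following is a minimal standalone sketch of that difference, not part of the original program. The program and variable names are made up for illustration, and it assumes the compiler's OpenMP flag is enabled (for example -openmp, as in the flags used in this thread).

```fortran
program thread_query
  use omp_lib
  implicit none
  integer :: nOutside, nMax, nInside

  ! Outside a parallel region only the initial thread is running,
  ! so omp_get_num_threads() returns 1 here.
  nOutside = omp_get_num_threads()

  ! Upper bound on the team size of the next parallel region
  ! (absent a num_threads clause).
  nMax = omp_get_max_threads()

  nInside = 0
  !$omp parallel
  !$omp single
  ! One thread of the team records the actual team size.
  nInside = omp_get_num_threads()
  !$omp end single
  !$omp end parallel

  print '(A,I0)', 'omp_get_num_threads() outside parallel region: ', nOutside
  print '(A,I0)', 'omp_get_max_threads():                         ', nMax
  print '(A,I0)', 'omp_get_num_threads() inside parallel region:  ', nInside
end program thread_query
```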

 

Ying_H_Intel
Employee

Hi Thomas,

You mentioned "I'm running most of the program in parallel using OpenMP". As I understand it, that means the DFT would be called from an outer OpenMP thread, i.e. a nested parallel model, but that doesn't appear in the code.

Setting DFTI_NUMBER_OF_TRANSFORMS to 3 is fine, since you are computing three 3D (N1 x N2 x N3) DFTs. That is why your performance results correspond to 3 threads.

However, DFTI_NUMBER_OF_USER_THREADS should not be used with the latest version. From the documentation:

DFTI_NUMBER_OF_USER_THREADS
The DFTI_NUMBER_OF_USER_THREADS configuration parameter is no longer used and kept for compatibility with previous versions of Intel MKL.

Could you please tell us which MKL version you are using, your command line, and your hardware configuration?

Best Regards,

Ying

Thomas_D_1
Beginner

Thanks for the reply. The DFT in my program isn't embedded in a parallel section, but I would like to run it in parallel as it accounts for a significant percentage of the total calculation. The structure of the program looks like this:

program main
  ! initialization stuff
  ...

  do
    ! perform lots of parallel calculations
    ...

    ! perform forward DFT
    call DFTForward(descrF, array(:,1,1,1))
    
    ! perform parallel calculations with DFT of array
    ...

    ! perform backward DFT
    call DFTBackward(descrB, array(:,1,1,1))

    ! perform more parallel calculations
    ...
  end do
end program main

I'm using ifort version 14.0.2 20140120; I'm not sure which version of MKL that comes with. As for hardware, I'm working on a Linux cluster with AMD Opteron 6212 processors, each with 16 CPUs.

Ying_H_Intel
Employee

Hi Thomas, 

Thanks for the clarification. You haven't shown how you compile the code or the exact performance data versus thread count, so I'm not sure I have the full picture, but please try the following.

1. Could you remove the limitation status = DftiSetValue(Bdescr, DFTI_NUMBER_OF_USER_THREADS, nThreads) and try again? Please let us know the result.

2. In addition, there is sample code under the MKL install directory: MKLexample\dftf\source\config_number_of_transforms.f90 and config_number_of_user_threads.f90. You could use these to test the performance of the MKL DFT part on its own, and then consider the parallel part as a whole.

3. Or you could insert OpenMP thread-control calls into the parallel calculations in your code and observe how it behaves with different thread counts:

do
  nProcs = omp_get_num_procs()   ! query the number of processors

  call omp_set_num_threads(8)    ! 8 physical cores, right?

  ! perform lots of parallel calculations
  ...

  ! perform forward DFT
  call omp_set_num_threads(3)
  call DFTForward(descrF, array(:,1,1,1))

  ! perform parallel calculations with DFT of array
  ...

  ! perform backward DFT
  call DFTBackward(descrB, array(:,1,1,1))

  call omp_set_num_threads(8)

  ! perform more parallel calculations
  ...
end do

Please let us know the result. 

Best Regards,

Ying 

Thomas_D_1
Beginner

Since pseudo code doesn't appear to be doing the trick, I boiled my program down to a simple example, listed here:

!=========================================================================================
program main
!=========================================================================================
  use, intrinsic :: iso_fortran_env
  use mkl_dfti
  use omp_lib
  implicit none
  ! local parameters:
  integer, parameter :: IK = 4   ! a kind parameter cannot be used in its own declaration
  integer(IK), parameter :: RK = 8
  integer(IK), parameter :: DFT_CK = 4
  integer(IK), parameter :: DFT_RK = 4
  integer(IK), parameter :: NLOOPS = 100
  integer(IK), parameter :: NX = 100
  integer(IK), parameter :: NY = 40
  integer(IK), parameter :: NZ = 60
  integer(IK), parameter :: MX = 2*NX - 1
  integer(IK), parameter :: MY = 2*NY - 1
  integer(IK), parameter :: MZ = 2*NZ - 1
  ! local variables:
  real(DFT_RK), allocatable :: array(:,:,:,:)
  integer(IK) :: status

  write(6,'("Number of processors = "I0)') OMP_GET_NUM_PROCS()
  write(6,'("Maximum number of available threads = "I0)') OMP_GET_MAX_THREADS()
  write(6,'(A)') ''

  write(6,'("Generating array...")')
  allocate(array(2*(MX/2+1),MY,MZ,3))
  array = 1.0
  write(6,'(A)') ''

  write(6,'("TEST DFT_r2r_3D...")')

  write(6,'("Number of OMP threads = "I0)') 1
  call omp_set_num_threads(1)
  call TEST_DFT_r2r_3D(array, 3, MX, MY, MZ, NLOOPS)
  write(6,'(A)') ''

  write(6,'("Number of OMP threads = "I0)') 2
  call omp_set_num_threads(2)
  call TEST_DFT_r2r_3D(array, 3, MX, MY, MZ, NLOOPS)
  write(6,'(A)') ''

  write(6,'("Number of OMP threads = "I0)') 3
  call omp_set_num_threads(3)
  call TEST_DFT_r2r_3D(array, 3, MX, MY, MZ, NLOOPS)
  write(6,'(A)') ''

  write(6,'("Number of OMP threads = "I0)') 4
  call omp_set_num_threads(4)
  call TEST_DFT_r2r_3D(array, 3, MX, MY, MZ, NLOOPS)
  write(6,'(A)') ''

  write(6,'("Number of OMP threads = "I0)') 5
  call omp_set_num_threads(5)
  call TEST_DFT_r2r_3D(array, 3, MX, MY, MZ, NLOOPS)
  write(6,'(A)') ''

  write(6,'("Number of OMP threads = "I0)') 6
  call omp_set_num_threads(6)
  call TEST_DFT_r2r_3D(array, 3, MX, MY, MZ, NLOOPS)
  write(6,'(A)') ''

  write(6,'("Number of OMP threads = "I0)') 7
  call omp_set_num_threads(7)
  call TEST_DFT_r2r_3D(array, 3, MX, MY, MZ, NLOOPS)
  write(6,'(A)') ''

  write(6,'("Number of OMP threads = "I0)') 8
  call omp_set_num_threads(8)
  call TEST_DFT_r2r_3D(array, 3, MX, MY, MZ, NLOOPS)
  write(6,'(A)') ''

contains

!=========================================================================================
! TEST_DFT_r2r_3D:
!
!   Performs a number of forward and backward DFTs using 3d transforms.
!
  subroutine TEST_DFT_r2r_3D( array, MK, MX, MY, MZ, NLOOPS )
    real(DFT_RK), intent(inout) :: array(:,:,:,:)
    integer(IK) , intent(in)    :: MK, MX, MY, MZ
    integer(IK) , intent(in)    :: NLOOPS
    ! local variables:
    integer(IK) :: cFinish, cStart, cRate, cMax
    type(DFTI_DESCRIPTOR), pointer :: descrB, descrF
    integer(IK) :: i
    integer(IK) :: status
    real(RK)    :: start, finish, time

    ! create backward descriptor
    print *,"Creating DFTI Backward descriptor"
    status = DftiCreateDescriptor(descrB, DFTI_SINGLE, DFTI_REAL, 3, [MX,MY,MZ])
    !status = DftiSetValue(descrB, DFTI_NUMBER_OF_USER_THREADS, nThreads)
    status = DftiSetValue(descrB, DFTI_CONJUGATE_EVEN_STORAGE, DFTI_COMPLEX_COMPLEX)
    status = DftiSetValue(descrB, DFTI_BACKWARD_SCALE, 1.0/(MX*MY*MZ))
    status = DftiSetValue(descrB, DFTI_NUMBER_OF_TRANSFORMS, MK)
    status = DftiSetValue(descrB, DFTI_INPUT_DISTANCE ,   (MX/2 + 1)*MY*MZ)
    status = DftiSetValue(descrB, DFTI_OUTPUT_DISTANCE, 2*(MX/2 + 1)*MY*MZ)
    status = DftiSetValue(descrB, DFTI_INPUT_STRIDES , [0,1,  (MX/2 + 1),  (MX/2 + 1)*MY])
    status = DftiSetValue(descrB, DFTI_OUTPUT_STRIDES, [0,1,2*(MX/2 + 1),2*(MX/2 + 1)*MY])

    ! create forward descriptor
    print *,"Creating DFTI Forward descriptor"
    status = DftiCreateDescriptor(descrF, DFTI_SINGLE, DFTI_REAL, 3, [MX,MY,MZ])
    !status = DftiSetValue(descrF, DFTI_NUMBER_OF_USER_THREADS, nThreads)
    status = DftiSetValue(descrF, DFTI_CONJUGATE_EVEN_STORAGE, DFTI_COMPLEX_COMPLEX)
    status = DftiSetValue(descrF, DFTI_NUMBER_OF_TRANSFORMS, MK)
    status = DftiSetValue(descrF, DFTI_INPUT_DISTANCE , 2*(MX/2 + 1)*MY*MZ)
    status = DftiSetValue(descrF, DFTI_OUTPUT_DISTANCE,   (MX/2 + 1)*MY*MZ)
    status = DftiSetValue(descrF, DFTI_INPUT_STRIDES , [0,1,2*(MX/2 + 1),2*(MX/2 + 1)*MY])
    status = DftiSetValue(descrF, DFTI_OUTPUT_STRIDES, [0,1,  (MX/2 + 1),  (MX/2 + 1)*MY])

    ! perform transforms
    write(6,'("Starting loop...")')
    call system_clock(cStart, cRate, cMax)
    call cpu_time(start)
    do i = 1, NLOOPS
      status = DftiCommitDescriptor(descrF)
      status = DftiComputeForward(descrF, array(:,1,1,1))
      status = DftiCommitDescriptor(descrB)
      status = DftiComputeBackward(descrB, array(:,1,1,1))
    end do
    call system_clock(cFinish, cRate, cMax)
    call cpu_time(finish)

    ! output results
    write(6,'("  CPU_TIME = "ES26.16)') finish - start
    if (cFinish < cStart) cFinish = cFinish + cMax
    write(6,'("  SYSTEM_CLOCK = "ES26.16)') real(cFinish - cStart)/real(cRate)

    ! cleanup
    status = DftiFreeDescriptor(descrB)
    status = DftiFreeDescriptor(descrF)
  end subroutine TEST_DFT_r2r_3D
!=========================================================================================
end program main
!=========================================================================================

I'm compiling with the compiler flags

FFLAGS = -O3 -msse2 -openmp -parallel

Compiling and running this program gives 

Number of processors = 16
Maximum number of available threads = 8

Generating array...

TEST DFT_r2r_3D...
Number of OMP threads = 1
 Creating DFTI Backward descriptor
 Creating DFTI Forward descriptor
Starting loop...
  CPU_TIME =     7.7115277000000006E+01
  SYSTEM_CLOCK =     7.5663902282714844E+01

Number of OMP threads = 2
 Creating DFTI Backward descriptor
 Creating DFTI Forward descriptor
Starting loop...
  CPU_TIME =     7.5902460999999988E+01
  SYSTEM_CLOCK =     3.7973400115966797E+01

Number of OMP threads = 3
 Creating DFTI Backward descriptor
 Creating DFTI Forward descriptor
Starting loop...
  CPU_TIME =     7.5313551000000018E+01
  SYSTEM_CLOCK =     2.5120399475097656E+01

Number of OMP threads = 4
 Creating DFTI Backward descriptor
 Creating DFTI Forward descriptor
Starting loop...
  CPU_TIME =     7.5515520000000009E+01
  SYSTEM_CLOCK =     2.5186500549316406E+01

Number of OMP threads = 5
 Creating DFTI Backward descriptor
 Creating DFTI Forward descriptor
Starting loop...
  CPU_TIME =     7.5352544999999964E+01
  SYSTEM_CLOCK =     2.5134099960327148E+01

Number of OMP threads = 6
 Creating DFTI Backward descriptor
 Creating DFTI Forward descriptor
Starting loop...
  CPU_TIME =     7.5456527999999992E+01
  SYSTEM_CLOCK =     2.5168199539184570E+01

Number of OMP threads = 7
 Creating DFTI Backward descriptor
 Creating DFTI Forward descriptor
Starting loop...
  CPU_TIME =     7.5415534999999977E+01
  SYSTEM_CLOCK =     2.5154600143432617E+01

Number of OMP threads = 8
 Creating DFTI Backward descriptor
 Creating DFTI Forward descriptor
Starting loop...
  CPU_TIME =     7.5428532999999902E+01
  SYSTEM_CLOCK =     2.5156799316406250E+01

So the program runtime plateaus at about 25 seconds after only 3 threads are given to OpenMP.
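
The plateau can be made explicit by turning the wall-clock figures into speedup and parallel efficiency. The following is a small standalone sketch, not part of the test program: it hard-codes the SYSTEM_CLOCK values from the output above, rounded to two decimals, and the program and variable names are chosen for illustration only. The speedup levels off at about 3, matching the number of transforms per call. The CPU_TIME figure staying near 75 seconds in every run is consistent with cpu_time reporting processor time accumulated over all threads, which is an assumption about the runtime rather than something measured here.

```fortran
program speedup_table
  implicit none
  integer, parameter :: NRUNS = 8
  ! SYSTEM_CLOCK values from the runs above, rounded to two decimals,
  ! indexed by the number of OpenMP threads (1..8).
  real, parameter :: wall(NRUNS) = [75.66, 37.97, 25.12, 25.19, &
                                    25.13, 25.17, 25.15, 25.16]
  integer :: n
  real :: speedup, efficiency

  print '(A)', 'threads  wall[s]  speedup  efficiency'
  do n = 1, NRUNS
    speedup    = wall(1) / wall(n)   ! relative to the 1-thread run
    efficiency = speedup / real(n)   ! fraction of ideal linear scaling
    print '(I7,F9.2,F9.2,F12.2)', n, wall(n), speedup, efficiency
  end do
end program speedup_table
```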

Evgueni_P_Intel
Employee

Hi Thomas,

The DFTI_NUMBER_OF_TRANSFORMS setting is tuned for large transposed batches.

To improve scaling in your case, I have changed the descriptors to do 1 transform at a time in a loop in TEST_DFT_r2r_3D.

I have also moved DftiCommitDescriptor out of the loop.

Now it scales.

Evgueni.

 

!=========================================================================================
! TEST_DFT_r2r_3D:
!
!   Performs a number of forward and backward DFTs using 3d transforms.
!
  subroutine TEST_DFT_r2r_3D( array, MK, MX, MY, MZ, NLOOPS )
    real(DFT_RK), intent(inout) :: array(:,:,:,:)
    integer(IK) , intent(in)    :: MK, MX, MY, MZ
    integer(IK) , intent(in)    :: NLOOPS
    ! local variables:
    integer(IK) :: cFinish, cStart, cRate, cMax
    type(DFTI_DESCRIPTOR), pointer :: descrB, descrF
    integer(IK) :: i, k
    integer(IK) :: status
    real(RK)    :: start, finish, time

    ! create backward descriptor
    print *,"Creating DFTI Backward descriptor"
    status = DftiCreateDescriptor(descrB, DFTI_SINGLE, DFTI_REAL, 3, [MX,MY,MZ])
    !status = DftiSetValue(descrB, DFTI_NUMBER_OF_USER_THREADS, nThreads)
    status = DftiSetValue(descrB, DFTI_CONJUGATE_EVEN_STORAGE, DFTI_COMPLEX_COMPLEX)
    status = DftiSetValue(descrB, DFTI_BACKWARD_SCALE, 1.0/(MX*MY*MZ))
    !status = DftiSetValue(descrB, DFTI_NUMBER_OF_TRANSFORMS, MK)
    !status = DftiSetValue(descrB, DFTI_INPUT_DISTANCE ,   (MX/2 + 1)*MY*MZ)
    !status = DftiSetValue(descrB, DFTI_OUTPUT_DISTANCE, 2*(MX/2 + 1)*MY*MZ)
    status = DftiSetValue(descrB, DFTI_INPUT_STRIDES , [0,1,  (MX/2 + 1),  (MX/2 + 1)*MY])
    status = DftiSetValue(descrB, DFTI_OUTPUT_STRIDES, [0,1,2*(MX/2 + 1),2*(MX/2 + 1)*MY])

    ! create forward descriptor
    print *,"Creating DFTI Forward descriptor"
    status = DftiCreateDescriptor(descrF, DFTI_SINGLE, DFTI_REAL, 3, [MX,MY,MZ])
    !status = DftiSetValue(descrF, DFTI_NUMBER_OF_USER_THREADS, nThreads)
    status = DftiSetValue(descrF, DFTI_CONJUGATE_EVEN_STORAGE, DFTI_COMPLEX_COMPLEX)
    !status = DftiSetValue(descrF, DFTI_NUMBER_OF_TRANSFORMS, MK)
    !status = DftiSetValue(descrF, DFTI_INPUT_DISTANCE , 2*(MX/2 + 1)*MY*MZ)
    !status = DftiSetValue(descrF, DFTI_OUTPUT_DISTANCE,   (MX/2 + 1)*MY*MZ)
    status = DftiSetValue(descrF, DFTI_INPUT_STRIDES , [0,1,2*(MX/2 + 1),2*(MX/2 + 1)*MY])
    status = DftiSetValue(descrF, DFTI_OUTPUT_STRIDES, [0,1,  (MX/2 + 1),  (MX/2 + 1)*MY])

    status = DftiCommitDescriptor(descrF)
    status = DftiCommitDescriptor(descrB)

    ! perform transforms
    write(6,'("Starting loop...")')
    call system_clock(cStart, cRate, cMax)
    call cpu_time(start)
    do i = 1, NLOOPS
      do k = 1, MK
        status = DftiComputeForward(descrF, array(:,1,1,k))
        status = DftiComputeBackward(descrB, array(:,1,1,k))
      end do
    end do
    call system_clock(cFinish, cRate, cMax)
    call cpu_time(finish)

    ! output results
    write(6,'("  CPU_TIME = "ES26.16)') finish - start
    if (cFinish < cStart) cFinish = cFinish + cMax
    write(6,'("  SYSTEM_CLOCK = "ES26.16)') real(cFinish - cStart)/real(cRate)

    ! cleanup
    status = DftiFreeDescriptor(descrB)
    status = DftiFreeDescriptor(descrF)
  end subroutine TEST_DFT_r2r_3D
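
For readers following the stride arithmetic, here is a small standalone sketch, not part of the posted solution, that evaluates the stride vectors used above and the position of each data set that array(:,1,1,k) points to. It uses the sizes from the test program (MX = 199, MY = 79, MZ = 119, MK = 3); the program and variable names are made up for illustration. Each row is padded to 2*(MX/2 + 1) real elements so that the MX/2 + 1 complex values of the in-place result fit, and consecutive data sets are one padded 3D block apart, which is the distance the batched version passed explicitly.

```fortran
program padded_layout
  implicit none
  ! Transform sizes from the test program: MX = 2*NX - 1, and so on.
  integer, parameter :: MX = 199, MY = 79, MZ = 119, MK = 3
  integer :: cx, realStrides(4), cplxStrides(4)
  integer :: realDist, cplxDist, k

  cx = MX/2 + 1   ! complex elements per row (integer division)

  ! Stride vectors as passed to DFTI_INPUT_STRIDES / DFTI_OUTPUT_STRIDES:
  ! [offset, stride in x, stride in y, stride in z].
  realStrides = [0, 1, 2*cx, 2*cx*MY]   ! counted in real elements
  cplxStrides = [0, 1, cx, cx*MY]       ! counted in complex elements

  ! Size of one padded 3D data set in each unit.
  realDist = 2*cx*MY*MZ
  cplxDist = cx*MY*MZ

  print '(A,4(1X,I0))', 'real strides    :', realStrides
  print '(A,4(1X,I0))', 'complex strides :', cplxStrides
  print '(A,I0)', 'real distance    : ', realDist
  print '(A,I0)', 'complex distance : ', cplxDist

  ! array(:,1,1,k) starts (k-1) padded data sets into the 4D array.
  do k = 1, MK
    print '(A,I0,A,I0)', 'data set ', k, ' starts at real offset ', (k - 1)*realDist
  end do
end program padded_layout
```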

 

Thomas_D_1
Beginner

Thanks so much Evgueni! I just made those changes and now it's scaling as expected.

Out of curiosity, is this difference in parallel behavior between a DFT with DFTI_NUMBER_OF_TRANSFORMS = 1 and one with DFTI_NUMBER_OF_TRANSFORMS > 1 mentioned in the documentation somewhere? I just took a quick look through my PDF of the MKL reference manual and didn't notice any mention of it.

Evgueni_P_Intel
Employee

You are right, the documentation advises only on data layouts, number of threads, and thread affinity :)
