I'm working on a program that performs several batches of three 3D (N1xN2xN3) DFTs using MKL's DFT routines. Most of the program runs in parallel using OpenMP, and I'd like the DFT section to scale as well, since it accounts for a significant portion of the program's runtime. However, when I increase the number of threads, the performance improvement plateaus at 3 threads, i.e., the number of transforms in each call. If I instead break the transforms up into 3xN1 2D transforms, the parallel performance continues to scale beyond 3 threads. This seems like a lot of extra work for performance gains I would expect to be handled internally. Is there a way of directing MKL's DFT to do this on its own?
As it may be relevant, I'm already passing the number of available threads to the DFT via the DFTI_NUMBER_OF_USER_THREADS DftiSetValue option, and the three 3D transforms are batched into one call by setting the DFTI_NUMBER_OF_TRANSFORMS option. I can provide some pseudo code if that would be helpful.
Hi Thomas,
Please post the pseudo code here.
Evgueni.
I've included the basic code layout below. I've tried running this kind of program with the number of OpenMP threads ranging from 1 to 8 and the runtime stops improving after 3 threads.
program main
  ! defining variables and stuff
  real :: array(2*(Nx/2 + 1),Ny,Nz,3)
  ...
  ! OMP_GET_NUM_THREADS() returns 1 outside a parallel region, so query the maximum instead
  nThreads = OMP_GET_MAX_THREADS()
  call OMP_SET_NESTED(.TRUE.)

  ! create backward descriptor (complex input, real output)
  status = DftiCreateDescriptor(Bdescr, DFTI_SINGLE, DFTI_REAL, 3, [Nx,Ny,Nz])
  status = DftiSetValue(Bdescr, DFTI_NUMBER_OF_USER_THREADS, nThreads)
  status = DftiSetValue(Bdescr, DFTI_NUMBER_OF_TRANSFORMS, 3)
  status = DftiSetValue(Bdescr, DFTI_INPUT_DISTANCE, (Nx/2 + 1)*Ny*Nz)
  status = DftiSetValue(Bdescr, DFTI_OUTPUT_DISTANCE, 2*(Nx/2 + 1)*Ny*Nz)
  status = DftiSetValue(Bdescr, DFTI_INPUT_STRIDES, [0,1,(Nx/2 + 1),(Nx/2 + 1)*Ny])
  status = DftiSetValue(Bdescr, DFTI_OUTPUT_STRIDES, [0,1,2*(Nx/2 + 1),2*(Nx/2 + 1)*Ny])

  ! create forward descriptor (real input, complex output)
  status = DftiCreateDescriptor(Fdescr, DFTI_SINGLE, DFTI_REAL, 3, [Nx,Ny,Nz])
  status = DftiSetValue(Fdescr, DFTI_NUMBER_OF_USER_THREADS, nThreads)
  status = DftiSetValue(Fdescr, DFTI_NUMBER_OF_TRANSFORMS, 3)
  status = DftiSetValue(Fdescr, DFTI_INPUT_DISTANCE, 2*(Nx/2 + 1)*Ny*Nz)
  status = DftiSetValue(Fdescr, DFTI_OUTPUT_DISTANCE, (Nx/2 + 1)*Ny*Nz)
  status = DftiSetValue(Fdescr, DFTI_INPUT_STRIDES, [0,1,2*(Nx/2 + 1),2*(Nx/2 + 1)*Ny])
  status = DftiSetValue(Fdescr, DFTI_OUTPUT_STRIDES, [0,1,(Nx/2 + 1),(Nx/2 + 1)*Ny])

  ! initialize array
  array = 1.0

  ! do many loops of forward and backward DFT
  ! start timer
  do i = 1, nLoops
    status = DftiCommitDescriptor(Fdescr)
    status = DftiComputeForward(Fdescr, array(:,1,1,1))
    status = DftiCommitDescriptor(Bdescr)
    status = DftiComputeBackward(Bdescr, array(:,1,1,1))
  end do
  ! end timer
end program main
Hi Thomas,
You mentioned "I'm running most of the program in parallel using OpenMP", so as I understand it, the DFT call should sit inside an OpenMP parallel region (i.e., a nested parallel model). But that does not appear in the code.
Setting DFTI_NUMBER_OF_TRANSFORMS to 3 is fine, since you are computing three 3D (N1xN2xN3) DFTs. That is why your performance results correspond to 3 threads.
However, DFTI_NUMBER_OF_USER_THREADS should not be used with the latest version. From the documentation:
"The DFTI_NUMBER_OF_USER_THREADS configuration parameter is no longer used and kept for compatibility with previous versions of Intel MKL."
Could you please tell us which MKL version, command line, and hardware configuration you are using?
Best Regards,
Ying
Thanks for the reply. The DFT in my program isn't embedded in a parallel section, but I would like it to run in parallel because it accounts for a significant share of the total calculation. The structure of the program looks like this:
program main
  ! initialization stuff
  ...
  do
    ! perform lots of parallel calculations
    ...
    ! perform forward DFT
    call DFTForward(descrF, array(:,1,1,1))
    ! perform parallel calculations with DFT of array
    ...
    ! perform backward DFT
    call DFTBackward(descrB, array(:,1,1,1))
    ! perform more parallel calculations
    ...
  end do
end program main
I'm using ifort version 14.0.2 20140120. Not sure which version of MKL that has. As for hardware I'm working on a Linux cluster with AMD Opteron 6212 Processors each with 16 CPUs.
Hi Thomas,
Thanks for the clarification. You haven't shown how you compile the code or the exact performance data versus thread count, so I'm not sure I have the full picture, but please try the following.
1. Could you remove the limitation status = DftiSetValue(Bdescr, DFTI_NUMBER_OF_USER_THREADS, nThreads) and try again? Please let us know the result.
2. In addition, there is sample code under the MKL install directory: MKLexample\dftf\source\config_number_of_transforms.f90 and config_number_of_user_threads.f90. You could test the performance of the MKL DFT part on its own first, and then consider the parallel program as a whole.
3. Or you could insert OpenMP thread-control calls around the parallel calculations in your code and observe how the thread count behaves:
do
  call omp_set_num_threads(8)   ! 8 physical cores, right?
  ...

  ! perform forward DFT
  call omp_set_num_threads(3)
  call DFTForward(descrF, array(:,1,1,1))

  ! perform parallel calculations with DFT of array
  ...

  ! perform backward DFT
  call DFTBackward(descrB, array(:,1,1,1))
  call omp_set_num_threads(8)

  ! perform more parallel calculations
  ...
end do
Please let us know the result.
Best Regards,
Ying
Since pseudo-code doesn't appear to be doing the trick, I boiled my program down to a simple example, listed here:
!=========================================================================================
program main
!=========================================================================================
  use, intrinsic :: iso_fortran_env
  use mkl_dfti
  use omp_lib
  implicit none

  ! local parameters:
  integer, parameter     :: IK = 4      ! kind of IK itself cannot refer to IK
  integer(IK), parameter :: RK = 8
  integer(IK), parameter :: DFT_CK = 4
  integer(IK), parameter :: DFT_RK = 4
  integer(IK), parameter :: NLOOPS = 100
  integer(IK), parameter :: NX = 100
  integer(IK), parameter :: NY = 40
  integer(IK), parameter :: NZ = 60
  integer(IK), parameter :: MX = 2*NX - 1
  integer(IK), parameter :: MY = 2*NY - 1
  integer(IK), parameter :: MZ = 2*NZ - 1

  ! local variables:
  real(DFT_RK), allocatable :: array(:,:,:,:)
  integer(IK) :: status
  integer(IK) :: nThreads

  write(6,'("Number of processors = ",I0)') OMP_GET_NUM_PROCS()
  write(6,'("Maximum number of available threads = ",I0)') OMP_GET_MAX_THREADS()
  write(6,'(A)') ''

  write(6,'("Generating array...")')
  allocate(array(2*(MX/2+1),MY,MZ,3))
  array = 1.0
  write(6,'(A)') ''

  ! run the same test with 1 to 8 OpenMP threads
  write(6,'("TEST DFT_r2r_3D...")')
  do nThreads = 1, 8
    write(6,'("Number of OMP threads = ",I0)') nThreads
    call omp_set_num_threads(nThreads)
    call TEST_DFT_r2r_3D(array, 3, MX, MY, MZ, NLOOPS)
    write(6,'(A)') ''
  end do

contains

  !=======================================================================================
  ! TEST_DFT_r2r_3D:
  !
  ! Performs a number of forward and backward DFTs using 3d transforms.
  !
  subroutine TEST_DFT_r2r_3D( array, MK, MX, MY, MZ, NLOOPS )
    real(DFT_RK), intent(inout) :: array(:,:,:,:)
    integer(IK) , intent(in)    :: MK, MX, MY, MZ
    integer(IK) , intent(in)    :: NLOOPS

    ! local variables:
    integer(IK) :: cFinish, cStart, cRate, cMax
    type(DFTI_DESCRIPTOR), pointer :: descrB, descrF
    integer(IK) :: i
    integer(IK) :: status
    real(RK)    :: start, finish, time

    ! create backward descriptor
    print *,"Creating DFTI Backward descriptor"
    status = DftiCreateDescriptor(descrB, DFTI_SINGLE, DFTI_REAL, 3, [MX,MY,MZ])
    !status = DftiSetValue(descrB, DFTI_NUMBER_OF_USER_THREADS, nThreads)
    status = DftiSetValue(descrB, DFTI_CONJUGATE_EVEN_STORAGE, DFTI_COMPLEX_COMPLEX)
    status = DftiSetValue(descrB, DFTI_BACKWARD_SCALE, 1.0/(MX*MY*MZ))
    status = DftiSetValue(descrB, DFTI_NUMBER_OF_TRANSFORMS, MK)
    status = DftiSetValue(descrB, DFTI_INPUT_DISTANCE ,   (MX/2 + 1)*MY*MZ)
    status = DftiSetValue(descrB, DFTI_OUTPUT_DISTANCE, 2*(MX/2 + 1)*MY*MZ)
    status = DftiSetValue(descrB, DFTI_INPUT_STRIDES , [0,1,  (MX/2 + 1),  (MX/2 + 1)*MY])
    status = DftiSetValue(descrB, DFTI_OUTPUT_STRIDES, [0,1,2*(MX/2 + 1),2*(MX/2 + 1)*MY])

    ! create forward descriptor
    print *,"Creating DFTI Forward descriptor"
    status = DftiCreateDescriptor(descrF, DFTI_SINGLE, DFTI_REAL, 3, [MX,MY,MZ])
    !status = DftiSetValue(descrF, DFTI_NUMBER_OF_USER_THREADS, nThreads)
    status = DftiSetValue(descrF, DFTI_CONJUGATE_EVEN_STORAGE, DFTI_COMPLEX_COMPLEX)
    status = DftiSetValue(descrF, DFTI_NUMBER_OF_TRANSFORMS, MK)
    status = DftiSetValue(descrF, DFTI_INPUT_DISTANCE , 2*(MX/2 + 1)*MY*MZ)
    status = DftiSetValue(descrF, DFTI_OUTPUT_DISTANCE,   (MX/2 + 1)*MY*MZ)
    status = DftiSetValue(descrF, DFTI_INPUT_STRIDES , [0,1,2*(MX/2 + 1),2*(MX/2 + 1)*MY])
    status = DftiSetValue(descrF, DFTI_OUTPUT_STRIDES, [0,1,  (MX/2 + 1),  (MX/2 + 1)*MY])

    ! perform transforms
    write(6,'("Starting loop...")')
    call system_clock(cStart, cRate, cMax)
    call cpu_time(start)
    do i = 1, NLOOPS
      status = DftiCommitDescriptor(descrF)
      status = DftiComputeForward(descrF, array(:,1,1,1))
      status = DftiCommitDescriptor(descrB)
      status = DftiComputeBackward(descrB, array(:,1,1,1))
    end do
    call system_clock(cFinish, cRate, cMax)
    call cpu_time(finish)

    ! output results
    write(6,'(" CPU_TIME = ",ES26.16)') finish - start
    if (cFinish < cStart) cFinish = cFinish + cMax
    write(6,'(" SYSTEM_CLOCK = ",ES26.16)') real(cFinish - cStart)/real(cRate)

    ! cleanup
    status = DftiFreeDescriptor(descrB)
    status = DftiFreeDescriptor(descrF)
  end subroutine TEST_DFT_r2r_3D
!=========================================================================================
end program main
!=========================================================================================
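For reference, the stride and distance values passed to DftiSetValue in the program above follow from the padded in-place layout of the array. The Python sketch below only reproduces that arithmetic for MX = 199, MY = 79, MZ = 119; the helper name `cce_layout` and the dictionary keys are hypothetical and not part of MKL.

```python
def cce_layout(mx, my, mz):
    """Strides and distances for an in-place real 3D transform whose
    complex side holds mx//2 + 1 elements along the first dimension."""
    nc = mx // 2 + 1  # complex elements per row (integer division, as in Fortran)
    return {
        # real side: counted in real elements, rows padded to 2*nc
        "real_strides": [0, 1, 2 * nc, 2 * nc * my],
        "real_distance": 2 * nc * my * mz,
        # complex side: counted in complex elements
        "complex_strides": [0, 1, nc, nc * my],
        "complex_distance": nc * my * mz,
    }


# Sizes from the program above: MX = 2*100 - 1, MY = 2*40 - 1, MZ = 2*60 - 1
layout = cce_layout(199, 79, 119)
for key, value in layout.items():
    print(key, value)
```

The forward descriptor uses the real-side values for its input and the complex-side values for its output; the backward descriptor swaps them.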
I'm compiling with the compiler flags
FFLAGS = -O3 -msse2 -openmp -parallel
Compiling and running this program gives
Number of processors = 16
Maximum number of available threads = 8

Generating array...

TEST DFT_r2r_3D...
Number of OMP threads = 1
Creating DFTI Backward descriptor
Creating DFTI Forward descriptor
Starting loop...
CPU_TIME = 7.7115277000000006E+01
SYSTEM_CLOCK = 7.5663902282714844E+01

Number of OMP threads = 2
Creating DFTI Backward descriptor
Creating DFTI Forward descriptor
Starting loop...
CPU_TIME = 7.5902460999999988E+01
SYSTEM_CLOCK = 3.7973400115966797E+01

Number of OMP threads = 3
Creating DFTI Backward descriptor
Creating DFTI Forward descriptor
Starting loop...
CPU_TIME = 7.5313551000000018E+01
SYSTEM_CLOCK = 2.5120399475097656E+01

Number of OMP threads = 4
Creating DFTI Backward descriptor
Creating DFTI Forward descriptor
Starting loop...
CPU_TIME = 7.5515520000000009E+01
SYSTEM_CLOCK = 2.5186500549316406E+01

Number of OMP threads = 5
Creating DFTI Backward descriptor
Creating DFTI Forward descriptor
Starting loop...
CPU_TIME = 7.5352544999999964E+01
SYSTEM_CLOCK = 2.5134099960327148E+01

Number of OMP threads = 6
Creating DFTI Backward descriptor
Creating DFTI Forward descriptor
Starting loop...
CPU_TIME = 7.5456527999999992E+01
SYSTEM_CLOCK = 2.5168199539184570E+01

Number of OMP threads = 7
Creating DFTI Backward descriptor
Creating DFTI Forward descriptor
Starting loop...
CPU_TIME = 7.5415534999999977E+01
SYSTEM_CLOCK = 2.5154600143432617E+01

Number of OMP threads = 8
Creating DFTI Backward descriptor
Creating DFTI Forward descriptor
Starting loop...
CPU_TIME = 7.5428532999999902E+01
SYSTEM_CLOCK = 2.5156799316406250E+01
So the program runtime plateaus at about 25 seconds after only 3 threads are given to OpenMP.
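To put numbers on the plateau, here is a small Python sketch that computes speedup and parallel efficiency from the SYSTEM_CLOCK values reported in this post, rounded to four decimals. The variable names are illustrative only.

```python
# Wall-clock times in seconds (SYSTEM_CLOCK) from the run in this post,
# keyed by the number of OpenMP threads and rounded to four decimals.
wall_seconds = {
    1: 75.6639, 2: 37.9734, 3: 25.1204, 4: 25.1865,
    5: 25.1341, 6: 25.1682, 7: 25.1546, 8: 25.1568,
}

serial = wall_seconds[1]
# Speedup relative to the single-thread run.
speedup = {n: serial / t for n, t in wall_seconds.items()}
# Parallel efficiency: speedup divided by the thread count.
efficiency = {n: speedup[n] / n for n in wall_seconds}

for n in sorted(wall_seconds):
    print(f"{n} threads: speedup {speedup[n]:.2f}, efficiency {efficiency[n]:.2f}")
```

Speedup is close to linear up to 3 threads and then stays near 3, while CPU_TIME remains at roughly 75 seconds throughout, which suggests the additional threads are simply not being given any work.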
Hi Thomas,
The DFTI_NUMBER_OF_TRANSFORMS setting is tuned for large transposed batches.
To improve scaling in your case, I have changed the descriptors to do 1 transform at a time in a loop in TEST_DFT_r2r_3D.
I have also moved DftiCommitDescriptor out of the loop.
Now it scales.
Evgueni.
!=========================================================================================
! TEST_DFT_r2r_3D:
!
! Performs a number of forward and backward DFTs using 3d transforms.
!
subroutine TEST_DFT_r2r_3D( array, MK, MX, MY, MZ, NLOOPS )
  real(DFT_RK), intent(inout) :: array(:,:,:,:)
  integer(IK) , intent(in)    :: MK, MX, MY, MZ
  integer(IK) , intent(in)    :: NLOOPS

  ! local variables:
  integer(IK) :: cFinish, cStart, cRate, cMax
  type(DFTI_DESCRIPTOR), pointer :: descrB, descrF
  integer(IK) :: i, k
  integer(IK) :: status
  real(RK)    :: start, finish, time

  ! create backward descriptor
  print *,"Creating DFTI Backward descriptor"
  status = DftiCreateDescriptor(descrB, DFTI_SINGLE, DFTI_REAL, 3, [MX,MY,MZ])
  !status = DftiSetValue(descrB, DFTI_NUMBER_OF_USER_THREADS, nThreads)
  status = DftiSetValue(descrB, DFTI_CONJUGATE_EVEN_STORAGE, DFTI_COMPLEX_COMPLEX)
  status = DftiSetValue(descrB, DFTI_BACKWARD_SCALE, 1.0/(MX*MY*MZ))
  !status = DftiSetValue(descrB, DFTI_NUMBER_OF_TRANSFORMS, MK)
  !status = DftiSetValue(descrB, DFTI_INPUT_DISTANCE ,   (MX/2 + 1)*MY*MZ)
  !status = DftiSetValue(descrB, DFTI_OUTPUT_DISTANCE, 2*(MX/2 + 1)*MY*MZ)
  status = DftiSetValue(descrB, DFTI_INPUT_STRIDES , [0,1,  (MX/2 + 1),  (MX/2 + 1)*MY])
  status = DftiSetValue(descrB, DFTI_OUTPUT_STRIDES, [0,1,2*(MX/2 + 1),2*(MX/2 + 1)*MY])

  ! create forward descriptor
  print *,"Creating DFTI Forward descriptor"
  status = DftiCreateDescriptor(descrF, DFTI_SINGLE, DFTI_REAL, 3, [MX,MY,MZ])
  !status = DftiSetValue(descrF, DFTI_NUMBER_OF_USER_THREADS, nThreads)
  status = DftiSetValue(descrF, DFTI_CONJUGATE_EVEN_STORAGE, DFTI_COMPLEX_COMPLEX)
  !status = DftiSetValue(descrF, DFTI_NUMBER_OF_TRANSFORMS, MK)
  !status = DftiSetValue(descrF, DFTI_INPUT_DISTANCE , 2*(MX/2 + 1)*MY*MZ)
  !status = DftiSetValue(descrF, DFTI_OUTPUT_DISTANCE,   (MX/2 + 1)*MY*MZ)
  status = DftiSetValue(descrF, DFTI_INPUT_STRIDES , [0,1,2*(MX/2 + 1),2*(MX/2 + 1)*MY])
  status = DftiSetValue(descrF, DFTI_OUTPUT_STRIDES, [0,1,  (MX/2 + 1),  (MX/2 + 1)*MY])

  ! commit once, outside the timed loop
  status = DftiCommitDescriptor(descrF)
  status = DftiCommitDescriptor(descrB)

  ! perform transforms, one 3d transform per call
  write(6,'("Starting loop...")')
  call system_clock(cStart, cRate, cMax)
  call cpu_time(start)
  do i = 1, NLOOPS
    do k = 1, MK
      status = DftiComputeForward(descrF, array(:,1,1,k))
      status = DftiComputeBackward(descrB, array(:,1,1,k))
    end do
  end do
  call system_clock(cFinish, cRate, cMax)
  call cpu_time(finish)

  ! output results
  write(6,'(" CPU_TIME = ",ES26.16)') finish - start
  if (cFinish < cStart) cFinish = cFinish + cMax
  write(6,'(" SYSTEM_CLOCK = ",ES26.16)') real(cFinish - cStart)/real(cRate)

  ! cleanup
  status = DftiFreeDescriptor(descrB)
  status = DftiFreeDescriptor(descrF)
end subroutine TEST_DFT_r2r_3D
Thanks so much Evgueni! I just made those changes and now it's scaling as expected.
Out of curiosity, is this difference in parallel behavior between DFT with DFTI_NUMBER_OF_TRANSFORMS = 1 and DFT with DFTI_NUMBER_OF_TRANSFORMS > 1 mentioned in the documentation somewhere? I just took a quick look through the PDF I have of the MKL reference manual and I didn't notice any mention of it.
You are right, the documentation advises only on data layouts, number of threads, and thread affinity :)
