Dear Intel Members,
I have a question about how data are distributed among processes when using the Intel MKL cluster FFT functions for 1D transforms. As a test, I compute the 1D FFT of an array of six complex numbers. The result obtained with one process is correct. However, the result obtained with two processes has wrong indices (the values themselves are correct). The results and source code are pasted below. I have read through the manual but am still confused about how the data are distributed among processes. Any advice is greatly appreciated.
Sirui
________________________________________
One-process results:
local_nx= 6 for rank= 0
local_x_start= 1 for rank= 0
local_out_nx= 6 for rank= 0
local_out_x_start= 1 for rank= 0
local_size= 6 for rank= 0
input global array
(0.0000000E+00,1.000000) 1
(1.000000,1.000000) 2
(2.000000,1.000000) 3
(3.000000,1.000000) 4
(4.000000,1.000000) 5
(5.000000,1.000000) 6
output local array with global index
(15.00000,6.000000) 1 for rank= 0
(-3.000000,5.196153) 2 for rank= 0
(-3.000000,1.732051) 3 for rank= 0
(-3.000000,0.0000000E+00) 4 for rank= 0
(-3.000000,-1.732051) 5 for rank= 0
(-3.000000,-5.196152) 6 for rank= 0
successfully done
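For reference, the single-process values can be cross-checked against any serial FFT. A quick NumPy sketch (just a verification aid, not MKL; the unnormalized forward DFT uses the same convention as DftiComputeForwardDM):

```python
import numpy as np

# Input x_j = j + 1i for j = 0..5, matching the global array above
x = np.arange(6) + 1j

# Unnormalized forward DFT -- same sign/scale convention as the MKL forward transform
X = np.fft.fft(x)

# Print in the same (value, 1-based global index) layout as the program output
for k, v in enumerate(X, start=1):
    print(f"({v.real:.6f},{v.imag:.6f}) {k}")
```

This reproduces (15,6), (-3,5.196153), (-3,1.732051), (-3,0), (-3,-1.732051), (-3,-5.196152), matching the one-process output above.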
Two-process results:
local_nx= 4 for rank= 0
local_x_start= 1 for rank= 0
local_out_nx= 4 for rank= 0
local_out_x_start= 1 for rank= 0
local_size= 4 for rank= 0
input global array
(0.0000000E+00,1.000000) 1
(1.000000,1.000000) 2
(2.000000,1.000000) 3
(3.000000,1.000000) 4
(4.000000,1.000000) 5
(5.000000,1.000000) 6
local_nx= 2 for rank= 1
local_x_start= 5 for rank= 1
local_out_nx= 2 for rank= 1
local_out_x_start= 5 for rank= 1
local_size= 3 for rank= 1
(-3.000000,0.0000000E+00) 5 for rank= 1
(-3.000000,-1.732051) 6 for rank= 1
(-3.000000,-5.196152) 7 for rank= 1
output local array with global index
(15.00000,6.000000) 1 for rank= 0
(-3.000000,5.196153) 2 for rank= 0
(-3.000000,1.732051) 3 for rank= 0
(-3.000000,-1.732051) 4 for rank= 0
successfully done
My source code is the following.
[fortran]
program main
  USE mmpivardef
  USE MKL_CDFT
  USE mpi
  IMPLICIT NONE

  complex(4), allocatable, dimension(:) :: in,work,in_local
  INTEGER(4), parameter :: N=6
  integer(4) :: i,j,localsize
  INTEGER(4) :: status,local_nx,x_start,local_out_nx,out_xstart
  TYPE(DFTI_DESCRIPTOR_DM), POINTER :: My_Desc1_Handle

  CALL MPI_INIT(ierr)
  CALL MPI_COMM_DUP(MPI_COMM_WORLD,MCW,ierr)
  CALL MPI_COMM_RANK(MCW,rank,ierr)
  CALL MPI_COMM_SIZE(MCW,msize,ierr)

  allocate(in(N))

  status = DftiCreateDescriptorDM(MCW,My_Desc1_Handle, &
           DFTI_SINGLE,DFTI_COMPLEX,1,N)
  status = DftiGetValueDM(My_Desc1_Handle,CDFT_LOCAL_SIZE,localsize)
  status = DftiGetValueDM(My_Desc1_Handle,CDFT_LOCAL_NX,local_nx)
  status = DftiGetValueDM(My_Desc1_Handle,CDFT_LOCAL_X_START,x_start)
  status = DftiGetValueDM(My_Desc1_Handle,CDFT_LOCAL_OUT_NX,local_out_nx)
  status = DftiGetValueDM(My_Desc1_Handle,CDFT_LOCAL_OUT_X_START,out_xstart)

  write(*,*) 'local_nx=',local_nx,'for rank=',rank
  write(*,*) 'local_x_start=',x_start,'for rank=',rank
  write(*,*) 'local_out_nx=',local_out_nx,'for rank=',rank
  write(*,*) 'local_out_x_start=',out_xstart,'for rank=',rank
  write(*,*) 'local_size=',localsize,'for rank=',rank
  write(*,*)

  ALLOCATE(in_local(localsize))
  ALLOCATE(work(localsize))
  status = DftiSetValueDM(My_Desc1_Handle,CDFT_WORKSPACE,work)

  do i=1,N
    j=i-1
    in(i)=cmplx(j,1)
  enddo

  IF (rank.eq.0) THEN
    write(*,*) 'input global array'
    DO i=1,N
      write(*,*) in(i),i
    ENDDO
  ENDIF
  write(*,*)

  DO i=1,localsize
    in_local(i) = in(i+x_start-1)
  ENDDO

  status = DftiCommitDescriptorDM(My_Desc1_Handle)
  status = DftiComputeForwardDM(My_Desc1_Handle,in_local)

  IF (rank.eq.0) write(*,*) 'output local array with global index'
  DO i=1,localsize
    write(*,*) in_local(i),i+x_start-1,'for rank=',rank
  ENDDO

  status = DftiFreeDescriptorDM(My_Desc1_Handle)
  DEALLOCATE(in_local,work,in)

  IF (rank.eq.0) write(*,*) 'successfully done'
  CALL MPI_FINALIZE(ierr)
end program
[/fortran]
Sirui,
Could you tell us which version of Intel MKL you are using?
Regards,
Chao
Hi Chao,
Thank you for your prompt reply. I have both MKL 10.3.11 and MKL 11.0.1 available. They gave me the same results.
Best,
Sirui
Hello Sirui!
Which numbers confused you?
I will try to explain some of them; perhaps that will clarify the situation. :)
[rank=0] local_nx = 4 <-- local_nx is usually approximately N/P, but not always. E.g. if N=128 and P=2, then local_nx is 64 on both ranks. But in this particular case local_nx(rank=0) is not equal to local_nx(rank=1).
[rank=0] local_x_start= 1
[rank=0] local_out_nx= 4
[rank=0] local_out_x_start= 1
[rank=0] local_size= 4 <-- local_size is approximately equal to local_nx, but may be greater because of the internal CDFT algorithm
[rank=1] local_nx= 2 <-- local_nx(rank=0) + local_nx(rank=1) = N.
[rank=1] local_x_start= 5
[rank=1] local_out_nx= 2
[rank=1] local_out_x_start= 5
[rank=1] local_size= 3 <-- e.g. here CDFT needs an array of size 3 instead of 2
As the "Memory size for local data" section of the "Distributing Data among Processes" article says, the sizes of the local arrays should be at least local_size. But of course only the first local_nx (local_out_nx) elements are meaningful. Hence
[fortran]
DO i=1,localsize
in_local(i) = in(i+x_start-1)
ENDDO
DO i=1,localsize
write(*,*) in_local(i),i+x_start-1,'for rank=',rank
ENDDO
[/fortran]
logically should be replaced with
[fortran]
DO i=1,local_nx
in_local(i) = in(i+x_start-1)
ENDDO
DO i=1,local_out_nx
write(*,*) in_local(i),i+out_xstart-1,'for rank=',rank
ENDDO
[/fortran]
Anyway, there is an error in the sequence of numbers (rank=0, index=4 somehow duplicates index=6)... I will dig into it more and write here later.
Hi EVARIST F.,
Thank you for pointing out the difference between local_size and local_nx/local_out_nx. I agree with you that localsize logically should be replaced by local_nx/local_out_nx in the piece of code you quoted. What really confuses me is the error in the sequence of numbers, as you mentioned at the end of your message. Any advice is greatly appreciated.
Sirui
Sirui,
I finally found out that this is a bug in DftiGetValueDM: the returned values of local_out_nx and local_out_x_start are incorrect. For your example with N = 6, the correct values should be the following:
[rank=0] local_nx = 4
[rank=0] local_x_start= 1
[rank=0] local_out_nx= 3
[rank=0] local_out_x_start= 1
[rank=0] local_size= 4
[rank=1] local_nx= 2
[rank=1] local_x_start= 5
[rank=1] local_out_nx= 3
[rank=1] local_out_x_start= 4
[rank=1] local_size= 3
This is why you got an incorrect sequence of numbers (the rank=0, index=4 entry is actually simply redundant).
Unfortunately, there is no easy way to determine what out_nx and out_x_start should be, even if you know the length of the FFT and the number of MPI processes. But if the number of MPI processes is a perfect square, this problem should not appear. I have filed a bug, and it will be fixed in a future release.
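One quick way to convince yourself that these corrected values are consistent is to check that the two ranks' meaningful output blocks tile the full spectrum with no gap and no overlap. A NumPy sketch (the per-rank parameters below are my reconstruction of the corrected distribution listed above, not actual MKL output):

```python
import numpy as np

# Full 6-point spectrum of the test input x_j = j + 1i
X = np.fft.fft(np.arange(6) + 1j)

# Corrected per-rank output parameters reported above (1-based indices)
out_x_start = {0: 1, 1: 4}
out_nx      = {0: 3, 1: 3}

# Each rank's meaningful output block, sliced out of the full spectrum
blocks = [X[out_x_start[r] - 1 : out_x_start[r] - 1 + out_nx[r]] for r in (0, 1)]

# The blocks cover all 6 outputs exactly once: no gap, no overlap
assert sum(out_nx.values()) == len(X)
assert np.allclose(np.concatenate(blocks), X)
```

By contrast, the buggy values (out_nx=4/out_x_start=1 and out_nx=2/out_x_start=5) do not describe where the data actually lands, which is exactly why rank 0's fourth printed entry is redundant.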
Thank you for your help!
Evarist,
I suspected that the local_out_nx and local_out_x_start returned by DftiGetValueDM were incorrect, and your answer confirmed my guess. In my experiments, local_out_nx always equals local_nx and local_out_x_start always equals local_x_start. Interestingly, IFFT(FFT(A))==A still holds, which implies that DftiComputeBackwardDM is somehow able to figure out the correct data distribution. Anyway, I am looking forward to a new release with the bug fixed. Thank you for your helpful comments!
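The roundtrip identity itself is easy to illustrate with a serial FFT. Note that np.fft.ifft already applies the 1/N normalization; with CDFT the equivalent would require a backward scale of 1/N (an assumption about the setup used here, since MKL's default backward scale is 1):

```python
import numpy as np

a = np.arange(6) + 1j

# np.fft.ifft includes the 1/N factor, so the forward/backward
# roundtrip returns the original data
roundtrip = np.fft.ifft(np.fft.fft(a))

assert np.allclose(roundtrip, a)
```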
Sirui
Sirui,
You are absolutely right about the identical values: DftiGetValueDM computes the out values with the same formula as the in values. The internal parameters, however, are correct, which is why DftiComputeBackwardDM knows the actual data distribution.
Hi Sirui,
You can use the CQ DPD200338629 as a reference when the bug fix is ready.
Thanks
Chao
Hi Sirui,
Could you check the latest MKL 11.0 update? This problem is expected to be fixed in the new release.
Regards,
Chao
Hi Chao,
Could you please specify the release number of the latest MKL 11.0 update you mentioned? Thank you.
Sirui
Sirui,
A clarification: we are planning to include the fix in the next update release (possibly MKL 11.0 update 5), but that release is not available yet. I will keep you posted when it is ready. Sorry for the confusion.
Regards,
Chao