Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Cluster FFT of 1D array using MPI

siruitan
Beginner

Dear Intel Members,

I have a question about how data are distributed among processes when using the Intel MKL cluster FFT functions for 1D transforms. I am testing a 1D FFT of an array of six complex numbers. The result obtained with one process is correct. However, the result obtained with two processes has wrong indices (the numbers themselves are correct). The results and source code are pasted below. I am confused about how the data are distributed among processes, even though I have read through the manual. Any advice is greatly appreciated.

Sirui

________________________________________

One-process results:

 local_nx=           6 for rank=           0
 local_x_start=           1 for rank=           0
 local_out_nx=           6 for rank=           0
 local_out_x_start=           1 for rank=           0
 local_size=           6 for rank=           0

 input global array
 (0.0000000E+00,1.000000)           1
 (1.000000,1.000000)           2
 (2.000000,1.000000)           3
 (3.000000,1.000000)           4
 (4.000000,1.000000)           5
 (5.000000,1.000000)           6

 output local array with global index
 (15.00000,6.000000)           1 for rank=           0
 (-3.000000,5.196153)           2 for rank=           0
 (-3.000000,1.732051)           3 for rank=           0
 (-3.000000,0.0000000E+00)           4 for rank=           0
 (-3.000000,-1.732051)           5 for rank=           0
 (-3.000000,-5.196152)           6 for rank=           0
 successfully done

Two-process results:

 local_nx=           4 for rank=           0
 local_x_start=           1 for rank=           0
 local_out_nx=           4 for rank=           0
 local_out_x_start=           1 for rank=           0
 local_size=           4 for rank=           0

 input global array
 (0.0000000E+00,1.000000)           1
 (1.000000,1.000000)           2
 (2.000000,1.000000)           3
 (3.000000,1.000000)           4
 (4.000000,1.000000)           5
 (5.000000,1.000000)           6

 local_nx=           2 for rank=           1
 local_x_start=           5 for rank=           1
 local_out_nx=           2 for rank=           1
 local_out_x_start=           5 for rank=           1
 local_size=           3 for rank=           1


 (-3.000000,0.0000000E+00)           5 for rank=           1
 (-3.000000,-1.732051)           6 for rank=           1
 (-3.000000,-5.196152)           7 for rank=           1
 output local array with global index
 (15.00000,6.000000)           1 for rank=           0
 (-3.000000,5.196153)           2 for rank=           0
 (-3.000000,1.732051)           3 for rank=           0
 (-3.000000,-1.732051)           4 for rank=           0
 successfully done

My source code is as follows.

      program main
      USE mmpivardef
      USE MKL_CDFT
      USE mpi

      IMPLICIT NONE
      complex(4), allocatable, dimension(:) :: in,work,in_local
      INTEGER(4), parameter :: N=6
      integer(4) :: i,j,localsize
      INTEGER(4) :: status,local_nx,x_start,local_out_nx,out_xstart
      TYPE(DFTI_DESCRIPTOR_DM), POINTER :: My_Desc1_Handle

      CALL MPI_INIT(ierr)
      CALL MPI_COMM_DUP(MPI_COMM_WORLD,MCW,ierr)
      CALL MPI_COMM_RANK(MCW,rank,ierr)
      CALL MPI_COMM_SIZE(MCW,msize,ierr)


      allocate(in(N))

      status = DftiCreateDescriptorDM(MCW,My_Desc1_Handle, &
                DFTI_SINGLE,DFTI_COMPLEX,1,N)
      status = DftiGetValueDM(My_Desc1_Handle,CDFT_LOCAL_SIZE,localsize)
      status = DftiGetValueDM(My_Desc1_Handle,CDFT_LOCAL_NX,local_nx)
      status = DftiGetValueDM(My_Desc1_Handle,CDFT_LOCAL_X_START, x_start)
      status = DftiGetValueDM(My_Desc1_Handle,CDFT_LOCAL_OUT_NX, local_out_nx)
      status = DftiGetValueDM(My_Desc1_Handle,CDFT_LOCAL_OUT_X_START, out_xstart)

      write(*,*) 'local_nx=',local_nx,'for rank=',rank
      write(*,*) 'local_x_start=',x_start,'for rank=',rank
      write(*,*) 'local_out_nx=',local_out_nx,'for rank=',rank
      write(*,*) 'local_out_x_start=',out_xstart,'for rank=',rank
      write(*,*) 'local_size=',localsize,'for rank=',rank
      write(*,*)


      ALLOCATE(in_local(localsize))
      ALLOCATE(work(localsize))
      status = DftiSetValueDM(My_Desc1_Handle,CDFT_WORKSPACE,work)


      do i=1,N
        j=i-1
        in(i)=cmplx(j,1)
      enddo
      IF (rank.eq.0) THEN
      write(*,*) 'input global array'
      Do i=1,N
        write(*,*) in(i),i
      ENDDO
      ENDIF
      write(*,*)

      DO i=1,localsize
        in_local(i) = in(i+x_start-1)
      ENDDO

      status = DftiCommitDescriptorDM(My_Desc1_Handle)

      status = DftiComputeForwardDM(My_Desc1_Handle,in_local)

      IF (rank.eq.0) write(*,*) 'output local array with global index'
       DO i=1,localsize
        write(*,*) in_local(i),i+x_start-1,'for rank=',rank
       ENDDO

      status = DftiFreeDescriptorDM(My_Desc1_Handle)

      DEALLOCATE(in_local,work,in)

      IF (rank.eq.0) write(*,*) 'successfully done'
      CALL MPI_FINALIZE(ierr)
      end program


Chao_Y_Intel
Moderator

Sirui, 

Could you tell us which version of Intel MKL you are using now?

Regards,
Chao 

siruitan
Beginner

Hi Chao,

Thank you for your prompt reply. I have both MKL 10.3.11 and MKL 11.0.1 available. They gave me the same results.

Best,

Sirui

Evarist_F_Intel
Employee

Hello Sirui!

What numbers confused you?

I will try to explain some of the numbers; hopefully that will clarify the situation. :)

[rank=0]  local_nx=           4    <-- local_nx is usually approximately N/P, but not always. E.g., if N=128 and P=2, both values of local_nx are 64. In this particular case, however, local_nx(rank=0) is not equal to local_nx(rank=1)
[rank=0]  local_x_start=      1
[rank=0]  local_out_nx=       4
[rank=0]  local_out_x_start=  1
[rank=0]  local_size=         4    <-- local_size is approximately equal to local_nx, but may be greater because of the internal CDFT algorithm

[rank=1]  local_nx=           2    <-- local_nx(rank=0) + local_nx(rank=1) = N
[rank=1]  local_x_start=      5
[rank=1]  local_out_nx=       2
[rank=1]  local_out_x_start=  5
[rank=1]  local_size=         3    <-- e.g., here CDFT needs an array of size 3 instead of 2

As it says in "Memory size for local data" section in "distributing data among processes" article sizes of local arrays should be at least local_size. But of cause only first local_nx (local_out_nx) elements are meaningful. Hence

[fortran]

DO i=1,localsize
        in_local(i) = in(i+x_start-1)
ENDDO

DO i=1,localsize
        write(*,*) in_local(i),i+x_start-1,'for rank=',rank
ENDDO

[/fortran]

logically should be replaced with

[fortran]

DO i=1,local_nx
        in_local(i) = in(i+x_start-1)
ENDDO

DO i=1,local_out_nx
        write(*,*) in_local(i),i+out_xstart-1,'for rank=',rank
ENDDO

[/fortran]

Anyway, there is an error in the sequence of numbers (rank=0, index=4 somehow duplicates index=6)... I will dig into it more and write here later.

siruitan
Beginner

Hi EVARIST F.,

Thank you for pointing out the difference between local_size and local_nx/local_out_nx. I agree with you that localsize logically should be replaced by local_nx/local_out_nx in the piece of code you quoted. What really confuses me is the error in the sequence of numbers, as you mentioned at the end of your message. Any advice is greatly appreciated.

Sirui

Evarist_F_Intel
Employee

Sirui,

I finally found out that this is a bug in DftiGetValueDM. The returned values of local_out_nx and local_out_x_start are incorrect. For your example with N = 6, the correct values should be the following:

[rank=0]  local_nx=           4
[rank=0]  local_x_start=      1
[rank=0]  local_out_nx=       3
[rank=0]  local_out_x_start=  1
[rank=0]  local_size=         4

[rank=1]  local_nx=           2
[rank=1]  local_x_start=      5
[rank=1]  local_out_nx=       3
[rank=1]  local_out_x_start=  4
[rank=1]  local_size=         3

This is why you got an incorrect sequence of numbers (actually, the rank=0, index=4 element is simply redundant).

Unfortunately, there is no easy way to determine what out_nx and out_x_start should be, even if you know the length of the FFT and the number of MPI processes. But if the number of MPI processes is a perfect square, this problem should not appear. I have filed a bug, and it will be fixed in our future releases.
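In the meantime, if you want to sanity-check a cluster result, one option is to gather the meaningful output elements on rank 0 and compare them against a serial DFTI transform of the same global input. Below is only a rough sketch (not tested) meant to be spliced into your program; the helper names (counts, displs, gathered, ref, Serial_Handle) are mine, and it assumes that local_out_nx and out_xstart describe the true output layout, which is exactly what the bug breaks here, so it is mainly useful once the fix is available (or with a perfect-square number of processes).

[fortran]

      ! Declarations to add alongside the existing ones.
      USE MKL_DFTI                               ! serial DFTI interface
      TYPE(DFTI_DESCRIPTOR), POINTER :: Serial_Handle
      complex(4) :: ref(N), gathered(N)
      integer(4), allocatable :: counts(:), displs(:)

      ! After DftiComputeForwardDM: collect each rank's meaningful elements.
      allocate(counts(msize), displs(msize))
      CALL MPI_ALLGATHER(local_out_nx,1,MPI_INTEGER,counts,1,MPI_INTEGER,MCW,ierr)
      CALL MPI_ALLGATHER(out_xstart-1,1,MPI_INTEGER,displs,1,MPI_INTEGER,MCW,ierr)
      CALL MPI_GATHERV(in_local,local_out_nx,MPI_COMPLEX, &
                       gathered,counts,displs,MPI_COMPLEX,0,MCW,ierr)

      ! Rank 0 computes the same transform serially and compares.
      IF (rank.eq.0) THEN
        ref = in                                 ! global input is kept on every rank
        status = DftiCreateDescriptor(Serial_Handle,DFTI_SINGLE,DFTI_COMPLEX,1,N)
        status = DftiCommitDescriptor(Serial_Handle)
        status = DftiComputeForward(Serial_Handle,ref)
        status = DftiFreeDescriptor(Serial_Handle)
        write(*,*) 'max |cluster - serial| =',maxval(abs(gathered-ref))
      ENDIF

[/fortran]

With a correct output distribution, the printed difference should be on the order of single-precision round-off.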

Thank you for your help!

siruitan
Beginner

Evarist,

I suspected that the local_out_nx and local_out_x_start returned by DftiGetValueDM were incorrect. Your answer confirmed my guess. Based on my experiments, local_out_nx is always equal to local_nx and local_out_x_start is always equal to local_x_start. What is interesting is that IFFT(FFT(A))==A, which implies that DftiComputeBackwardDM is somehow able to figure out the correct data distribution. Anyway, I am looking forward to the new release with the bug fixed. Thank you for your helpful comments!
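For reference, the round-trip check is essentially just a backward transform on top of the forward one. A minimal sketch (untested; it assumes the default backward scale of 1, so the result is divided by N explicitly) looks like this:

[fortran]

      ! Forward then backward transform in place, scale by 1/N, and compare
      ! the first local_nx elements against the original input on each rank.
      status = DftiComputeForwardDM(My_Desc1_Handle,in_local)
      status = DftiComputeBackwardDM(My_Desc1_Handle,in_local)
      in_local(1:local_nx) = in_local(1:local_nx)/real(N)

      DO i=1,local_nx
        IF (abs(in_local(i)-in(i+x_start-1)) .gt. 1.0e-5) THEN
          write(*,*) 'round-trip mismatch at global index',i+x_start-1, &
                     'for rank=',rank
        ENDIF
      ENDDO

[/fortran]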

Sirui

Evarist_F_Intel
Employee

Sirui,

You are absolutely right about the values being the same. DftiGetValueDM returns the "out" values using the same formula as the "in" values. But the internal parameters are correct, and that is why DftiComputeBackwardDM knows about the data distribution.

Chao_Y_Intel
Moderator

Hi Sirui,

You can use CQ DPD200338629 as a reference for when the bug fix is ready.

Thanks
Chao 

Chao_Y_Intel
Moderator

Hi Sirui,

Could you check the latest MKL 11.0 update? This problem is expected to be fixed in the new release.

Regards,
Chao

siruitan
Beginner

Hi Chao,

Could you please specify the release number of the latest MKL 11.0 update you mentioned? Thank you.

Sirui

Chao_Y_Intel
Moderator

Sirui, 

A clarification: we are planning to include the fix in the next update release (possibly MKL 11.0 update 5), but this release is not available yet. I will keep you posted when it is ready. Sorry for the confusion.

Regards,
Chao 
