Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.
Announcements
FPGA community forums and blogs have moved to the Altera Community. Existing Intel Community members can sign in with their current credentials.
7234 Discussions

Is the real-to-complex transform implemented for cluster FFT?

Jozsef
Beginner
2,056 Views
The manual says, I can set DFTI_REAL as the forward domain in DftiCreateDescriptorDM(). However, the page for DftiComputeForwardDM() says that both the input and output are complex arrays.

Also, there are only complex-to-complex transforms among the examples.

Is the real-to-complex transform implemented for cluster FFT? If so, how do I obtain/calculate the local size of the forward domain?

Thanks,
Jozsef
0 Kudos
11 Replies
Vladimir_Petrov__Int
New Contributor III
2,056 Views
Jozsef,

Cluster FFT supports real-to-complex transforms.
Please use the following pseudo-code as a general guideline:

MKL_LONG n[]={10,20};
DftiCreateDescriptorDM(comm, &handle, DFTI_DOUBLE, DFTI_REAL, 2, n);
DftiCommitDescriptorDM(handle);
MKL_LONG size;
DftiGetValueDM(handle, CDFT_LOCAL_SIZE, &size);
double *inout = (double *)malloc(size*sizeof(double complex));
MKL_LONG nx;
DftiGetValueDM(handle,CDFT_LOCAL_NX,&nx);
MKL_LONG x_start;
DftiGetValueDM(handle, CDFT_LOCAL_X_START, &x_start);
// initialize input data as array of nx x n[1] doubles placed (with possible padding) in array of nx x (n[1]/2+1) double complexes

I hope this helps.

Best regards,
-Vladimir
0 Kudos
Jozsef
Beginner
2,056 Views
Thanks Vladimir,

The above seems to work, i.e. it compiles and runs. However, the following simple test fails to give a reasonably small error on a single cpu. What am I missing?

Thanks,
Jozsef

[cpp]  const MKL_LONG N = 12;
  const MKL_LONG LENGTHS[3] = {N,N,N};
  const MKL_LONG PO = N*N*N;

  double err, maxerr;
  double complex *U, *V, *W;
  DFTI_DESCRIPTOR_DM_HANDLE fft;
  MKL_LONG n, size, local_nx, local_x_start;


  MPI_Init(&argc,&argv);

  DftiCreateDescriptorDM( MPI_COMM_WORLD, &fft, DFTI_DOUBLE, DFTI_REAL, 3, LENGTHS);
  DftiGetValueDM(fft, CDFT_LOCAL_SIZE, &size);
  DftiGetValueDM(fft, CDFT_LOCAL_NX, &local_nx);
  DftiGetValueDM(fft, CDFT_LOCAL_X_START, &local_x_start);
  DftiSetValueDM(fft, DFTI_PLACEMENT, DFTI_INPLACE);
  DftiSetValueDM(fft, DFTI_FORWARD_SCALE, 1.0);
  DftiSetValueDM(fft, DFTI_BACKWARD_SCALE, 1.0/(double)PO);

  U = (double complex*)malloc(size*sizeof(double complex));
  V = (double complex*)malloc(size*sizeof(double complex));
  W = (double complex*)malloc(size*sizeof(double complex));

  DftiSetValueDM(fft, CDFT_WORKSPACE, W);
  DftiCommitDescriptorDM(fft);


  double *u = (double*)U;
  double *v = (double*)V;

  for ( n = 0; n < PO; n++ )
    u = v = (double)n;

  DftiComputeForwardDM(fft, U);
  DftiComputeBackwardDM(fft, U);

  maxerr = 0.0;
  for ( n = 0; n < PO; n++ )
  {
    err = fabs(u-v);
    if (err > maxerr)
      maxerr = err;
  }
  printf("maxerr = %gn",maxerr);


  free( U );
  free( V );
  free( W );

  DftiFreeDescriptorDM(&fft);

  MPI_Finalize();[/cpp]

0 Kudos
Jozsef
Beginner
2,056 Views
Interestingly, the above example works for a complex forward domain (and the loops going until 2*PO).
J
0 Kudos
Vladimir_Petrov__Int
New Contributor III
2,056 Views
Jozsef,

I cannot run your code right now, but there is a couple of mistakes. Please find my comments/corrections below below:

  1. constMKL_LONGN=12;
  2. constMKL_LONGLENGTHS[3]={N,N,N}; // comment: this are global sizes - not local
  3. constMKL_LONGPO=N*N*N; // comment: hence this is also global
  4. doubleerr,maxerr;
  5. doublecomplex*U,*V,*W;
  6. DFTI_DESCRIPTOR_DM_HANDLEfft;
  7. MKL_LONGn,size,local_nx,local_x_start;
  8. MPI_Init(&argc,&argv);
  9. DftiCreateDescriptorDM(MPI_COMM_WORLD,&fft,DFTI_DOUBLE,DFTI_REAL,3,LENGTHS);
  10. DftiGetValueDM(fft,CDFT_LOCAL_SIZE,&size); // comment: the size returned here is local to this process
  11. DftiGetValueDM(fft,CDFT_LOCAL_NX,&local_nx); // comment: this is how many DPs along the most strided dimension this process will own
  12. DftiGetValueDM(fft,CDFT_LOCAL_X_START,&local_x_start); // comment: this is at which DP the process's DPs along the most strided dimension start
  13. DftiSetValueDM(fft,DFTI_PLACEMENT,DFTI_INPLACE);
  14. DftiSetValueDM(fft,DFTI_FORWARD_SCALE,1.0);
  15. DftiSetValueDM(fft,DFTI_BACKWARD_SCALE,1.0/(double)PO);
  16. U=(doublecomplex*)malloc(size*sizeof(doublecomplex));
  17. V=(doublecomplex*)malloc(size*sizeof(doublecomplex));
  18. W=(doublecomplex*)malloc(size*sizeof(doublecomplex));
  19. DftiSetValueDM(fft,CDFT_WORKSPACE,W);
  20. DftiCommitDescriptorDM(fft);
  21. double*u=(double*)U;
  22. double*v=(double*)V;
  23. for(n=0;n
  24. u=v=(double)n; // correction: the size obtained in line 14 is not neccessarily equal to P0 (please look at my previous post). You need to make it a doubly nested loop going over an array of local_nx x (LENGTHES[1] * (LENGTHES[2]/2 + 1)) reals
  25. DftiComputeForwardDM(fft,U);
  26. DftiComputeBackwardDM(fft,U);
  27. maxerr=0.0;
  28. for(n=0;n
  29. {
  30. err=fabs(u-v);
  31. if(err>maxerr)
  32. maxerr=err;
  33. }
  34. printf("maxerr=%g\n",maxerr);
  35. free(U);
  36. free(V);
  37. free(W);
  38. DftiFreeDescriptorDM(&fft);
  39. MPI_Finalize();
I hope this helps.

Best regards,
-Vladimir
0 Kudos
Jozsef
Beginner
2,056 Views
Yes. For more than one process, I would compute the end of the loops as "size / (N/2+1) * N", which is the number of doubles the given process owns -- as I understand.

This is the same as PO for a single process, since in that case local_nx = N. Correct?

With the above in mind, the behavior is the same.

J
0 Kudos
Vladimir_Petrov__Int
New Contributor III
2,056 Views
First, you correctly assume that for a single MPI proc (mpiexec -np 1 ...) local_nx will be equal to global N.

Next, size >= local_nx * N * (N/2 + 1) - for some transforms a larger buffer is required, hence the equality not necessarily holds.

Lastly, your data have to be layed out correctly (so called CCE layout) and that is done as follows:

double *input;
double complex *output = input;

int n2_padded = 2*(N/2 + 1);
int i0, i1, i2;

// init input
for (i0=0; i0 for (i1=0; i1 for (i2=0; i2 input[(i0*N + i1)*n2_padded + i2] = get_data_from_input_global_array_at(local_start_x + i0, i1, i2);
}
}
}

// do fft

// use the computed data
for (i0=0; i0 for (i1=0; i1 for (i2=0; i2 put_computed_data_into_output_global_array_at(local_start_x_out + i0, i1, i2) = output[(i0*N + i1)*(N/2 + 1) + i2] = ;
}
}
}

best regards,
-Vladimir
0 Kudos
Jozsef
Beginner
2,056 Views
This is fine, but I'm trying to get something simpler right first: on a single process (no assembly) and I'm simply trying to get the initial data back after a forward and a backward fft.

For example:

(1) Generate a bunch of random doubles into the NxNxN array U and make a copy in V.

(2) Do an in-place forward and backward FFT on U.

(3) Compare U and V, which should be the same (modulo floating point errors), irrespective of the (real, i.e. not complex) data storage between the transforms.

I tried this for non-cluster fft and it works for both real and complex forward domains. However, for cluster fft it only works for a complex domain and doesn't for a real one.

J
0 Kudos
Vladimir_Petrov__Int
New Contributor III
2,056 Views
Jozsef,

First of all, Cluster FFT is not to be used on a single MPI process. It's the DFTI's (non-cluster FFT's) realm.
Second of all, it might be that this case does not even work with CFFT. I will double check and report back soon.

Best regards,
-Vladimir
0 Kudos
Jozsef
Beginner
2,056 Views
Ok. The problem was not Cluster FFT-related. Rather, I didn't correctly test the array in CCE format.

In particular, the loop
[cpp]  maxerr = 0.0;
  for ( n = 0; n < PO; n++ )
  {
    err = fabs(u-v);
    if (err > maxerr)
      maxerr = err;
  }
  printf("maxerr = %gn",maxerr);[/cpp]
is only fine for testing contiguous data, such as result from complex-to-complex transforms. The CCE format has some unused blocks, at the end of each row (2 doubles), which should not be compared.

Instead of the above, I need something like
[cpp]  maxerr = 0.0;
  for ( n = 0; n < N*N; n++ )
    for ( m = 0; m < N; m++ )
    {
      err = fabs(u[n*2*(N/2+1)+m] - v[n*2*(N/2+1)+m]);
      if (err > maxerr)
        maxerr = err;
     }
  printf("maxerr = %gn",maxerr);[/cpp]
which gives the correct small error. Now all this works for dfti, next is to get it right for cdft, as well.

Thanks,
J
0 Kudos
Jozsef
Beginner
2,056 Views
As I suspected, the issue was the same for cluster fft.

J
0 Kudos
Vladimir_Petrov__Int
New Contributor III
2,056 Views
Jozsef,

I'm glad to hear that the Good vanquished the evil (once again).

Best regards,
-Vladimir
0 Kudos
Reply