Is the real-to-complex transform implemented for cluster FFT?

Jozsef · ‎11-18-2010

The manual says, I can set DFTI_REAL as the forward domain in DftiCreateDescriptorDM(). However, the page for DftiComputeForwardDM() says that both the input and output are complex arrays.

Also, there are only complex-to-complex transforms among the examples.

Is the real-to-complex transform implemented for cluster FFT? If so, how do I obtain/calculate the local size of the forward domain?

Thanks,
Jozsef

Vladimir_Petrov__Int · ‎11-22-2010

Jozsef,

Cluster FFT supports real-to-complex transforms.
Please use the following pseudo-code as a general guideline:

MKL_LONG n[]={10,20};
DftiCreateDescriptorDM(comm, &handle, DFTI_DOUBLE, DFTI_REAL, 2, n);
DftiCommitDescriptorDM(handle);
MKL_LONG size;
DftiGetValueDM(handle, CDFT_LOCAL_SIZE, &size);
double *inout = (double *)malloc(size*sizeof(double complex));
MKL_LONG nx;
DftiGetValueDM(handle,CDFT_LOCAL_NX,&nx);
MKL_LONG x_start;
DftiGetValueDM(handle, CDFT_LOCAL_X_START, &x_start);
// initialize input data as array of nx x n[1] doubles placed (with possible padding) in array of nx x (n[1]/2+1) double complexes

I hope this helps.

Best regards,
-Vladimir

Jozsef · ‎02-10-2011

Thanks Vladimir,

The above seems to work, i.e. it compiles and runs. However, the following simple test fails to give a reasonably small error on a single cpu. What am I missing?

Thanks,
Jozsef

[cpp]  const MKL_LONG N = 12;
  const MKL_LONG LENGTHS[3] = {N,N,N};
  const MKL_LONG PO = N*N*N;

  double err, maxerr;
  double complex *U, *V, *W;
  DFTI_DESCRIPTOR_DM_HANDLE fft;
  MKL_LONG n, size, local_nx, local_x_start;


  MPI_Init(&argc,&argv);

  DftiCreateDescriptorDM( MPI_COMM_WORLD, &fft, DFTI_DOUBLE, DFTI_REAL, 3, LENGTHS);
  DftiGetValueDM(fft, CDFT_LOCAL_SIZE, &size);
  DftiGetValueDM(fft, CDFT_LOCAL_NX, &local_nx);
  DftiGetValueDM(fft, CDFT_LOCAL_X_START, &local_x_start);
  DftiSetValueDM(fft, DFTI_PLACEMENT, DFTI_INPLACE);
  DftiSetValueDM(fft, DFTI_FORWARD_SCALE, 1.0);
  DftiSetValueDM(fft, DFTI_BACKWARD_SCALE, 1.0/(double)PO);

  U = (double complex*)malloc(size*sizeof(double complex));
  V = (double complex*)malloc(size*sizeof(double complex));
  W = (double complex*)malloc(size*sizeof(double complex));

  DftiSetValueDM(fft, CDFT_WORKSPACE, W);
  DftiCommitDescriptorDM(fft);


  double *u = (double*)U;
  double *v = (double*)V;

  for ( n = 0; n < PO; n++ )
    u = v = (double)n;

  DftiComputeForwardDM(fft, U);
  DftiComputeBackwardDM(fft, U);

  maxerr = 0.0;
  for ( n = 0; n < PO; n++ )
  {
    err = fabs(u-v);
    if (err > maxerr)
      maxerr = err;
  }
  printf("maxerr = %gn",maxerr);


  free( U );
  free( V );
  free( W );

  DftiFreeDescriptorDM(&fft);

  MPI_Finalize();[/cpp]

Jozsef · ‎02-10-2011

Interestingly, the above example works for a complex forward domain (and the loops going until 2*PO).
J

Vladimir_Petrov__Int · ‎02-10-2011

Jozsef,

I cannot run your code right now, but there is a couple of mistakes. Please find my comments/corrections below below:

constMKL_LONGN=12;
constMKL_LONGLENGTHS[3]={N,N,N}; // comment: this are global sizes - not local
constMKL_LONGPO=N*N*N; // comment: hence this is also global
doubleerr,maxerr;
doublecomplex*U,*V,*W;
DFTI_DESCRIPTOR_DM_HANDLEfft;
MKL_LONGn,size,local_nx,local_x_start;
MPI_Init(&argc,&argv);
DftiCreateDescriptorDM(MPI_COMM_WORLD,&fft,DFTI_DOUBLE,DFTI_REAL,3,LENGTHS);
DftiGetValueDM(fft,CDFT_LOCAL_SIZE,&size); // comment: the size returned here is local to this process
DftiGetValueDM(fft,CDFT_LOCAL_NX,&local_nx); // comment: this is how many DPs along the most strided dimension this process will own
DftiGetValueDM(fft,CDFT_LOCAL_X_START,&local_x_start); // comment: this is at which DP the process's DPs along the most strided dimension start
DftiSetValueDM(fft,DFTI_PLACEMENT,DFTI_INPLACE);
DftiSetValueDM(fft,DFTI_FORWARD_SCALE,1.0);
DftiSetValueDM(fft,DFTI_BACKWARD_SCALE,1.0/(double)PO);
U=(doublecomplex*)malloc(size*sizeof(doublecomplex));
V=(doublecomplex*)malloc(size*sizeof(doublecomplex));
W=(doublecomplex*)malloc(size*sizeof(doublecomplex));
DftiSetValueDM(fft,CDFT_WORKSPACE,W);
DftiCommitDescriptorDM(fft);
double*u=(double*)U;
double*v=(double*)V;
for(n=0;n
u=v=(double)n; // correction: the size obtained in line 14 is not neccessarily equal to P0 (please look at my previous post). You need to make it a doubly nested loop going over an array of local_nx x (LENGTHES[1] * (LENGTHES[2]/2 + 1)) reals
DftiComputeForwardDM(fft,U);
DftiComputeBackwardDM(fft,U);
maxerr=0.0;
for(n=0;n
{
err=fabs(u-v);
if(err>maxerr)
maxerr=err;
}
printf("maxerr=%g\n",maxerr);
free(U);
free(V);
free(W);
DftiFreeDescriptorDM(&fft);
MPI_Finalize();

I hope this helps.

Best regards,
-Vladimir

Jozsef · ‎02-10-2011

Yes. For more than one process, I would compute the end of the loops as "size / (N/2+1) * N", which is the number of doubles the given process owns -- as I understand.

This is the same as PO for a single process, since in that case local_nx = N. Correct?

With the above in mind, the behavior is the same.

J

Vladimir_Petrov__Int · ‎02-11-2011

First, you correctly assume that for a single MPI proc (mpiexec -np 1 ...) local_nx will be equal to global N.

Next, size >= local_nx * N * (N/2 + 1) - for some transforms a larger buffer is required, hence the equality not necessarily holds.

Lastly, your data have to be layed out correctly (so called CCE layout) and that is done as follows:

double *input;
double complex *output = input;

int n2_padded = 2*(N/2 + 1);
int i0, i1, i2;

// init input
for (i0=0; i0 for (i1=0; i1 for (i2=0; i2 input[(i0*N + i1)*n2_padded + i2] = get_data_from_input_global_array_at(local_start_x + i0, i1, i2);
}
}
}

// do fft

// use the computed data
for (i0=0; i0 for (i1=0; i1 for (i2=0; i2 put_computed_data_into_output_global_array_at(local_start_x_out + i0, i1, i2) = output[(i0*N + i1)*(N/2 + 1) + i2] = ;
}
}
}

best regards,
-Vladimir

Jozsef · ‎02-11-2011

This is fine, but I'm trying to get something simpler right first: on a single process (no assembly) and I'm simply trying to get the initial data back after a forward and a backward fft.

For example:

(1) Generate a bunch of random doubles into the NxNxN array U and make a copy in V.

(2) Do an in-place forward and backward FFT on U.

(3) Compare U and V, which should be the same (modulo floating point errors), irrespective of the (real, i.e. not complex) data storage between the transforms.

I tried this for non-cluster fft and it works for both real and complex forward domains. However, for cluster fft it only works for a complex domain and doesn't for a real one.

J

Vladimir_Petrov__Int · ‎02-11-2011

Jozsef,

First of all, Cluster FFT is not to be used on a single MPI process. It's the DFTI's (non-cluster FFT's) realm.
Second of all, it might be that this case does not even work with CFFT. I will double check and report back soon.

Best regards,
-Vladimir

Jozsef · ‎02-11-2011

Ok. The problem was not Cluster FFT-related. Rather, I didn't correctly test the array in CCE format.

In particular, the loop

[cpp]  maxerr = 0.0;
  for ( n = 0; n < PO; n++ )
  {
    err = fabs(u-v);
    if (err > maxerr)
      maxerr = err;
  }
  printf("maxerr = %gn",maxerr);[/cpp]

is only fine for testing contiguous data, such as result from complex-to-complex transforms. The CCE format has some unused blocks, at the end of each row (2 doubles), which should not be compared.

Instead of the above, I need something like

[cpp]  maxerr = 0.0;
  for ( n = 0; n < N*N; n++ )
    for ( m = 0; m < N; m++ )
    {
      err = fabs(u[n*2*(N/2+1)+m] - v[n*2*(N/2+1)+m]);
      if (err > maxerr)
        maxerr = err;
     }
  printf("maxerr = %gn",maxerr);[/cpp]

which gives the correct small error. Now all this works for dfti, next is to get it right for cdft, as well.

Thanks,
J

Jozsef · ‎02-12-2011

As I suspected, the issue was the same for cluster fft.

J

Vladimir_Petrov__Int · ‎02-13-2011

Jozsef,

I'm glad to hear that the Good vanquished the evil (once again).

Best regards,
-Vladimir