Solved: The kernel that I have is not

A2009 · ‎11-26-2014

My goal is to compute 1D FFT of a 3D array along all its dimensions. This 3D array is stored as a 1D array in a columnwise fashion. For example,

for( int k = 0; k < nk; k++ ) // Loop through the height.
    for( int j = 0; j < nj; j++ ) // Loop through the rows.
        for( int i = 0; i < ni; i++ ) // Loop through the columns.
        {
            ijk = i + ni * j + ni * nj * k;
            my3Darray[ ijk ] = 1.0;
        }

In fact, I need to pass all the rows/columns/height 1D vectors of `my3Darray` to FFT function of MKL library (by height, I mean the vectors in the third dimension off the array) and compute their Fourier transforms. Is there a possibility to utilize MKL Fourier libraries for such an 3D array that is stored as a 1D? I would like to perform the following operations coded by the example below. Please advise.

Heightwise FFT:

for( int i = 0; i < ni; i++ ) // Loop through the columns.
    for( int j = 0; j < nj; j++ ) // Loop through the rows.
    {
        for( int k = 0; k < nk; k++ ) // Loop through the heights.
        {
            ijk = i + ni * j + ni * nj * k;
            myvec[ k ] = my3Darray[ ijk ];
            fft_function( myvec, myvec_processed );
        }

        // Store the results in a new array, which is storing myvec_processed in my3Darray_fft_values.
        for( int k = 0; k < nk; k++ ) // Loop through the heights.
        {
            ijk = i + ni * j + ni * nj * k;
            my3Darray_fft_values[ ijk ] = myvec_processed[ k ];
        }
    }

Columnwise FFT:

for( int k = 0; k < nk; k++ ) // Loop through the height.
    for( int j = 0; j < nj; j++ ) // Loop through the rows.
    {
        for( int i = 0; i < ni; i++ ) // Loop through the columns.
        {
            ijk = i + ni * j + ni * nj * k;
            myvec[ i ] = my3Darray[ ijk ];
            fft_function( myvec, myvec_processed );
        }

        // Store the results in a new array, which is storing myvec_processed in my3Darray_fft_values.
        for( int i = 0; i < ni; i++ ) // Loop through the columns.
        {
            ijk = i + ni * j + ni * nj * k;
            my3Darray_fft_values[ ijk ] = myvec_processed[ i ];
        }
    }

row-wise FFT:

for( int i = 0; i < ni; i++ ) // Loop through the column.
    for( int k = 0; k < nk; k++ ) // Loop through the height.
    {
        for( int j = 0; j < nj; j++ ) // Loop through the row.
        {
            ijk = i + ni * j + ni * nj * k;
            myvec[ j ] = my3Darray[ ijk ];
            fft_function( myvec, myvec_processed );
        }

        // Store the results in a new array, which is storing myvec_processed in my3Darray_fft_values.
        for( int j = 0; j < nj; j++ ) // Loop through the row.
        {
            ijk = i + ni * j + ni * nj * k;
            my3Darray_fft_values[ ijk ] = myvec_processed[ j ];
        }
    }

Evarist_F_Intel · ‎11-27-2014

Just to be on the same page... Suppose you have 3D array A of size NKxNJxNI, where:

1st dimension has size NI and has stride 1
2nd dimension has size NJ and has stride NI
3rd dimension has size NK and has stride NJ*NI

So element A should be accessed as A[k*NJ*NI + j*NI + i]

It looks like you do 3D FFT. If so, for highest performance please use direct 3D DFTI call. You can do this using the code below:

DFTI_DESCRIPTOR_HANDLE desc = 0;
MKL_LONG sizes[] = {NK, NJ, NI};
MKL_LONG strides[] = { 0, NJ*NI, NI, 1 };
DftiCreateDescriptor(&desc, DFTI_PREC, DFTI_COMPLEX, 3, sizes);
DftiSetValue(desc, DFTI_INPUT_STRIDES,  strides);
DftiSetValue(desc, DFTI_OUTPUT_STRIDES, strides);
DftiCommitDescriptor(desc);

However, if you have to perform all 1D FFT separately, you can do it in the following way (no need to reshape A):

DFTI_DESCRIPTOR_HANDLE desc_x = 0, desc_y = 0, desc_z = 0;

DftiCreateDescriptor(&desc_x, DFTI_PREC, DFTI_COMPLEX, 1, NI);
DftiSetValue(desc_x, DFTI_NUMBER_OF_TRANSFORMS, NK*NJ);
DftiSetValue(desc_x, DFTI_INPUT_DISTANCE,  NI);
DftiCommitDescriptor(desc_x);

MKL_LONG strides_y[] = { 0, NI };
DftiCreateDescriptor(&desc_y, DFTI_PREC, DFTI_COMPLEX, 1, NJ);
DftiSetValue(desc_y, DFTI_NUMBER_OF_TRANSFORMS, NI);
DftiSetValue(desc_y, DFTI_INPUT_DISTANCE,  1);
DftiSetValue(desc_y, DFTI_INPUT_STRIDES,  strides_y);
DftiCommitDescriptor(desc_y);

MKL_LONG strides_z[] = { 0, NJ*NI };
DftiCreateDescriptor(&desc_z, DFTI_PREC, DFTI_COMPLEX, 1, NZ);
DftiSetValue(desc_z, DFTI_NUMBER_OF_TRANSFORMS, NJ*NI);
DftiSetValue(desc_z, DFTI_INPUT_DISTANCE,  1);
DftiSetValue(desc_z, DFTI_INPUT_STRIDES,  strides_z);
DftiCommitDescriptor(desc_z);


// FFT over X
DftiComputeForward(desc_x, A);

// some work here

// FFT over Y
for (int k = 0; k < NK; ++k)
  DftiComputeForward(desc_y, &A[k*NJ*NI]);

// some work here

// FFT over Z
DftiComputeForward(desc_z, A);

Please note, that in case of FFT over Y dimension (NJ) there are to different strides (+1 over NI, and +NJ*NI over NK). They could not be unified. MKL DFTI currently does not support 2D strides for NUMBER_OF_TRANSFORMS. That is why we have to make a loop in this case.

View solution in original post

Evarist_F_Intel · ‎11-26-2014

Hi,

MKL DFTI can handle such a situation. In fact your 3D array of sizes NKxNJxNI can be considered (from DFTI point of view) as 2D array of sizes (NK*NJ)xNI.

For example you can use the following code:

DFTI_DESCRIPTOR_HANDLE desc = 0;
DftiCreateDescriptor(&desc, DFTI_PREC, DFTI_COMPLEX, 1, (MKL_LONG)ni);
DftiSetValue(desc, DFTI_NUMBER_OF_TRANSFORMS, (MKL_LONG)(nk*nj));
DftiSetValue(desc, DFTI_INPUT_DISTANCE,  (MKL_LONG)(ni));
DftiSetValue(desc, DFTI_OUTPUT_DISTANCE, (MKL_LONG)(ni));
DftiCommitDescriptor(desc);

A2009 · ‎11-27-2014

Thanks Evarist. Does the code you kindly provided be able to compute all row-wise/column-wise/height-wise FFT as I coded in my question or I should reshape my original 3D array to the following sequences (I prefer not to as it is not efficient)? Please let me know.

Column-wise: (NKxNJ)xNI. This would require no reshape.
Row-wise: (NI*NK)*NJ. This would require reshaping.
Height-wise: (NIxNJ)xNK. This would require reshaping.

Evarist_F_Intel · ‎11-27-2014

Just to be on the same page... Suppose you have 3D array A of size NKxNJxNI, where:

1st dimension has size NI and has stride 1
2nd dimension has size NJ and has stride NI
3rd dimension has size NK and has stride NJ*NI

So element A should be accessed as A[k*NJ*NI + j*NI + i]

It looks like you do 3D FFT. If so, for highest performance please use direct 3D DFTI call. You can do this using the code below:

DFTI_DESCRIPTOR_HANDLE desc = 0;
MKL_LONG sizes[] = {NK, NJ, NI};
MKL_LONG strides[] = { 0, NJ*NI, NI, 1 };
DftiCreateDescriptor(&desc, DFTI_PREC, DFTI_COMPLEX, 3, sizes);
DftiSetValue(desc, DFTI_INPUT_STRIDES,  strides);
DftiSetValue(desc, DFTI_OUTPUT_STRIDES, strides);
DftiCommitDescriptor(desc);

However, if you have to perform all 1D FFT separately, you can do it in the following way (no need to reshape A):

DFTI_DESCRIPTOR_HANDLE desc_x = 0, desc_y = 0, desc_z = 0;

DftiCreateDescriptor(&desc_x, DFTI_PREC, DFTI_COMPLEX, 1, NI);
DftiSetValue(desc_x, DFTI_NUMBER_OF_TRANSFORMS, NK*NJ);
DftiSetValue(desc_x, DFTI_INPUT_DISTANCE,  NI);
DftiCommitDescriptor(desc_x);

MKL_LONG strides_y[] = { 0, NI };
DftiCreateDescriptor(&desc_y, DFTI_PREC, DFTI_COMPLEX, 1, NJ);
DftiSetValue(desc_y, DFTI_NUMBER_OF_TRANSFORMS, NI);
DftiSetValue(desc_y, DFTI_INPUT_DISTANCE,  1);
DftiSetValue(desc_y, DFTI_INPUT_STRIDES,  strides_y);
DftiCommitDescriptor(desc_y);

MKL_LONG strides_z[] = { 0, NJ*NI };
DftiCreateDescriptor(&desc_z, DFTI_PREC, DFTI_COMPLEX, 1, NZ);
DftiSetValue(desc_z, DFTI_NUMBER_OF_TRANSFORMS, NJ*NI);
DftiSetValue(desc_z, DFTI_INPUT_DISTANCE,  1);
DftiSetValue(desc_z, DFTI_INPUT_STRIDES,  strides_z);
DftiCommitDescriptor(desc_z);


// FFT over X
DftiComputeForward(desc_x, A);

// some work here

// FFT over Y
for (int k = 0; k < NK; ++k)
  DftiComputeForward(desc_y, &A[k*NJ*NI]);

// some work here

// FFT over Z
DftiComputeForward(desc_z, A);

Please note, that in case of FFT over Y dimension (NJ) there are to different strides (+1 over NI, and +NJ*NI over NK). They could not be unified. MKL DFTI currently does not support 2D strides for NUMBER_OF_TRANSFORMS. That is why we have to make a loop in this case.

A2009 · ‎11-27-2014

Wonderful explanations, which nicely put into codes. The reason I want to do 1D FFT is to compute 3D convolution of a 3D data by using three 1D FFT over x, y, and z since I did not find a dedicated 3D convolution function in MKL libraries. May I have your thoughts on this as well? From Intel MKL perspective, am I approaching this properly?

Evarist_F_Intel · ‎11-28-2014

Sorry for late response.

I think you are right that you use DFTI for making 3D convolution.

But I do not understand why you are not able to use direct 3D FFT. Do you use different kernels for different direction? Why can't you perform 3D FFT, then make element-wise multiplication with kernel and finally using inverse FFT get the result?

A2009 · ‎11-29-2014

I can use 3D FFT with the same kernel for different direction.

Could you take a look at the following code? I actually used the 3D FFT with the instructions you provided above but I get different results compared to MATLAB.

void FFT3D( double *in, int nRows, int nCols, int nHeights, 
            double *out )
{
    
    int NI = nRows;
    int NJ = nCols;
    int NK = nHeights;
    
    DFTI_DESCRIPTOR_HANDLE fft_desc = 0;
    MKL_LONG sizes[]   = { NK, NJ, NI };
    MKL_LONG strides[] = { 0, NJ*NI, NI, 1 };
    
    MKL_LONG status[6];
    
    status[0] = DftiCreateDescriptor( &fft_desc, DFTI_DOUBLE, DFTI_COMPLEX, 3, sizes);
    
    status[1] = DftiSetValue        (  fft_desc, DFTI_PLACEMENT     , DFTI_NOT_INPLACE);
    status[2] = DftiSetValue        (  fft_desc, DFTI_INPUT_STRIDES , strides);
    status[3] = DftiSetValue        (  fft_desc, DFTI_OUTPUT_STRIDES, strides);
    status[4] = DftiCommitDescriptor(  fft_desc );
    
    status[5] = DftiComputeForward  ( fft_desc, in, out );
    
    for( int i = 0; i < 6; i++ )
        if ( status && !DftiErrorClass(status, DFTI_NO_ERROR) )
           printf("Error: %s\n", DftiErrorMessage( status ));

    DftiFreeDescriptor ( &fft_desc );
    
}

int main()
{
    double a   [5*5*5];
    double afft[7*7*7];
    
    for( int i = 0; i < 5*5*5; i++ )
        a[ i ] = 1.0;
    
    FFT3D( a, 5, 5, 5, afft );
    
    return 0;
}

Here is the MATLAB code:

a = ones(5, 5, 5);
afft = fftn(a);

Evarist_F_Intel · ‎11-29-2014

I see two strange things.

The first one is that you say DFTI_COMPLEX in DftiCreateDescriptor and it means that Complex-to-Complex FFT will be used, but array a is initialized with real values.

If your input is real, there are two ways to handle such situation. The simplest way (but it takes twice the amount of memory) is to make your input array complex with imaginary part is equal to zero. The another way (slightly complicated, but better from performance point of view) is to use Real-to-Complex transform. The thing is that if you have real input data the result of FFT will have some kind of symmetry so you need not keep the whole array, half of array is enough. Please refer to r2c examples and documentation to get clearer picture.

The second problem (or probably not a problem) is that out array has different layout (at least it looks like that it has). If you really want to keep the result 5x5x5 in array with layout 7x7x7 (putting result values in the begging of corresponding direction) you need to set DFTI_OUTPUT_STRIDE properly, e.g. { 0, 7*7, 7, 1 }.

Simple convolution can be implemented in the following way:

void Conv3D(
    MKL_COMPLEX16 *in, MKL_COMPLEX16 *ker, MKL_COMPLEX16 *out,
    int nRows, int nCols, int nHeights)
{
    int NI = nRows;
    int NJ = nCols;
    int NK = nHeights;

    DFTI_DESCRIPTOR_HANDLE fft_desc = 0;
    MKL_LONG sizes[]   = { NK, NJ, NI };
    MKL_LONG istrides[] = { 0, NJ*NI, NI, 1 };
    MKL_LONG istrides[] = { 0, NJ*NI, NI, 1 };

    DftiCreateDescriptor( &fft_desc, DFTI_DOUBLE, DFTI_COMPLEX, 3, sizes);
    DftiSetValue        (  fft_desc, DFTI_INPUT_STRIDES , strides);
    DftiSetValue        (  fft_desc, DFTI_OUTPUT_STRIDES, strides);
    DftiSetValue        (  fft_desc, DFTI_BACKWARD_SCALE, 1,/NI/NJ/NK);
    DftiCommitDescriptor(  fft_desc );

    DftiComputeForward  ( fft_desc, in);
    DftiComputeForward  ( fft_desc, ker);

    for (long long i = 0; i < (long long)NI*NJ*NK; ++i)
        out = in*ker;

    DftiComputeBackward  ( fft_desc, out);

    DftiFreeDescriptor ( &fft_desc );
}

A2009 · ‎11-29-2014

The kernel that I have is not the same size of the input array. Kernel is real and usually of size 3x3x3 or 5x5x5 and my input array is real and has about 300x200x200 elements. For the computation of convolution, I want to perform "not-in-place" FFT on the input array and the kernel and prevent them from getting modified (I need to use them later in my code) and then do the backward computation "in-place". Here is the example that I wrote based on your explanations.

#include <stdio.h>
#include "mkl.h"

void Conv3D(
    double *in, double *ker, double *out,
    int nRows, int nCols, int nHeights)
{
    
    int NI = nRows;
    int NJ = nCols;
    int NK = nHeights;

	double *in_fft  = new double [NI*NJ*NK];
	double *ker_fft = new double [NI*NJ*NK];

    DFTI_DESCRIPTOR_HANDLE fft_desc = 0;
    MKL_LONG sizes[]   = { NK, NJ, NI };
    MKL_LONG strides[] = { 0, NJ*NI, NI, 1 };

    DftiCreateDescriptor( &fft_desc, DFTI_DOUBLE, DFTI_REAL, 3, sizes     );
	DftiSetValue        (  fft_desc, DFTI_PLACEMENT     , DFTI_NOT_INPLACE);   // Out-of-place computation.
    DftiSetValue        (  fft_desc, DFTI_INPUT_STRIDES , strides         );
    DftiSetValue        (  fft_desc, DFTI_OUTPUT_STRIDES, strides         );
    DftiSetValue        (  fft_desc, DFTI_BACKWARD_SCALE, 1/NI/NJ/NK      );
    DftiCommitDescriptor(  fft_desc );

    DftiComputeForward  (  fft_desc, in , in_fft  );
    DftiComputeForward  (  fft_desc, ker, ker_fft );

    for (long long i = 0; i < (long long)NI*NJ*NK; ++i )
        out = in_fft*ker_fft;

	// In-place computation.
	DftiSetValue        (  fft_desc, DFTI_PLACEMENT, DFTI_INPLACE );
	DftiCommitDescriptor(  fft_desc      );
    DftiComputeBackward (  fft_desc, out );

    DftiFreeDescriptor  ( &fft_desc );

	delete[] in_fft;
	delete[] ker_fft;

}

int _tmain(int argc, _TCHAR* argv[])
{
	int n = 10;
	int nkernel = 3;

	double *a          = new double [n*n*n]; // This array is real.
	double *aconvolved = new double [n*n*n]; // The convolved array is also real.
	double *kernel     = new double [nkernel*nkernel*nkernel]; // kernel is real.
    
	// Fill the array with some numbers.
    for( int i = 0; i < n*n*n; i++ )
        a[ i ] = 1.0;

	// Fill the kernel with some numbers.
	for( int i = 0; i < nkernel*nkernel*nkernel; i++ )
        kernel[ i ] = 1.0;

	// Calculate the convolution.
    Conv3D( a, kernel, aconvolved, n, n, n );
    
	printf("Convolved:\n");
    for( int i = 0; i < n*n*n; i++ )
		printf( "%15.8f\n", aconvolved );

	delete[] a;
	delete[] kernel;
	delete[] aconvolved;

	return 0;
}

Evarist_F_Intel · ‎11-29-2014

For convolution through FFT both arrays should have the same size. So the kernel should have the size NxNxN and should be padded with zeroes (in case you are interested in cyclic convolution). Otherwise, you need to pad source array and kernel array to the size of (n+nkernel-1)x(n+nkernel-1)x(n+nkernel-1). If your kernel is really small (3x3x3, 5x5x5) probably it would not be beneficial to use FFT for performing convolution. You can roughly estimate computational difference using the following formulas:

convolution: 2*(n^3)*(nkernel^3)
FFT: 15*(n^3)*log(n^3) + (n^3)

Also, please note that line

DftiSetValue        (  fft_desc, DFTI_BACKWARD_SCALE, 1/NI/NJ/NK      );

should be replaced with

DftiSetValue        (  fft_desc, DFTI_BACKWARD_SCALE, 1./((double)NI*NJ*NK)      );

Otherwise backward scale will be zero. (I mistyped in my previous post -- it should be dot there, not comma :) )

A2009 · ‎11-29-2014

Thanks for clarification. In my search for a function to do convolution in 3D, I was not able to find any in MKL libraries. There are functions for 1D and 2D but not in 3D. That is why I am trying to use MKL FFT libraries to get the convolution because it is highly optimized. Writing a direct 3D convolution is not difficult but writing it efficiently and robustly is not what I can do.

I would be thankful if you could advise me to choose the right tool for my application and the size of the arrays I have. Can MKL help me?