My input data is a matrix of N rows X M columns. Each cell is a float.
The first stage:
Subtract each row from its previous one.
The output data is (N-1) rows X M columns.
For the subtraction, I think (but am not sure) that I have to keep the input matrix and write the output to a new matrix.
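As a rough illustration of the first stage, here is a minimal host-side C sketch of the out-of-place subtraction. The function name `row_diff`, the row-major layout, and the sign convention (previous row minus current row) are assumptions for illustration, not something stated in the question. Writing to a separate output matrix matters once each row is handled by a different work-item: row i is read both when computing output row i-1 and output row i, so overwriting it in place would create a read/write race.

```c
#include <assert.h>
#include <stddef.h>

/* Out-of-place row differencing on a row-major matrix (hypothetical helper).
 * in  : n rows x m columns
 * out : (n - 1) rows x m columns, out[i][j] = in[i][j] - in[i + 1][j]
 * Assumed sign convention: each row is subtracted from its previous one.
 * Swap the operands if the opposite convention is intended. */
void row_diff(const float *in, float *out, size_t n, size_t m)
{
    assert(n >= 2 && m >= 1);
    for (size_t i = 0; i + 1 < n; ++i) {
        for (size_t j = 0; j < m; ++j) {
            /* Each output cell depends only on two input cells, so the
             * (i, j) iterations are independent and map naturally onto
             * one work-item per cell in a kernel. */
            out[i * m + j] = in[i * m + j] - in[(i + 1) * m + j];
        }
    }
}
```

Calling `row_diff(in, out, N, M)` with `out` sized for (N-1) x M floats gives a reference result that a kernel version can be compared against.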
The second stage:
FFT on each row. The output is (N-1) rows X M columns.
For the FFT process, the work item is a butterfly. For M items in a row I have M/4 butterflies.
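The figure of M/4 butterflies per row is consistent with a radix-4 decomposition, where each butterfly consumes four points (a radix-2 FFT would need M/2 butterflies per stage). Assuming that reading, and assuming M is an exact power of the radix, the sketch below works out how many butterflies, stages, and work-items are involved. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Butterflies per stage for one row of length m, given the FFT radix.
 * Each butterfly consumes `radix` points, so a stage needs m / radix. */
size_t butterflies_per_stage(size_t m, size_t radix)
{
    assert(radix >= 2 && m % radix == 0);
    return m / radix;
}

/* Number of stages, i.e. log_radix(m). Assumes m is an exact power of radix. */
size_t fft_stages(size_t m, size_t radix)
{
    size_t stages = 0;
    assert(radix >= 2 && m >= radix);
    while (m > 1) {
        assert(m % radix == 0);
        m /= radix;
        ++stages;
    }
    return stages;
}

/* Work-items per stage across all rows, with one work-item per butterfly. */
size_t work_items_per_stage(size_t rows, size_t m, size_t radix)
{
    return rows * butterflies_per_stage(m, radix);
}
```

For example, with M = 1024 and radix 4, each row needs 256 butterflies per stage over 5 stages, so N-1 rows need (N-1) * 256 work-items per stage.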
Is it possible to do the two operations without going back to the host after the first stage?
It depends on your hardware. If you have a 5th or 6th generation Intel processor (Broadwell or Skylake), which supports OpenCL 2.0, you can enqueue the second kernel from the first one using device-side enqueue. See https://software.intel.com/en-us/articles/gpu-quicksort-in-opencl-20-using-nested-parallelism-and-wo... for an example of how to do that, or https://software.intel.com/en-us/articles/sierpinski-carpet-in-opencl-20 for a toy example.