Solved: CDFT transposed data distribution

DavidBayer · ‎08-18-2024

Hi,

I am currently integrating Intel CDFT into my project and ran into some problems with the transposed data distribution. As the documentation says, CDFT uses slab (1D) decomposition along the slowest axis. The distribution can be obtained by calling `DftiGetValueDM(...)` with the parameters `CDFT_LOCAL_NX` and `CDFT_LOCAL_X_START`. Example (forward transform):

src[loc_n0][n1][n2] -> dst[loc_n0][n1][n2],

where `loc_n0` is the size returned for `CDFT_LOCAL_NX`.

If I want to use the transposed data distribution, I set the `DFTI_TRANSPOSE` parameter to `DFTI_ALLOW` via the `DftiSetValueDM(...)` function. If I understand the documentation correctly, the output data should then be distributed along the second slowest axis. Example (forward transform):

src[loc_n0][n1][n2] -> dst[loc_n1][n2][n0]

where `loc_n0` is the size returned for `CDFT_LOCAL_NX`. Do I have this right? And if so, how can I find out the value of `loc_n1`?
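
To make the two layouts concrete, here is a minimal C sketch of the index arithmetic they imply. It assumes both local slabs are stored contiguously in row-major order, exactly as the bracket notation suggests. The helper names `src_index` and `dst_index` are hypothetical and not part of CDFT.

```c
#include <stddef.h>

/* Flat offset of src[i0_local][i1][i2] inside a local slab of shape
 * [loc_n0][n1][n2] (assumption: contiguous, row-major storage). */
static size_t src_index(size_t i0_local, size_t i1, size_t i2,
                        size_t n1, size_t n2)
{
    return (i0_local * n1 + i1) * n2 + i2;
}

/* Flat offset of dst[i1_local][i2][i0] inside a local slab of shape
 * [loc_n1][n2][n0] (assumption: contiguous, row-major storage). */
static size_t dst_index(size_t i1_local, size_t i2, size_t i0,
                        size_t n2, size_t n0)
{
    return (i1_local * n2 + i2) * n0 + i0;
}
```

Note that in the transposed layout the full `n0` axis becomes the fastest-varying one, so consecutive `i0` values are adjacent in memory.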

Thank you for your response.

David

Ruqiu_C_Intel · ‎09-23-2024

Hello David,

Thank you for contacting us. We carefully reviewed the CDFT source code to answer your question.

Yes, you are right. If `DFTI_TRANSPOSE=DFTI_ALLOW`, then you can expect src[loc_n0][n1][n2] -> dst[loc_n1][n2][n0]. Although `CDFT_LOCAL_OUT_NX` does not work in this case, you can query `CDFT_LOCAL_OUT_X_START` to obtain the row offset in the output array, from which the `loc_n1` parameter can be derived. I hope this helps.

Regards,

Ruqiu

DavidBayer · ‎09-24-2024

Hello Ruqiu,

thank you for your answer. So I cannot use `CDFT_LOCAL_OUT_NX` to query the local part, but I can query `CDFT_LOCAL_OUT_X_START` to find the offset for the current process and then derive the local size as the difference between neighboring processes' offsets?
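
As a rough illustration of that idea, the sketch below computes a local size from a table of start offsets. It assumes the offsets of all processes are already available in rank order (for example gathered with `MPI_Allgather`, which is not shown) and that they increase with rank. The function name `count_from_starts` is hypothetical.

```c
#include <stddef.h>

/* starts[r] is the start offset reported by rank r (assumed to be
 * non-decreasing in r); n is the full length of the distributed axis.
 * The local size is the gap to the next rank's offset, or to n for
 * the last rank. */
static size_t count_from_starts(const size_t *starts, size_t p,
                                size_t rank, size_t n)
{
    size_t next = (rank + 1 < p) ? starts[rank + 1] : n;
    return next - starts[rank];
}
```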

I did a bit more research, went through the MKL CDFT FFTW3 wrappers, and found that the wrapper uses this formula to compute the distribution along the given axis (with integer division):

localN = (N + (P - 1)) / P for first N % P processes,

localN = N / P for the rest of the processes.

where N is the axis size, P is the number of processes, and localN is the local part for each process. If I am correct, this is the same distribution of work as used by multi-GPU cuFFT. Can I take it as the right way to compute the data distribution?
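
For reference, a minimal C sketch of this formula. For the first `N % P` ranks, `(N + (P - 1)) / P` equals `N / P + 1`, which is what `local_count` returns. The start offset in `local_start` is an assumption on my side: it presumes that ranks own contiguous blocks in rank order. Both function names are hypothetical and not part of MKL.

```c
#include <stddef.h>

/* Number of slices owned by `rank` when n slices are split over
 * p processes: the first n % p ranks get one extra slice. */
static size_t local_count(size_t n, size_t p, size_t rank)
{
    return n / p + (rank < n % p ? 1 : 0);
}

/* Start offset of `rank`, assuming contiguous blocks in rank order:
 * every earlier rank contributes n / p slices, plus one extra for
 * each earlier rank that is among the first n % p. */
static size_t local_start(size_t n, size_t p, size_t rank)
{
    size_t q = n / p;
    size_t r = n % p;
    return rank * q + (rank < r ? rank : r);
}
```

For example, with N = 10 and P = 4 the local sizes are 3, 3, 2, 2 and the start offsets are 0, 3, 6, 8.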

Anyway, it would be great if the MKL documentation described how to obtain the transposed data distribution. I think it would also be really helpful to be able to obtain the distribution by querying the `CDFT_LOCAL_OUT_NX` parameter.

Thank you very much!

Best regards

David

Ruqiu_C_Intel · ‎09-29-2024

Hi David,

The formula is correct. oneMKL users can use it to determine the data distribution instead of querying `CDFT_LOCAL_OUT_X_START`.

We are also glad to let you know that we will improve the behavior of `CDFT_LOCAL_OUT_NX` for the case `DFTI_TRANSPOSE=DFTI_ALLOW` in an upcoming product release.

Regards,

Ruqiu