The linked article says that "The bad news is that edge detection on read-only images isn’t one of them. There’s no way to explicitly control data transfer, so unnecessary data transfer is unavoidable." for DO CONCURRENT accelerator offload.
I want to figure out whether, with "!$OMP TARGET DATA", "!$OMP TARGET ENTER DATA", or other methods, the compiler could recognize that the variables are already on the device and avoid transferring them in a DO CONCURRENT accelerator offload construct.
I think this is critical and necessary, because DO CONCURRENT accelerator offload makes coding much easier.
Thanks.
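For reference, this is a minimal sketch of the pattern I have in mind; the array names and sizes are only placeholders, and whether the compiler actually reuses the mapping is exactly the open question:

```fortran
program offload_reuse
  implicit none
  integer, parameter :: n = 1000000
  real :: a(n), b(n)
  integer :: i, step

  a = 1.0
  ! Map the arrays to the device once, before the time loop.
  !$omp target enter data map(to: a) map(alloc: b)

  do step = 1, 100
     ! Ideally the compiler would see that a and b are already
     ! resident on the device and skip per-construct transfers.
     do concurrent (i = 1:n)
        b(i) = 2.0 * a(i)
     end do
  end do

  ! Copy the result back and release the device storage.
  !$omp target exit data map(from: b) map(delete: a)
  print *, b(1)
end program offload_reuse
```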
The paper referred to in the Intel document: https://permalink.lanl.gov/object/tr?what=info:lanl-repo/lareport/LA-UR-23-23992
Until I read this report, I had some fears about whether Fortran would still be alive in 20 years; after I read it, I knew my fears were misplaced.
I have not had such a good laugh since I found out that female welders in the UK in WW2 were paid 3 pounds per week while men were paid 5 pounds per week, and the UK government paid the shipyards 7 pounds per week for every welder; you do the math.
Never let it be said we cannot exploit people. I am joking about the laugh; I have 4 daughters.
With DO CONCURRENT, each DO CONCURRENT construct is treated separately. We currently do not look at data movement that may have happened outside the DO CONCURRENT, so it will not be aware of any OpenMP TARGET DATA mapping you may have done. It is something we are looking at possibly adding in the future.
OpenMP gives much finer control of parallelism than DO CONCURRENT.
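For example, with explicit OpenMP offload a TARGET DATA region does keep the arrays resident across iterations, so the per-loop transfers disappear; this sketch uses illustrative names and a placeholder kernel:

```fortran
program omp_offload
  implicit none
  integer, parameter :: n = 1000000
  real :: a(n), b(n)
  integer :: i, step

  a = 1.0
  ! The target data region keeps a and b on the device for the
  ! whole time loop; only the enclosed kernels run per iteration.
  !$omp target data map(to: a) map(from: b)
  do step = 1, 100
     !$omp target teams distribute parallel do
     do i = 1, n
        b(i) = 2.0 * a(i)
     end do
  end do
  !$omp end target data
  print *, b(1)
end program omp_offload
```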
After a year of study and exploration, I have found something new.
I rewrote the code of a 2D numerical model in DO CONCURRENT form.
On an Intel Arc A750 GPU with the ifx compiler, the model is slower than the serial code running on one CPU core of an i5-12400F @ 2.5 GHz.
But on an NVIDIA A800 GPU with the nvfortran compiler, the model is 24 times faster than the serial code running on one CPU core of a Xeon Gold 6348 @ 2.6 GHz.
I don't know how NVIDIA solved this problem. Maybe "Unified Memory"? Or do they automatically detect whether each data transfer is necessary? Or is data transfer natively faster on the NVIDIA platform?
I wish the Intel compiler had a similar solution for this problem.
Is there any news?