The page C/C++ OpenMP* and DPC++ Composability contains an example of using OpenMP to "offload" to the GPU. However, when I tried it, I found that the example actually took more CPU time when compiled with the OpenMP pragmas than when the pragmas are removed.
I created two files: openMP.cpp (exactly the example from the page) and noOpenMP.cpp (the same, but with the pragmas commented out). I compiled and ran them with the following results:
ian@i3:~/openmp$ icpx -o withOpenMP -fsycl -fiopenmp -fopenmp-targets=spir64 openMP.cpp
ian@i3:~/openmp$ icpx -o withoutOpenMP -fsycl -fiopenmp -fopenmp-targets=spir64 noOpenMP.cpp
ian@i3:~/openmp$ time ./withOpenMP
Vec = 512
Pi = 3.14159
real 0m0.931s
user 0m1.561s
sys 0m0.125s
ian@i3:~/openmp$ time ./withoutOpenMP
Vec = 512
Pi = 3.14159
real 0m0.180s
user 0m0.155s
sys 0m0.024s
With the pragmas, the program takes about 5 times as long and uses nearly 10 times as much CPU (user + sys) as it does when the pragmas are omitted. As an attempt to offload the CPU, this is a spectacular failure.
Are there any examples where OpenMP can actually be used to offload the CPU?
Thanks for reaching out to us.
By default, OpenMP runs on the CPU.
To offload to a specific GPU target, you must explicitly pass the -fiopenmp -fopenmp-targets=spir64 options (Linux*).
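For reference, the kind of region these options offload can be sketched as follows. This is a minimal sketch, not the example from the Composability page: the function name estimate_pi and its step count are hypothetical, and it assumes the target device supports double precision. If the file is compiled without any OpenMP options, the pragma is ignored and the loop runs serially on the host.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not taken from the Intel example): estimates Pi by
// midpoint integration of 4 / (1 + x^2) over [0, 1].
double estimate_pi(std::int64_t steps) {
    assert(steps > 0);
    const double width = 1.0 / static_cast<double>(steps);
    double sum = 0.0;

    // With -fiopenmp -fopenmp-targets=spir64 this loop is a candidate for
    // GPU offload. Without OpenMP options the pragma is ignored and the
    // loop runs serially on the host.
    #pragma omp target teams distribute parallel for reduction(+:sum) map(tofrom:sum)
    for (std::int64_t i = 0; i < steps; ++i) {
        const double x = (static_cast<double>(i) + 0.5) * width;
        sum += 4.0 / (1.0 + x * x);
    }
    return sum * width;
}
```

Calling estimate_pi(1000000) from main gives a value close to 3.14159. Whether the offloaded build uses less CPU than the host build depends on how the runtime start-up and data-transfer cost compares with the cost of the loop itself, so it is worth timing both builds.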
Please see the below scenarios:
1. To enable both OpenMP and DPC++/SYCL constructs, use the below command:
icpx -fsycl -fiopenmp -fopenmp-targets=spir64 offloadOmp_dpcpp.cpp
-fsycl option enables DPC++
-fiopenmp -fopenmp-targets=spir64 option enables OpenMP* offload for GPU
** If no target is specified, the OpenMP code runs on the host CPU by default.
2. If the code does not contain OpenMP offload, but only normal OpenMP code, use the below command.
icpx -fsycl -fiopenmp omp_dpcpp.cpp
3. If there is no OpenMP code, then use the below command.
icpx -fsycl noOpenMP.cpp
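To confirm which of these scenarios a given build actually falls into, a small runtime check can help. The following is only a sketch: the helper name offload_status is hypothetical, while omp_get_num_devices and omp_is_initial_device are standard OpenMP runtime routines. It assumes the compiler defines _OPENMP only when OpenMP is enabled.

```cpp
#include <string>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: reports whether OpenMP is enabled in this build and
// whether a target region really executes on an offload device.
std::string offload_status() {
#ifdef _OPENMP
    if (omp_get_num_devices() == 0) {
        return "OpenMP enabled, no offload device available";
    }
    int on_host = 1;
    // The flag is set inside the target region, so it reflects where the
    // region actually ran (the runtime may fall back to the host).
    #pragma omp target map(tofrom: on_host)
    {
        on_host = omp_is_initial_device();
    }
    return on_host ? "target region ran on the host"
                   : "target region ran on an offload device";
#else
    return "compiled without OpenMP";
#endif
}
```

Printing the returned string at program start shows whether the build flags had the intended effect before any timings are compared.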
We compiled and ran the OpenMP and noOpenMP codes as shown below, and observed only a minimal change in elapsed time and CPU time between the two cases.
u67125@s001-n066:~/openmp$ icpx -fsycl -fiopenmp -fopenmp-targets=spir64 openMP.cpp -o withOpenMP
u67125@s001-n066:~/openmp$ time ./withOpenMP
Vec = 512
Pi = 3.14159
real 0m0.415s
user 0m4.042s
sys 0m0.198s
u67125@s001-n066:~/openmp$ icpx -fsycl noOpenMP.cpp -o withoutOpenMP
u67125@s001-n066:~/openmp$ time ./withoutOpenMP
Vec = 512
Pi = 3.14159
real 0m0.301s
user 0m1.950s
sys 0m0.115s
Thanks & Regards,
Thank you for your reply. I was aware of the need to use "-fiopenmp -fopenmp-targets=spir64" to offload to the GPU. Indeed, as you can see from my original question, I used those options.
Although your results show a much smaller difference (probably due to the different hardware), your "offload" case also used more CPU than the pure CPU case.
My original question, "Are there any examples where OpenMP can actually be used to offload the CPU?", remains. Can you provide an example of offloading significantly reducing CPU usage?
What is this "screenshot" that you refer to? Do you mean the terminal dialogue on host "u67125@s001-n066"?
In that case the "withOpenMP" case takes 4.042s + 0.198s = 4.240s CPU
whereas the "withoutOpenMP" case takes 1.950s + 0.115s = 2.065s CPU.
The "withOpenMP" case takes over twice as much CPU as the "withoutOpenMP" case. Far from offloading the CPU, it is doubling the CPU load.
If you are not referring to that terminal dialogue, what are you referring to?
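The CPU totals quoted above are the user and sys figures from time added together, compared against the real (elapsed) figure. The same comparison can be made from inside a program. The following is a rough sketch: the names Timing and measure are hypothetical, and it assumes a POSIX-like system where std::clock reports processor time consumed by the whole process, including all of its threads.

```cpp
#include <chrono>
#include <ctime>

struct Timing {
    double wall_seconds;  // elapsed real time
    double cpu_seconds;   // processor time consumed by the process
};

// Hypothetical helper: runs a callable and records both elapsed time and
// process CPU time, so the two can be compared as with the time command.
template <typename Work>
Timing measure(Work&& work) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();

    work();

    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    return { std::chrono::duration<double>(wall_end - wall_start).count(),
             static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC };
}
```

Wrapping only the offloaded region in measure, rather than timing the whole executable, separates the cost of the region itself from one-off process and runtime start-up costs.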
>> "If you are not referring to that terminal dialogue, what are you referring to?"
I was referring to the screenshot provided in my previous post. (The screenshot might take a few seconds to be updated at your end.)
We can see from the below screenshot that, with OpenMP, the CPU time is reduced to about half of that without OpenMP.
Thanks & Regards,
What is the hardware for this speed improvement? What CPU and what GPU?
I am using a Celeron 3965U CPU, and a Kaby Lake HD 610 (device 5906) GPU.
Even if you are achieving some CPU reduction, halving the CPU usage when you are offloading the complete task is pretty lame. In my previous experience of offloading (about ten years ago, using CUDA), I took a task that would have used up the whole CPU a few times over, and it used about 2% of the CPU when offloaded to the GPU. I wasn't necessarily expecting the offload to be that good, but I was definitely hoping for a factor of 10 reduction in CPU usage.
My CPU & GPU details are given below:
CPU: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
GPU: Intel(R) UHD Graphics 630 [0x3e98]
Could you please let us know the details of the CPU that you used 10 years ago?
Thanks & Regards,
I regret that I no longer have the hardware, nor do I remember precisely what it was. I have checked the project archives. They include all of the code, but do not record the precise hardware.
The PC was a reasonably typical desktop machine of the time. The NVIDIA GPU was a mid-range one, one of the cheaper ones capable of GPGPU processing.
Sorry I cannot be more precise.