
Device Build Issue

Laurence
Beginner

My code builds and runs fine on the CPU. It also builds fine for the GPU, but shortly after it starts running it is killed for using too much memory. This happens on the first queue.submit call and stops happening if I comment out some of the later kernels. My assumption is that the first queue.submit call causes all kernels to be compiled for the device, and that one of the kernels I commented out triggers the excessive memory use. Is my understanding correct, and if so, how can I debug it?

 

RahulV_intel
Moderator

Hi,


The DPC++ runtime executes kernels asynchronously. If the kernels have no data dependencies between them, they can execute simultaneously. One way to serialize them is to put a queue.wait() call after every kernel submission, as in the sketch below.
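For reference, a minimal sketch of that pattern (the kernel bodies and variable names are illustrative only, not taken from the code in this thread):

#include <sycl/sycl.hpp>   // use <CL/sycl.hpp> on older oneAPI releases

int main()
{
    sycl::queue q;
    constexpr size_t n = 1024;
    double *data = sycl::malloc_shared<double>(n, q);

    // First kernel: initialize the data.
    q.submit([&](sycl::handler &h) {
        h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { data[i] = 1.0; });
    });
    q.wait();   // block until the first kernel has finished

    // Second kernel: only submitted after the wait above has returned.
    q.submit([&](sycl::handler &h) {
        h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { data[i] *= 2.0; });
    });
    q.wait();

    sycl::free(data, q);
    return 0;
}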


In your case, the memory is probably being used up by multiple kernels executing concurrently, and the GPU has less global memory available than the CPU, which may be why the same code runs fine there. You can compare the two by querying the devices, as sketched below.
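One way to compare, assuming the default selector picks the device you expect, is to print the global memory size each device reports:

#include <sycl/sycl.hpp>
#include <iostream>

int main()
{
    sycl::queue q;                          // default device selection
    sycl::device dev = q.get_device();

    // Report which device was picked and how much global memory it exposes.
    std::cout << dev.get_info<sycl::info::device::name>() << ": "
              << dev.get_info<sycl::info::device::global_mem_size>() / (1024 * 1024)
              << " MiB of global memory\n";
    return 0;
}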


You may read more about DPC++ concepts in the link below:

https://www.apress.com/us/book/9781484255735


Thanks,

Rahul


Laurence
Beginner

Thanks for the reply. All of my kernel calls have queue.wait() afterwards, so they should not be executing simultaneously. Am I correct in assuming that all the kernels are built for the device on the first queue.submit call?

P.S. I have read the book you suggested from cover to cover.

Laurence
Beginner

I confirmed it is a build issue by trying to build for an Nvidia V100. The process is killed while running the following command.

"/usr/local/cuda-10.1//bin/ptxas" -m64 -O0 -v --gpu-name sm_50 --output-file /tmp/extras-85f00b.o /tmp/extras-5d7db0.s

The extras file contains a custom complex number class, and the issue is triggered by its setpolar function.

void complex::setpolar(double mag, double phi)
{
    phi = phi * M_PI / 180;
    r = sycl::cos(phi) * mag;
    i = sycl::sin(phi) * mag;
}

I could not reproduce this with a simple test case, but it occurs when I compile the full code base.
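For illustration only, a simple test case might look roughly like the following; complex_t here is a hypothetical stand-in with just the members mentioned in this thread (r, i, setpolar), not the actual class from extras:

#include <sycl/sycl.hpp>
#include <cmath>

// Hypothetical stand-in for the custom complex number class.
struct complex_t
{
    double r = 0.0, i = 0.0;
    void setpolar(double mag, double phi)
    {
        phi = phi * M_PI / 180;
        r = sycl::cos(phi) * mag;
        i = sycl::sin(phi) * mag;
    }
};

int main()
{
    sycl::queue q;
    complex_t *out = sycl::malloc_shared<complex_t>(1, q);
    q.parallel_for(sycl::range<1>{1}, [=](sycl::id<1>) {
        out[0].setpolar(2.0, 45.0);   // expect r == i == sqrt(2.0)
    }).wait();
    sycl::free(out, q);
    return 0;
}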

Laurence
Beginner

I managed to work around the issue by using a different implementation of the complex number class. The following commands can be used to reproduce the original error.

git clone https://github.com/lfield/madgraph4gpu.git
cd madgraph4gpu
git checkout e27b31be
cd epoch2/sycl/gg_ttgg
source /opt/intel/oneapi/setvars.sh 
cmake -B build
cmake --build build
cd SubProcesses/P1_Sigma_sm_gg_ttxgg
./check_sa.exe  1 4 1
RahulV_intel
Moderator

Hi Laurence,


Apologies for the late response.


>> Am I correct in assuming that all the kernels are built for the device on the first queue.submit call


Your assumption is right. The DPC++ compiler produces an intermediate representation, SPIR-V, for JIT (just-in-time) compilation. When the first kernel is submitted, the entire SPIR-V module it belongs to, including every other kernel in that module, is compiled for the device. A sketch of how to trigger that compilation up front is shown below.
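If your compiler build supports the SYCL 2020 kernel_bundle API (an assumption; it may not be available in all releases), one way to force that up-front JIT step at a known point, so a device-compilation failure surfaces before the first kernel runs, is:

#include <sycl/sycl.hpp>

int main()
{
    sycl::queue q;
    // Build every kernel image associated with this context now, instead of
    // lazily at the first queue.submit; a JIT failure will show up here.
    auto bundle = sycl::get_kernel_bundle<sycl::bundle_state::executable>(q.get_context());
    // ... submit kernels as usual ...
    return 0;
}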


I followed your build commands and ran the application on a Gen9 iGPU; the original issue is reproducible there as well. Were you able to resolve the issue with the workaround you mentioned? If so, could you please share it for the benefit of other community users?


Since you are running your application on an Nvidia V100, could you let me know whether you are using the GitHub version of DPC++ that supports the CUDA backend?


Thanks,

Rahul


Laurence
Beginner

The issue was resolved by using a different implementation of the complex number class that does not use trig functions. Investigation showed that commenting out the trig functions allowed the code to build, even though the setpolar function worked in a simple test case. The root cause is still unknown.

I used both dpcpp on my Intel NUC and the GitHub version of DPC++ with CUDA backend support on a different machine. I only built for the V100 because that machine was already set up for AOT compilation. The fact that it fails in both situations suggests a code issue rather than a compiler issue.

 

RahulV_intel
Moderator

Hi,


It does look like a code issue, as you mentioned. Since your issue is resolved, shall I go ahead and close this thread from my end?


Thanks,

Rahul


RahulV_intel
Moderator

Hi,


I have not heard back from you, so I will go ahead and close this thread from my end. Intel will no longer monitor this thread. Feel free to post a new query if you require further assistance from Intel.


Thanks,

Rahul

