Multiple iterations of a single-task kernel

Lorenzo_C_ · ‎07-17-2017

Hi everyone.

I am having a very weird problem. I have converted my parallel kernels into one sequential (single-task) kernel, modified the host code accordingly, and cross-checked the code. It works, and I clear the buffers before each iteration, but after a certain number of iterations the kernel fails with this error: CL_INVALID_COMMAND_QUEUE. As far as I know this means the kernel has failed, but that doesn't make sense, because it works for the earlier iterations and then fails on later ones.
To work around this, I re-initialize all the OpenCL objects (command queue, context, and so on) every 50 iterations, and now it gets through all the iterations.
I am running the code on my NVIDIA GPU. What could be causing this problem? I do release the buffers and re-create them... Also, if I run it on the CPU it fails randomly.

One more thing: to run it on the GPU, all clReleaseMemObject calls have to be changed to clRetainMemObject, otherwise it sometimes throws that error again after fewer iterations. On the CPU it is the opposite: I need to switch all clRetainMemObject calls to clReleaseMemObject to make it work, but now it's not working.
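
A note on the two calls, since they are not interchangeable: clRetainMemObject increments a buffer's reference count and clReleaseMemObject decrements it, and the buffer is only freed once the count reaches zero. Calling retain where release was meant therefore never frees anything. The sketch below is a toy model in plain C, not OpenCL code: ToyBuffer, toy_create, toy_retain, toy_release and leaked_after are all made-up names used only to illustrate the counting.

```c
#include <assert.h>

/* Toy stand-in for a cl_mem object; NOT the real OpenCL type. */
typedef struct {
    int refcount;   /* 1 right after creation, like a freshly created buffer */
    int freed;      /* set to 1 once the count reaches zero */
} ToyBuffer;

static ToyBuffer toy_create(void) {
    ToyBuffer b = {1, 0};
    return b;
}

/* Models clRetainMemObject: adds an owner, frees nothing. */
static void toy_retain(ToyBuffer *b) {
    b->refcount++;
}

/* Models clReleaseMemObject: drops an owner; storage goes away at zero. */
static void toy_release(ToyBuffer *b) {
    assert(b->refcount > 0);
    if (--b->refcount == 0) {
        b->freed = 1;
    }
}

/* Counts buffers still alive after `iterations` rounds of create + cleanup.
   use_retain != 0 mimics calling retain where release was intended. */
static int leaked_after(int iterations, int use_retain) {
    int alive = 0;
    for (int i = 0; i < iterations; ++i) {
        ToyBuffer b = toy_create();
        if (use_retain) {
            toy_retain(&b);
        } else {
            toy_release(&b);
        }
        if (!b.freed) {
            alive++;
        }
    }
    return alive;
}
```

In this model, leaked_after(50, 0) leaves no live buffers, while leaked_after(50, 1) leaves all 50 allocated. If swapping in retain makes a crash go away, one possible reading is that something still used the buffer after it was released; nothing in this thread confirms that.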

Lorenzo_C_ · ‎07-18-2017

What happened to this thread?! It says it's there in OpenCL folder but it's not!

Jeffrey_M_Intel1 · ‎07-19-2017

It would help us advise on this issue if you could provide more details on the hardware you are running on. Are you saying you see this behavior on an NVIDIA GPU, or that it works on NVIDIA but not on Intel?

Lorenzo_C_ · ‎07-22-2017

Yeah, basically I am running the code on my NVIDIA GeForce GTX 960M, and it works there if I reset the OpenCL objects when I reach half of the total iterations. If I don't, it crashes with an access violation (reading or writing). Also, when I run it on the GPU I need to use clRetainMemObject to release buffers instead of clReleaseMemObject, which is what I use on the CPU. If I don't do this, it throws "CL_OUT_OF_RESOURCES", and I can't understand why the GPU wants clRetainMemObject in order to work.

My CPU is an Intel Core i7-4720HQ. The code runs on it but crashes after just a few iterations. I am trying to debug it, but it's hard to find the error.

Lorenzo_C_ · ‎07-23-2017

I have also checked structure padding between host and device as far as I can. I have checked the total struct size and the position of each element, and everything looks correct. To make sure that on the device each struct is stored with a size equal to the sum of its members' sizes, I have added __attribute__ ((packed)) to each struct declaration. It is still crashing on the CPU... If it works on the GPU, how can it throw an access violation (reading/writing) when running on the CPU?!
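
For anyone checking the same thing, here is a minimal host-side sketch of that kind of layout check, using sizeof and offsetof. The structs and field names are hypothetical, and int32_t stands in for the cl_int type that real host code would take from the OpenCL headers. It assumes a GCC- or Clang-style compiler for the packed attribute. Ordering members from largest to smallest is one way to avoid hidden padding without packing at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parameter block; field names are made up for illustration. */
typedef struct {
    double  weight;   /* 8-byte member first, so no padding is needed before it */
    int32_t id;
    int32_t flag;
} Params;             /* 8 + 4 + 4 = 16 bytes, no hidden padding */

/* Same fields in a padding-prone order, forced flat with the packed attribute. */
typedef struct __attribute__((packed)) {
    int32_t id;
    double  weight;   /* lands at offset 4, which is misaligned for a double */
    int32_t flag;
} ParamsPacked;

/* Returns 1 when the host layout of Params matches the offsets
   we expect the kernel-side declaration to use. */
static int params_layout_ok(void) {
    return sizeof(Params) == 16
        && offsetof(Params, weight) == 0
        && offsetof(Params, id) == 8
        && offsetof(Params, flag) == 12;
}
```

A check like params_layout_ok() only proves what the host compiler did. The kernel-side struct has to be declared with the same member order and the same packing for the two views to agree.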

Lorenzo_C_ · ‎07-25-2017

I have found the problem. It was an out-of-bounds pointer access, but you can't see that unless you debug and find the exact point where it fails. It was probably working on the GPU because of the type of memory allocation, or for some other reason.

Jeffrey_M_Intel1 · ‎07-25-2017

Glad you found the problem. Please watch this forum for more info on debugging/analysis capabilities of our Intel tools as they continue to improve.

Lorenzo_C_ · ‎07-30-2017

OK, so I really do need help now. I have tried everything but nothing works. I'll explain what I want to achieve and then what I have tried that didn't work.

I need to duplicate data, pass it to the kernel, and run it as an NDRange kernel. I have many buffers that are arrays of integers, doubles, or structs. I need to duplicate these so that each work-item works on its own copy of the data. That will be a lot of data with many work-items, but I can't even get 2 to run.

Say I have one struct: I declare a pointer, allocate memory for two structs of this type, and then copy the data using memcpy or the new operator; either way I get problems. When I copy the data for only one work-item, everything is fine. If I copy the data twice into the buffers, it fails even if I use only 1 work-item, and I can't understand why. I have tried everything I can think of: plain malloc, new, aligned_malloc, and so on. I have also tried making 1 struct with all the data, which could be cloned for another work-item, but it's big, and I am not sure why it doesn't work with just 1 work-item and 1 copied struct with all the data inside. I just can't make it work...

I have the program (which I cannot share) working, but I need to replicate the data for each work-item so that they can work in parallel on the same data. I have tried changing the padding and alignment of the structures (I roughly understand how this works, but I can't find a simple explanation of it), but now even the first version doesn't work with this padding.

I don't know what the problem is. Is it the padding/alignment? If so, why does my first version of the program work?

Has anyone got a solution or had a similar issue?

My friend is doing the same thing in CUDA and he is fine with it. I need to use OpenCL and honestly it just sucks with all these limitations.

Thanks to anyone who takes the time to help me.

Jeffrey_M_Intel1 · ‎07-31-2017

Not sure if this is the same thing, but I have had experiences like this when trying to figure out how to map the NDRange coordinates for work items back to buffer locations. It can be very tricky to figure out. Fortunately, Intel OpenCL has some very nice debugger capabilities.

For CPU: CodeBuilder can create a nice example with all paths set up for debugging within a few seconds.

For GPU, we have more documentation here: https://software.intel.com/en-us/articles/gpu-debugging-challenges-and-opportunities
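
To make the index-mapping point concrete, here is a small sketch in plain C of the arithmetic a kernel typically does to turn a work-item ID into a buffer offset. The function names and the bounds convention are assumptions for illustration, not part of any OpenCL API. In a kernel, gid would come from get_global_id.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of element `i` inside the slice owned by work-item `gid`,
   when every work-item gets `per_item` consecutive elements.
   Returns (size_t)-1 when the access would fall outside the buffer. */
static size_t slice_index(size_t gid, size_t per_item, size_t i, size_t total) {
    if (i >= per_item) {
        return (size_t)-1;
    }
    size_t idx = gid * per_item + i;
    return idx < total ? idx : (size_t)-1;
}

/* Row-major flattening of a 2D NDRange coordinate (gx, gy). */
static size_t flat_2d(size_t gx, size_t gy, size_t width) {
    return gy * width + gx;
}
```

Running the same arithmetic on the host with an explicit bounds check is a cheap way to catch an out-of-range index before it shows up as an access violation on one device and silent corruption on another.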

Lorenzo_C_ · ‎08-03-2017

I have tried setting up a new template project but it doesn't tell me anything new. Debugging doesn't show where it crashes.

I am now focusing on understanding why allocating twice the memory per buffer makes the program fail even if I run the kernel with one work-item. Just allocating twice as much, without touching anything else, makes it fail in the host code, as if the data read back from the kernel were wrong. What could it be? What's the best way to allocate structures contiguously in an array?
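
On the last question, one common approach is a single malloc of copies * sizeof(struct), with the struct holding its data by value. The sketch below assumes a hypothetical ItemState type and helper names (replicate_state, state_bytes), and it is only one way to do this, not a diagnosis of the failure described above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define N_VALUES 4  /* hypothetical fixed capacity */

/* Hypothetical per-work-item state. It holds its array by value, not by
   pointer: a host pointer stored inside a struct is meaningless once the
   bytes are copied to a device buffer. */
typedef struct {
    int    count;
    double values[N_VALUES];
} ItemState;

/* Allocates one contiguous block holding `copies` duplicates of *src.
   The caller frees it with free(). Returns NULL on failure. */
static ItemState *replicate_state(const ItemState *src, size_t copies) {
    if (copies == 0 || copies > (size_t)-1 / sizeof *src) {
        return NULL;
    }
    ItemState *block = malloc(copies * sizeof *block);
    if (!block) {
        return NULL;
    }
    for (size_t k = 0; k < copies; ++k) {
        memcpy(&block[k], src, sizeof *src);
    }
    return block;
}

/* Byte size to use when creating the device buffer for `copies` elements. */
static size_t state_bytes(size_t copies) {
    return copies * sizeof(ItemState);
}
```

Work-item k would then read block[k]. Two things worth checking in the host code are that the size passed to clCreateBuffer and to the read/write calls scales with the number of copies (state_bytes(copies) here, not the size of a single struct or of a pointer), and that no struct member is a host pointer.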