
Help with porting a CUDA code to OpenCL

Altera_Forum
Honored Contributor II

Hi everyone. 

I am porting a CUDA kernel to an OpenCL kernel so that I can execute it on an FPGA (Stratix V). 

The code looks very similar, so it shouldn't be a problem, except that this FPGA board only supports OpenCL 1.0. 

So I have some questions regarding the code I have to port. 

 

1) With CUDA you can call different functions asynchronously and allocate and store data on the GPU independently of the kernel you are executing (as far as I understand). Is this possible with OpenCL? 

 

2) The algorithm is partly sequential and partly parallel. How can I achieve the same behavior with OpenCL without losing performance? Can I use multiple kernels, and if so, how does that work? 

 

3) Any suggestions on how I should approach this? 

 

Thank you very much to anyone who can help me.
Altera_Forum
Honored Contributor II

OK, after a lot of research I now have a good idea of how to port CUDA code to OpenCL. The only thing bothering me now is this: 

 

Assume I have kernels that call __device__ functions in CUDA, which do not exist in OpenCL. They are simple functions called by the kernel, and these functions launch other kernels with a particular global dimension and local dimension. How can I translate this easily into OpenCL? I know that with OpenCL 1.0 I cannot launch kernels from kernels, so what's the best way to go about this? 

 

Thanks..
Altera_Forum
Honored Contributor II

1) You can create multiple queues and have multiple kernels or buffer copy calls running in parallel, and then synchronize them using events. Note that global memory consistency is only guaranteed at the end of kernel execution and hence, if you try to share a read_write buffer between multiple queues, you will get undefined behavior, unless you synchronize the calls using events. 
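For example, a minimal host-side sketch of this (kernel_a, kernel_b, gsize and the rest of the setup are assumed to exist; error checking omitted). The kernel in the second queue is only allowed to start once the event from the first one has completed:

    cl_command_queue q0 = clCreateCommandQueue(context, device, 0, &err);
    cl_command_queue q1 = clCreateCommandQueue(context, device, 0, &err);

    cl_event a_done;
    // kernel_a runs in queue 0 and signals a_done when it finishes
    clEnqueueNDRangeKernel(q0, kernel_a, 1, NULL, &gsize, NULL, 0, NULL, &a_done);
    // kernel_b runs in queue 1 but waits on a_done, so it sees kernel_a's results
    clEnqueueNDRangeKernel(q1, kernel_b, 1, NULL, &gsize, NULL, 1, &a_done, NULL);
    clFinish(q1);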

 

2) You can have as many kernels as you want in each queue. Kernel execution calls are non-blocking, and kernels will be queued on the device and executed in-order. You can put the two different parts of your code in two different kernels and run them in-order. 
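A sketch of the in-order case, assuming two hypothetical kernels named part_one and part_two that have already been created and had their arguments set:

    // Both calls return immediately; the queue is in-order by default, so
    // part_two does not start on the device until part_one has finished.
    clEnqueueNDRangeKernel(queue, part_one, 1, NULL, &gsize, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, part_two, 1, NULL, &gsize, NULL, 0, NULL, NULL);
    clFinish(queue);   // block the host until both kernels have completed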

 

3) I would remove all GPU-specific optimizations from the code, convert the kernel to sequential C code, and compile it as a single work-item kernel. 
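As a rough illustration (not your actual code, and assuming a float matrix), a single work-item kernel is just a plain loop inside one kernel; it is launched with a global and local size of 1 (or clEnqueueTask), and the offline compiler pipelines the loop:

    __kernel void init_matrix_task(__global float* restrict matrix,
                                   int rows, int cols)
    {
        // one sequential loop instead of one work-item per element
        for (int i = 0; i < rows * cols; i++) {
            matrix[i] = 0.0f;
        }
    }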

 

You probably need to spend a good deal of time reading Khronos's OpenCL documents and also Altera's OpenCL documents (getting started guide, programming guide, best practices guide). Don't concern yourself with which version of OpenCL is supported on the FPGA; you should not need the features that only exist in newer versions just yet.
Altera_Forum
Honored Contributor II

I am struggling to understand how to convert a CUDA launch configuration into an OpenCL NDRange. 

 

Say this is my CUDA kernel launch: initMatrix<<<_h_cols, _h_rows>>>(var1, var2, var3, var4);  

 

_h_cols and _h_rows are just integers, and the matrix has been flattened, so there is no 2D indexing. 

How can I convert this to an OpenCL NDRange?  
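For reference, the usual 1D mapping is global size = grid size x block size and local size = block size, so get_global_id(0) plays the role of blockIdx.x * blockDim.x + threadIdx.x. A sketch of the equivalent host call, assuming the command queue and the kernel arguments are already set up:

    size_t local_size  = _h_rows;                      // CUDA block size
    size_t global_size = (size_t)_h_cols * _h_rows;    // blocks x threads per block
    // note: in OpenCL 1.x the global size must be a multiple of the local size
    clEnqueueNDRangeKernel(queue, initMatrix, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);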

 

Thanks..
Altera_Forum
Honored Contributor II

If your issue is with general OpenCL usage, and not with problems specific to using OpenCL on FPGAs, I recommend searching online or asking on Stack Overflow. You could take a look at the following pages for starters: 

 

https://github.com/vtsynergy/cu2cl 

 

https://www.sharcnet.ca/help/index.php/porting_cuda_to_opencl
Altera_Forum
Honored Contributor II

The problem I am encountering is that the number of columns and the number of rows of my matrix are not multiples or divisors of each other. Can someone help me with this?

Altera_Forum
Honored Contributor II

OK, I am getting far with this project and the porting is getting easier. But now I am experiencing this problem when I run this kernel: 

 

__kernel void computeRouteStrength(__global double* restrict routeStrength,
                                   __global int* restrict order,
                                   __global double* restrict routePheromones,
                                   __global Route* restrict possibleRoutesPerOrderMatrix,
                                   __global int* restrict cols,
                                   __global double* restrict beta,
                                   __global int* restrict dailyOrderLimit,
                                   __global int* restrict productionCapacities)
{
    int i = get_global_id(0);
    int route = (*order * *cols) + i;
    double strength = -1;
    if (possibleRoutesPerOrderMatrix[route].order != -1) {
        double pheromone = routePheromones[route];
        double heuristicInformation = 1 / possibleRoutesPerOrderMatrix[route].heuristicInformation;
        if (canBeShipped(&possibleRoutesPerOrderMatrix[route], dailyOrderLimit, productionCapacities)) {
            strength = pheromone * heuristicInformation;
        }
    }
    routeStrength[route] = strength;
}

 

Basically, if I remove the calculation inside the if statement, the kernel doesn't fail. If I put any calculation there, it fails with: INVALID COMMAND QUEUE. 

What could the problem be? The index should not be an issue; the arrays have also been flattened (from 2D to 1D) and I am using them in other kernels. 

 

Thanks..
Altera_Forum
Honored Contributor II

Unless you are converting the OpenCL error number incorrectly, "INVALID COMMAND QUEUE" has nothing to do with what is in your kernel; it is because you have some issue in your host code. Find the OpenCL function in the host that is throwing this error and fix it.
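For instance, a sketch of checking every host call so the failing one is obvious (CHECK here is a hypothetical helper macro, not part of the OpenCL API; queue, kernel, gsize and lsize are assumed to exist):

    #include <stdio.h>
    #include <stdlib.h>
    #include <CL/cl.h>

    // print which host call returned the error code and abort
    #define CHECK(err, name) \
        if ((err) != CL_SUCCESS) { fprintf(stderr, "%s failed with %d\n", name, (int)(err)); exit(1); }

    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gsize, &lsize, 0, NULL, NULL);
    CHECK(err, "clEnqueueNDRangeKernel");
    CHECK(clFinish(queue), "clFinish");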

Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

Unless you are converting the OpenCL error number incorrectly, "INVALID COMMAND QUEUE" has nothing to do with what is in your kernel; it is because you have some issue in your host code. Find the OpenCL function in the host that is throwing this error and fix it. 

--- Quote End ---  

 

 

I used clerrors.h from the FPGA board's test project to decode the error. Also, if I run the kernel with some data it doesn't fail, but with other data it does fail. I am not sure why; the data doesn't look corrupted.
Altera_Forum
Honored Contributor II

Apparently that kernel does run if I make it sequential. As soon as I run it in parallel, it fails. (Yes, I have adapted the code for each type of execution.) 

So what I am thinking is that there is a race condition when writing to the same array. Is that correct/possible? 

 

I am trying to use atomic functions to see if I can solve this problem, but when I compile I get "note: candidate function has been explicitly made unavailable". What does that mean? 

I am using double variables in my code. 

 

Thanks
Altera_Forum
Honored Contributor II

As I mentioned earlier: 

 

 

--- Quote Start ---  

Note that global memory consistency is only guaranteed at the end of kernel execution and hence, if you try to share a read_write buffer between multiple queues, you will get undefined behavior, unless you synchronize the calls using events. 

--- Quote End ---  

 

 

So, yes, do not do this unless you are using events to synchronize kernel calls from the host. 

 

Regarding the "note: candidate function has been explicitly made unavailable" message, Altera's Programming Guide says: 

 

The SDK does not support 64-bit atomic functions described in Section 9.7 of the OpenCL Specification version 1.0. 

 

That is probably the source of your problem.
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

As I mentioned earlier: 

 

 

 

So, yes, do not do this unless you are using events to synchronize kernel calls from the host. 

 

Regarding the "note: candidate function has been explicitly made unavailable" message, Altera's Programming Guide says: 

 

The SDK does not support 64-bit atomic functions described in Section 9.7 of the OpenCL Specification version 1.0. 

 

That is probably the source of your problem. 

--- Quote End ---  

 

 

Thank you very much. I was expecting that it wasn't supported. By the way, I am not running multiple queues, and I always wait for one kernel execution to finish before launching another, so my question was about a single kernel launched with a 1D NDRange but with multiple work-groups. If, in this case, the multiple work-groups access and write to the same global array, would that cause a problem? Thanks
Altera_Forum
Honored Contributor II

If you are using num_compute_units to replicate the kernel, and your accesses are random in a way that two different work-items from different work-groups might try to read/write the same memory location, then yes, this is certainly possible. Without num_compute_units, this shouldn't happen unless there is some race condition in the kernel itself (i.e. incorrect code which would also give incorrect results on a CPU/GPU).
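For reference, kernel replication in the Altera/Intel FPGA OpenCL SDK is requested with an attribute on the kernel; a sketch with an arbitrary count of two:

    __attribute__((num_compute_units(2)))
    __kernel void computeRouteStrength(/* same arguments as before */)
    {
        // work-groups are distributed across the two compute-unit copies,
        // so work-items from different groups can run truly concurrently
    }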

Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

If you are using num_compute_units to replicate the kernel, and your accesses are random in a way that two different work-items from different work-groups might try to read/write the same memory location, then yes, this is certainly possible. Without num_compute_units, this shouldn't happen unless there is some race condition in the kernel itself (i.e. incorrect code which would also give incorrect results on a CPU/GPU). 

--- Quote End ---  

 

 

I am not using num_compute_units, and I don't know what it is, so I guess that is not the problem. So running that kernel sequentially or in parallel should not be an issue, I guess. I'll post here if I find anything else. Thanks for the help.
Altera_Forum
Honored Contributor II

OK, the program is running perfectly on the GPU. I compiled the kernels used in the program and flashed them onto the FPGA. I rebooted, checked with "aocl diagnose" that the FPGA was communicating, and used the "aocl program" command to check. Then I ran the program on the FPGA and it gave me an access violation reading memory. It is not the first time I have run something on the FPGA. What could it be?

Altera_Forum
Honored Contributor II

No idea. You don't really need to flash the FPGA manually, though; the OpenCL runtime will automatically do this during execution.
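Roughly speaking, the host just creates the program object from the .aocx binary and the runtime takes care of configuring the device from it; a sketch (load_file is a hypothetical helper that reads the file into memory, error checking omitted):

    size_t bin_len;
    unsigned char *bin = load_file("kernels.aocx", &bin_len);
    cl_program program = clCreateProgramWithBinary(context, 1, &device, &bin_len,
                                                   (const unsigned char **)&bin,
                                                   NULL, &err);
    clBuildProgram(program, 1, &device, "", NULL, NULL);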

Altera_Forum
Honored Contributor II

OK, I have managed to run the same code on both the CPU and the GPU without problems or crashes. But to run it on the GPU, all clReleaseMemObject calls have to be changed to clRetainMemObject, otherwise it sometimes outputs wrong values and takes much longer to finish. On the CPU it is the opposite: I need to switch all clRetainMemObject calls to clReleaseMemObject to make it work, otherwise the program throws CL_OUT_OF_RESOURCES when creating some temporary buffers during execution. 

 

So what is going on? What is the difference between the two? Thanks for the support.
Altera_Forum
Honored Contributor II

I have never used clRetainMemObject. Are you using these calls in-between the execution of your kernels? These are generally used at the end of execution for cleanup purposes, and in such case, will in no way affect data integrity. Are you using blocking read/write buffer calls? Make sure you are not releasing a device buffer right after a non-blocking read from it (which can obviously cause data corruption).
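To illustrate that last point (buf, queue, size and host_ptr are assumed):

    // Safe: blocking read; the call returns only after the data is in host memory.
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, size, host_ptr, 0, NULL, NULL);
    clReleaseMemObject(buf);

    // Risky: non-blocking read; wait for the event (or clFinish) before releasing,
    // otherwise the transfer may still be in flight when the buffer goes away.
    cl_event read_done;
    clEnqueueReadBuffer(queue, buf, CL_FALSE, 0, size, host_ptr, 0, NULL, &read_done);
    clWaitForEvents(1, &read_done);
    clReleaseMemObject(buf);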

Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

I have never used clRetainMemObject. Are you using these calls in-between the execution of your kernels? These are generally used at the end of execution for cleanup purposes, and in such case, will in no way affect data integrity. Are you using blocking read/write buffer calls? Make sure you are not releasing a device buffer right after a non-blocking read from it (which can obviously cause data corruption). 

--- Quote End ---  

 

 

I am always using blocking read/write calls. I only release buffers when I am finished with one set of data, before starting again with another set. I just don't understand why one works only on the GPU and the other only on the CPU. 

The CPU gives me resource problems when I use the "wrong" one. Can anyone help me figure this out? Thanks.
Altera_Forum
Honored Contributor II

OK, first of all thanks everyone. Now I am having a very weird problem. I have converted the kernels into one sequential kernel, modified the host code, cross-checked the code, and so on, and it does work. I clear the buffers before each iteration, but for some reason, after a certain number of iterations the kernel fails and throws this error: CL_INVALID_COMMAND_QUEUE. As far as I know this means the kernel has failed, but it doesn't make any sense. 

To work around this, I re-initialize all the OpenCL objects (command queue, context, and so on) every few iterations, and now it gets through all the iterations. 

I am running the code on my NVIDIA GPU. What could be causing this problem? I do release the buffers and re-initialize them. Also, if I run it on the CPU, it fails randomly.