Re: questions about buffer size and parameter size.

Altera_Forum · 10-27-2017

I am implementing sparse matrix multiplication on a Nalla 510t and am having trouble transferring my matrix.

I have split the matrix into 512 strips for parallel computation.

First I tried to create 512 cl_mem instances and pass the 512 pointers to the kernel function, but I found that there is a limit on kernel argument size and only about 100 pointers are supported, so it doesn't compile.

Then I tried to create a single cl_mem instance for the image and another cl_mem instance holding the 512 offset addresses as indices, but when compiling, it is reported that the cl_mem is too large (I set the size to 1024*1024*512 elements of uint type, which should be 2GB). The design can only be compiled when the buffer size is set to less than 2GB.

I want to run larger matrices, and since the Nalla 510t has 16GB of on-board DDR memory, this should be physically supported. But how can I achieve this using OpenCL?

Thanks in advance for any help!

Altera_Forum · 10-28-2017

You need to break up your data into smaller workgroups within the total NDRange. Use required or max workgroup size attributes and launch the kernel appropriately to support the workgroup size you've selected. See the best practices guide for info on setting a workgroup size.

Altera_Forum · 10-28-2017

Would you please post your compilation log or whatever error you get? Are you talking about kernel compilation or host compilation here?

The size of your cl_mem buffer has nothing to do with kernel compilation, because the kernel compiler only sees the pointer to the buffer and does not know or care about the actual size of the buffer. The host compiler does not care about the size of the buffer either; you will just get an out of memory error from the OpenCL runtime during execution if your buffer is larger than the physically-available on-board memory. The only case in which your compilation will fail is when you try to create a very large "local" buffer on the FPGA.

Altera_Forum · 10-30-2017

Do you mean that since the kernel cannot support such a big work size, I should split it into several sub-kernels?

This makes sense to me, but I will need to change my kernel code a lot. :(

--- Quote Start ---

You need to break up your data into smaller workgroups within the total NDRange. Use required or max workgroup size attributes and launch the kernel appropriately to support the workgroup size you've selected. See the best practices guide for info on setting a workgroup size.

--- Quote End ---

Altera_Forum · 10-30-2017

Hi HRZ, you are correct. The code compiles (both host and kernel), and I get the error message at runtime.

The error message for large cl_mem is:

Error: Requested memory object size exceeds device limits.

I didn't save the error message for the case with too many parameters. As I remember, when there are a little too many parameters (about 100), it reports a parameter overflow and the arguments are not passed to the kernel correctly. However, when there are far too many parameters (more than 256), there is no explicit error message, only a runtime stack error, which cost me some time to figure out.

Since each FPGA on the board has 4 DDR channels, each 4GB in size, why can't I declare a 2GB buffer? Maybe the only solution is sstrell's suggestion, to divide the data set.
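One thing that may be worth ruling out here: OpenCL reports a per-buffer allocation limit (CL_DEVICE_MAX_MEM_ALLOC_SIZE, read with clGetDeviceInfo) separately from the total global memory size (CL_DEVICE_GLOBAL_MEM_SIZE), so a single buffer can be rejected even when the board has plenty of DDR. Whether that is the cause on this board is an assumption, not something confirmed in this thread. The sketch below only does the size arithmetic; the function names are hypothetical and `max_alloc` stands for whatever value the host queried from the device.

```c
#include <stdint.h>

/* Bytes needed for `count` elements of `elem_size` bytes each.
 * Returns 0 if the product would overflow 64 bits. */
uint64_t buffer_bytes(uint64_t count, uint64_t elem_size)
{
    if (elem_size != 0 && count > UINT64_MAX / elem_size)
        return 0;
    return count * elem_size;
}

/* Returns 1 if one buffer of `bytes` fits under `max_alloc`, else 0.
 * `max_alloc` is assumed to be the value the host obtained for
 * CL_DEVICE_MAX_MEM_ALLOC_SIZE; it is not queried here. */
int fits_in_one_buffer(uint64_t bytes, uint64_t max_alloc)
{
    return bytes != 0 && bytes <= max_alloc;
}
```

For the case in this thread, buffer_bytes(1024 * 1024 * 512, 4) is 2147483648 bytes, i.e. exactly 2GB, so a device limit even one byte below 2GB would reject the request.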

--- Quote Start ---

Would you please post your compilation log or whatever error you get? Are you talking about kernel compilation or host compilation here?

The size of your cl_mem buffer has nothing to do with kernel compilation, because the kernel compiler only sees the pointer to the buffer and does not know or care about the actual size of the buffer. The host compiler does not care about the size of the buffer either; you will just get an out of memory error from the OpenCL runtime during execution if your buffer is larger than the physically-available on-board memory. The only case in which your compilation will fail is when you try to create a very large "local" buffer on the FPGA.

--- Quote End ---

Altera_Forum · 10-31-2017

--- Quote Start ---

Do you mean that since the kernel cannot support such a big work size, I should split it into several sub-kernels?

This makes sense to me, but I will need to change my kernel code a lot. :(

--- Quote End ---

No, not separate kernels. Break up your NDRange into smaller workgroups that can fit in the hardware (use the local_work_size argument when launching the kernel with clEnqueueNDRangeKernel on the host side and use a maximum or required workgroup size attribute on the kernel side).
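As a rough illustration of the host-side arithmetic: when a local work size is passed to clEnqueueNDRangeKernel, OpenCL 1.x requires the global work size to be a multiple of it, so the global size is usually rounded up and the kernel guards against the padded work-items. The helper name below is hypothetical, and the workgroup size of 64 in the usage note is only an example value that would have to match the kernel-side attribute, e.g. `__attribute__((reqd_work_group_size(64, 1, 1)))`.

```c
#include <stddef.h>

/* Round `work_items` up to the next multiple of `local_size`, giving a
 * global work size that is evenly divisible into workgroups.
 * Returns 0 if `local_size` is 0. */
size_t padded_global_size(size_t work_items, size_t local_size)
{
    if (local_size == 0)
        return 0;
    return ((work_items + local_size - 1) / local_size) * local_size;
}
```

For example, 1000 work-items with a workgroup size of 64 gives a global size of 1024; the kernel would then need a check such as `if (get_global_id(0) >= n) return;` so that the 24 padding work-items do nothing.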

Altera_Forum · 10-31-2017

--- Quote Start ---

The error message for large cl_mem is:

Error: Requested memory object size exceeds device limits.

--- Quote End ---

Please post the part of your host code that is generating that error message, along with the actual values of all parameters passed to that function. You very likely have a mistake somewhere in your host code.

Also, you should never write your code in a way that requires you to split your buffers on the host, or to pass 100 parameters to the kernel. This is certainly not the correct way to write OpenCL code. If you are not familiar with OpenCL, I strongly recommend looking at some basic non-FPGA examples and writing some basic OpenCL code on CPUs and GPUs first, and then moving on to FPGAs. Altera also has a lot of examples here (https://www.altera.com/products/design-software/embedded-software-developers/opencl/developer-zone.html) which you can look at.

Altera_Forum · 11-01-2017

My kernel is actually RTL code wrapped in an OpenCL function. The RTL part takes 256 base addresses as input, and 256 channels inside the RTL fetch data simultaneously. There is some arbitration logic in the RTL as well, so in the end the wrapper exposes only 4 input Avalon interfaces and 1 output Avalon interface.

Given this, I cannot figure out a safe way to break up the NDRange without modifying the RTL code. :(

Any suggestions?

--- Quote Start ---

No, not separate kernels. Break up your NDRange into smaller workgroups that can fit in the hardware (use the local_work_size argument when launching the kernel with clEnqueueNDRangeKernel on the host side and use a maximum or required workgroup size attribute on the kernel side).

--- Quote End ---

Altera_Forum · 11-01-2017

The host divides the sparse matrix into 256 strips, and the size of each strip may be different. It then passes the matrix to the kernel together with the 256 offset values. The kernel is implemented in RTL and wrapped in an OpenCL library.

The code that triggers the error tries to allocate a large matrix:

cl_mem bufferMA;
size_t sizeMA = 1024 * 1024 * 512;
bufferMA = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(cl_float) * sizeMA, NULL, &status);
status = clEnqueueWriteBuffer(queue, bufferMA, CL_FALSE, 0, sizeof(cl_float) * sizeMA, MA, 0, NULL, NULL);

Here MA is defined as

void * MA=(void*)aocl_utils::alignedMalloc(sizeof(cl_float) * sizeMA);

and initialized elsewhere.

The code compiles, but I get the error at runtime.

I think this is because the RTL kernel takes the problem as a whole, so the dataset exceeds some limit?

--- Quote Start ---

Please post the part of your host code that is generating that error message, along with the actual values of all parameters passed to that function. You very likely have a mistake somewhere in your host code.

Also, you should never write your code in a way that requires you to split your buffers on the host, or to pass 100 parameters to the kernel. This is certainly not the correct way to write OpenCL code. If you are not familiar with OpenCL, I strongly recommend looking at some basic non-FPGA examples and writing some basic OpenCL code on CPUs and GPUs first, and then moving on to FPGAs. Altera also has a lot of examples here (https://www.altera.com/products/design-software/embedded-software-developers/opencl/developer-zone.html) which you can look at.

--- Quote End ---

Altera_Forum · 11-02-2017

Is it the "clCreateBuffer" call that is failing or the "clEnqueueWriteBuffer"? Please post the exact OpenCL error code number. Furthermore, is this the only buffer you are allocating on the device or do you also have other buffers?
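To make the error code number easier to report, a small lookup like the one below could be called on the `status` value after each of the two calls. The function name is hypothetical and the table is deliberately partial; it lists only the standard OpenCL codes that buffer creation and buffer writes commonly return, with the numeric values defined in the Khronos cl.h header.

```c
/* Map a few standard OpenCL status codes to their names.
 * Partial table: anything not listed falls through to the default. */
const char *cl_error_name(int code)
{
    switch (code) {
    case 0:   return "CL_SUCCESS";
    case -4:  return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
    case -5:  return "CL_OUT_OF_RESOURCES";
    case -6:  return "CL_OUT_OF_HOST_MEMORY";
    case -30: return "CL_INVALID_VALUE";
    case -34: return "CL_INVALID_CONTEXT";
    case -36: return "CL_INVALID_COMMAND_QUEUE";
    case -37: return "CL_INVALID_HOST_PTR";
    case -38: return "CL_INVALID_MEM_OBJECT";
    case -61: return "CL_INVALID_BUFFER_SIZE";
    default:  return "unlisted OpenCL error";
    }
}
```

Checking `status` immediately after clCreateBuffer, before it is overwritten by clEnqueueWriteBuffer, shows which of the two calls actually failed.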

If you just need to pass offset values, you can just put all the offsets in an array and pass a pointer to that array, instead of passing a pointer for each offset. Then use the offset value in the kernel to adjust your starting point for reading data from external memory and feeding it into your HDL library.
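A minimal host-side sketch of that idea, assuming the strips are stored back to back in one matrix buffer and each strip's length in elements is known: the offsets are a running sum of the lengths, collected in a single array that would then be copied into one cl_mem buffer and passed as one kernel argument. The function and parameter names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill offsets[i] with the starting element index of strip i, given each
 * strip's length in elements. Returns the total number of elements, which
 * is also the size the single matrix buffer must hold. */
uint64_t build_strip_offsets(const uint32_t *lengths, size_t n_strips,
                             uint64_t *offsets)
{
    uint64_t next = 0;
    for (size_t i = 0; i < n_strips; ++i) {
        offsets[i] = next;      /* strip i starts where strip i-1 ended */
        next += lengths[i];
    }
    return next;
}
```

With 256 strips this replaces 256 pointer arguments with two: the matrix buffer and the offsets buffer. 64-bit offsets are used here because a matrix larger than 2GB can exceed what a 32-bit byte offset can address.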