Advice on how to handle structs with huge arrays

Edgardo_Doerner · 01-25-2017

Dear all,

Currently I am working on a Monte Carlo code for particle transport simulations using OpenCL, and I am facing a problem with the size of some arguments passed to the OpenCL kernel. For example, to store some data about the simulated geometry I use the following struct:

typedef struct ALIGNED(ALIGNMENT) region_data_t {
    // reg = 0, front of geometry, reg = MXREG+1, back of geometry
    cl_float rhof[MXREG + 2];
    cl_float pcut[MXREG + 2];
    cl_float ecut[MXREG + 2];
    cl_int med[MXREG + 2];
    cl_int flags[MXREG + 2];
} region_data_t;

I use the same definition on the host and on the device (I have some definitions that map the cl_* types to the "normal" ones). The problem is that MXREG, the maximum number of regions allowed in a simulation, can be quite large, so I generally hit the stack limit of my OS. I can work around that with the "ulimit -s hard" command, but that is clearly not ideal.

So the question is: how would you handle this kind of struct? I could just dynamically allocate all the arrays and pass them separately to the kernel, but it would be nice to keep using structs inside my code. I have a couple more such structs, and the number of kernel arguments could grow rapidly. Thanks for your help!

Jeffrey_M_Intel1 · 01-29-2017

I'm still checking on the implications of using structs as parameters. So far I have had the best experience sticking to standard types for kernel parameters, but this may just be me being conservative.

Of course, the types used for your host code, kernel parameters, and kernel code do not need to match. As long as the data is contiguous, each work item can calculate offsets and convert types to get the results you want. For example, it is common for host code to hold float data while the kernel parameter is declared as float4. You pass addresses to the work items through the kernel parameter list. How each work item computes its offsets from those pointers, for both input and output, is up to your implementation.
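To make the offset idea concrete, here is a minimal sketch in plain C of how the five arrays from region_data_t could live in one contiguous heap block instead of a stack-allocated struct. The names region_layout_t, region_layout, region_f32 and region_i32 are hypothetical, and the 64-byte spacing between arrays is an assumption; float and int32_t stand in for cl_float and cl_int.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical descriptor (not from the original code): byte offsets of the
   five region arrays inside one contiguous heap block. */
typedef struct region_layout_t {
    size_t n;      /* entries per array: MXREG + 2 */
    size_t rhof;   /* offset of the float array rhof */
    size_t pcut;   /* offset of the float array pcut */
    size_t ecut;   /* offset of the float array ecut */
    size_t med;    /* offset of the 32-bit int array med */
    size_t flags;  /* offset of the 32-bit int array flags */
    size_t total;  /* bytes to allocate for the whole block */
} region_layout_t;

/* Round x up to the next multiple of a (a must be > 0). */
size_t round_up(size_t x, size_t a)
{
    return (x + a - 1) / a * a;
}

/* Compute the offsets for a given MXREG. Each array starts on a multiple of
   `align` bytes; align should itself be a multiple of 4 so the typed views
   below stay correctly aligned. */
region_layout_t region_layout(size_t mxreg, size_t align)
{
    region_layout_t l;
    size_t fbytes, ibytes;

    assert(align > 0);
    l.n = mxreg + 2;
    fbytes = round_up(l.n * sizeof(float), align);
    ibytes = round_up(l.n * sizeof(int32_t), align);

    l.rhof  = 0;
    l.pcut  = l.rhof + fbytes;
    l.ecut  = l.pcut + fbytes;
    l.med   = l.ecut + fbytes;
    l.flags = l.med + ibytes;
    l.total = l.flags + ibytes;
    return l;
}

/* Typed views into the block at a given byte offset. */
float *region_f32(void *base, size_t offset)
{
    return (float *)((char *)base + offset);
}

int32_t *region_i32(void *base, size_t offset)
{
    return (int32_t *)((char *)base + offset);
}
```

On the host you would call region_layout(MXREG, 64), malloc l.total bytes, and fill the arrays through region_f32(base, l.rhof) and so on, which keeps the large arrays off the stack. The kernel would then take a single buffer argument plus the offsets, rather than five separate pointers.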

The main concerns I know of are making sure that the host-side address of each member buffer is aligned and that you meet the other criteria for zero copy. Dynamic aligned allocation for your member buffers could help you make sure that data I/O is efficient.
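As a sketch of what dynamic aligned allocation could look like for one member buffer, the C11 helper below returns page-aligned memory with a padded size. As far as I know, Intel's guidance for processor graphics is a host pointer aligned to 4096 bytes and a size that is a multiple of 64 bytes; treat both numbers as assumptions and check them against your driver's documentation. The names ZC_ALIGN, zc_padded_size and zc_alloc are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed alignment for zero copy (page size); verify for your device. */
#define ZC_ALIGN ((size_t)4096)

/* Round a byte count up to a multiple of ZC_ALIGN. C11 aligned_alloc expects
   the size to be a multiple of the alignment, and any multiple of 4096 is
   also a multiple of 64. */
size_t zc_padded_size(size_t bytes)
{
    return (bytes + ZC_ALIGN - 1) / ZC_ALIGN * ZC_ALIGN;
}

/* Allocate at least `bytes` bytes on a ZC_ALIGN boundary. The padded size is
   written to *padded when padded is not NULL. Returns NULL on failure; the
   caller releases the memory with free(). */
void *zc_alloc(size_t bytes, size_t *padded)
{
    size_t n;

    assert(bytes > 0);
    n = zc_padded_size(bytes);
    if (padded != NULL) {
        *padded = n;
    }
    return aligned_alloc(ZC_ALIGN, n);
}
```

The returned pointer and the padded size are what you would hand to clCreateBuffer together with CL_MEM_USE_HOST_PTR. Keep the allocation alive until the buffer object has been released.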

Edgardo_Doerner · 01-30-2017

Thanks for the advice. I think that, at least for the huge structures, I will stick to standard types.

About the zero-copy property: I have tested some of the Intel OpenCL examples (such as MultiDeviceBasic) and I have a question. How can one be sure that zero-copy behavior is enabled? And how does this technique affect execution time if I use other devices, such as AMD or Nvidia GPUs?

Thanks for your help!

Michal_M_Intel · 01-30-2017

One way to check whether your OpenCL buffer has the zero-copy property is to use the driver diagnostics extension.

Here is the sample code:

https://software.intel.com/en-us/articles/application-performance-using-intel-opencl-driver-diagnostics-sample-users-guide

And here is the extension spec:

https://www.khronos.org/registry/OpenCL/extensions/intel/cl_intel_driver_diagnostics.txt

With diagnostics enabled, clCreateBuffer will report a GOOD diagnostic when the buffer qualifies, indicating that zero copy is happening.

Something like this:

"Performance hint: clCreateBuffer with pointer 30d5000 and size 4096 meets alignment restrictions and buffer will share the same physical memory with CPU."