I am trying to perform an operation inside an OpenCL kernel. I have a buffer named filter that holds a 3x3 matrix with every element initialized to 1.
I pass it as an argument to the OpenCL kernel from the host side. The issue appears when I read this buffer on the device side as float3 vectors. For example:
__kernel void(constant float3* restrict filter)
float3 temp1 = filter[0];
float3 temp2 = filter[1];
float3 temp3 = filter[2];
The first two temp variables behave as expected: all of their components are 1. But the third one (temp3) has only its x component set to 1, while its y and z components are 0. When I read the buffer as plain float values instead, everything behaves as expected. Am I doing something wrong? I don't want to use the vload instructions because they add overhead.
Hi Abhi Verma,
Thanks for the interest.
Can you attach a minimally representative reproducer for your observed issue? I took a look at the relevant OCL standard documentation and didn't see anything immediately obvious.
I'd like to see how the kernel is ingesting the constant data. Can you send the host side and the device side program? Can you send build options as well? How are you displaying the data?
What hardware target is this executing on? Which OpenCL implementation is in use (from Intel Graphics Driver version (Windows* OS) or NEO/CPU RT version (Linux* OS))?
Vec3s in OpenCL C are a bit strange - they are three component vectors, but the size of a vec3 is the same as the size of a vec4, and the alignment requirements for a vec3 are the same as alignment requirements for a vec4. See:
If your matrix data is tightly packed, you'll want to use the vload3 and vstore3 built-ins to load and store your data instead. Can you say anything more about the overhead you are seeing? Off the top of my head, I can't think of any reason why there should be overhead when loading float data with vload/vstore vs. regular pointer-based loads and stores. Thanks!
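To make the size and alignment point concrete, here is a small host-side sketch in plain C (not OpenCL code) that only emulates the index arithmetic. It assumes the buffer holds 9 tightly packed floats; the helper names are hypothetical and exist only for illustration. Indexing a float3 pointer advances 4 floats per element, while vload3(i, p) on a float pointer advances 3 floats per element.

```c
/* Host-side emulation of the addressing only; helper names are hypothetical. */

#define FLOAT3_STRIDE 4  /* OpenCL C: a float3 occupies the space of a float4 */
#define N_FLOATS 9       /* assumption: tightly packed 3x3 matrix of floats */

/* First float index read by filter[i] when filter is a float3 pointer. */
static int float3_index_start(int i) { return i * FLOAT3_STRIDE; }

/* First float index read by vload3(i, p) when p is a float pointer. */
static int vload3_start(int i) { return i * 3; }

/* How many of the 3 components starting at `start` lie inside the buffer. */
static int components_in_bounds(int start) {
    int count = 0;
    for (int c = 0; c < 3; ++c) {
        if (start + c < N_FLOATS) {
            ++count;
        }
    }
    return count;
}
```

Under these assumptions, filter[2] starts at float index 8, so only its x component lies inside the 9-float buffer, and its y and z components are read past the end (whatever values appear there are not guaranteed). That would be consistent with the temp3 result described in the question. With the kernel argument declared as constant float*, vload3(2, filter) would start at index 6 and read elements 6, 7 and 8.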