Weird arbitration logic

RJimé1 · ‎04-23-2020

Hello,

I need to pass a structure with a pointer member to my kernel. Since this cannot be done directly, I'm trying two different approaches. The first splits the structure into its non-pointer members and the pointer, passes them as separate kernel arguments, and reassembles the structure inside the kernel. The second uses an auxiliary kernel that receives the non-pointer members and the pointer, assembles the structure, and stores it in global memory, so that the actual kernel can access the structure through a pointer to global memory. Both approaches work, but the first performs much better than the second (in fact, the second reaches only about 1% memory efficiency, according to the Dynamic Profiler). It seems the second approach generates highly complex arbitration logic, which leads to high-latency LSUs. I'd like to know whether this can be avoided, since I prefer the second approach over the first. If not, an explanation of the compiler's behaviour would also be welcome.

A minimal example of what I'm describing (for the sake of simplicity I omitted the kernel that assembles the struct):

typedef struct {
    int x;
    int y;
    __global int *restrict data;
} swp;
 
typedef struct {
    int x;
    int y;
} swop;
 
/* Second approach: the structs (including their pointer members) are read from global memory. */
__kernel void struct_with_pointer(__global swp *restrict p_in, __global swp *restrict p_out) {
    swp in = *p_in;
    swp out = *p_out;
 
    for(int i = 0; i < 10; i++)
        out.data[i] = in.data[i] + 3;
}
 
/* First approach: non-pointer members and pointers arrive as separate kernel arguments. */
__kernel void struct_without_pointer(swop in_coords, __global int *restrict data_in,
                                    swop out_coords, __global int *restrict data_out) {
    swp in = {
        .x = in_coords.x,
        .y = in_coords.y,
        .data = data_in
    };
 
    swp out = {
        .x = out_coords.x,
        .y = out_coords.y,
        .data = data_out
    };
 
    for(int i = 0; i < 10; i++)
        out.data[i] = in.data[i] + 3;
}
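
For readers following along, here is a minimal sketch of what the omitted auxiliary kernel might look like. It is an assumption, not the code from the original project: the name assemble_struct and its argument order are hypothetical. The shim macros and the host_demo driver are also assumptions, added only so that the OpenCL C can be compiled as plain C for a quick functional check on the host. They say nothing about the hardware the offline compiler generates.

```c
/* Host-side shims (assumption): make the OpenCL qualifiers no-ops so the
   kernels below compile as plain C. They are skipped by an OpenCL compiler. */
#ifndef __OPENCL_VERSION__
#define __kernel
#define __global
#endif

typedef struct {
    int x;
    int y;
    __global int *restrict data;
} swp;

typedef struct {
    int x;
    int y;
} swop;

/* Hypothetical auxiliary kernel: builds the struct from its non-pointer
   members and the data pointer, then stores it in global memory. */
__kernel void assemble_struct(swop coords, __global int *restrict data,
                              __global swp *restrict dst) {
    dst->x = coords.x;
    dst->y = coords.y;
    dst->data = data;
}

/* Same body as the kernel in the post: both structs are read from global memory. */
__kernel void struct_with_pointer(__global swp *restrict p_in,
                                  __global swp *restrict p_out) {
    swp in = *p_in;
    swp out = *p_out;

    for (int i = 0; i < 10; i++)
        out.data[i] = in.data[i] + 3;
}

#ifndef __OPENCL_VERSION__
/* Host-only driver (assumption): calls the kernels as ordinary C functions.
   On a device these would be separate kernel launches sharing swp buffers. */
static void host_demo(int result[10]) {
    int src[10];
    for (int i = 0; i < 10; i++) {
        src[i] = i;
        result[i] = 0;
    }

    swop in_coords = {1, 2};
    swop out_coords = {3, 4};
    swp in_desc, out_desc;

    assemble_struct(in_coords, src, &in_desc);
    assemble_struct(out_coords, result, &out_desc);
    struct_with_pointer(&in_desc, &out_desc);
}
#endif
```

With this layout, the data pointer that struct_with_pointer dereferences is itself a value loaded from global memory at run time, whereas in struct_without_pointer it is a kernel argument.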

I'm also attaching a report where you can verify this (compare struct_with_pointer.B2 and struct_without_pointer.B2 in the graph viewer to see the problem). Due to time constraints, I cannot provide any profiling data beyond what I have already stated.

AnilErinch_A_Intel · ‎04-30-2020

Hi

Hope you are staying safe.

Please refer to the section below

8. Strategies for Improving Memory Access Efficiency

in the guide

https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/opencl-sdk/aocl-best-practices-guide.pdf

As mentioned in the guide :

"If your OpenCL kernel performs a large number of memory accesses, the Intel FPGA SDK for OpenCL Offline Compiler must generate complex arbitration logic to handle the memory access requests. The complex arbitration logic might cause a drop in the maximum operating frequency (fmax), which degrades kernel performance"

The same section describes a number of techniques you can try to improve this.

Thanks and Regards

Anil