
Barrier State Buffer Implementation

Altera_Forum
Honored Contributor II

Page 68 of this tutorial (http://tcfpga.org/fpga2013/opencl_tutorial.pdf) says that barriers buffer live variables of work-items. What memory is this buffer implemented in? Does it use block RAMs?

Altera_Forum
Honored Contributor II

It'll typically be on-chip memory blocks. These buffers allow more work-groups to be scheduled, so the compute unit can keep filling with work while work-items are temporarily stalled at the barrier.
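
For illustration (a minimal sketch, not from the original post; the kernel and argument names are hypothetical), this is the situation the tutorial describes: a private value computed before the barrier is still needed after it, so it is "live" and has to be buffered for every work-item waiting at the barrier.

// 'partial' is computed before the barrier and read again after it, so the
// barrier logic must hold a copy of it for each work-item.
__kernel void live_across_barrier(__global const float *in,
                                  __global float *out,
                                  __local float *scratch)
{
    int lid = get_local_id(0);
    int gid = get_global_id(0);

    float partial = in[gid] * 2.0f;       // private value, live across the barrier
    scratch[lid] = partial;

    barrier(CLK_LOCAL_MEM_FENCE);         // live private values are buffered here

    int neighbour = (lid + 1) % (int)get_local_size(0);
    out[gid] = partial + scratch[neighbour];   // 'partial' reused after the barrier
}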

Altera_Forum
Honored Contributor II

So if I'm memory-block limited and can't change my algorithm to use fewer (local) barriers, I can either 1) shrink the work-group size so the buffers are smaller, or 2) change my algorithm to use less local memory? Are there any other sources of memory block pressure?

 

When you say "typically", does that mean registers can be used for these buffers? Or global memory?
Altera_Forum
Honored Contributor II

Shrinking the work-group size may help, but I suspect all that will do is allow more work-items to be scheduled concurrently and you'll be back to square one; I still recommend giving it a shot. I also recommend defining a macro for your work-group size so that you can pass it in at kernel compile time and experiment with different sizes (using the -D<macro name>=<value> flag).
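
For example (a sketch only; WG_SIZE and the kernel are hypothetical), the work-group size can be left as a preprocessor macro and supplied on the aoc command line, e.g. aoc -DWG_SIZE=128 mykernel.cl:

#ifndef WG_SIZE
#define WG_SIZE 256                      // fallback when no -D flag is given
#endif

__kernel void scale(__global float *data)
{
    __local float scratch[WG_SIZE];      // local buffer sized by the macro
    int lid = get_local_id(0);
    scratch[lid] = data[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);
    data[get_global_id(0)] = scratch[lid] * 2.0f;
}

The host then has to enqueue the kernel with the same local work size that was passed to the compiler.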

 

If you haven't already done so, I recommend bounding the work-group size with the max_work_group_size or reqd_work_group_size attribute, since those affect the size of the barrier logic. If your work-group size is larger than 256 you must use one of them, because the default maximum work-group size is 256 unless you override it with those attributes.
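
As a sketch of the attribute syntax (the kernel and the size 64 are made up; max_work_group_size is applied the same way when you only want to set an upper bound, and the SDK's programming guide documents its exact argument form):

__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void fixed_size_kernel(__global float *data)
{
    __local float scratch[64];
    int lid = get_local_id(0);
    scratch[lid] = data[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);
    data[get_global_id(0)] = scratch[63 - lid];
}

With reqd_work_group_size the host must launch the kernel with exactly that local size, which is what lets the compiler size the barrier logic tightly.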

 

I said on-chip memory blocks are typically used because I avoid generalizing about what the compiler creates. I haven't come across a kernel that I suspect would use flip-flops as barrier buffers, but it could potentially happen. You don't really have any control over this logic, and global memory would never be used, since spilling to global memory isn't implemented (it wouldn't work well on FPGAs anyway).

 

There is a chance you can change the kernel to reduce on-chip RAM utilization. For example, if I have a kernel that performs a lot of __local memory accesses, I try to find ways to avoid accessing the same memory multiple times. Doing so reduces the required memory bandwidth and, as a result, may lead to a smaller memory footprint (remember that the hardware is generated to achieve high throughput, so if the bandwidth requirement goes down, the footprint may go down as well). Likewise, you'll sometimes find functionality implemented in memory that can be replaced with logic and arithmetic operations, trading off on-chip memory resources for logic resources (ALUTs). These are just generalizations, and every kernel is different, so without seeing the code I'm not sure how much help my suggestions will be.
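
As an illustration only (a toy kernel, not the poster's code), loading a __local value once into a private variable and reusing it removes two of the three local-memory reads the straightforward version would need:

__kernel void reuse_local(__global const float *in,
                          __global float *out,
                          __local float *scratch)
{
    int lid = get_local_id(0);
    int gid = get_global_id(0);

    scratch[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    // One read of scratch[lid] instead of repeating scratch[lid] in each of
    // the expressions below; the value is reused from a private register.
    float v = scratch[lid];
    out[gid] = (v + 1.0f) * (v - 1.0f) + v;
}

Lower local-memory bandwidth requirements give the compiler more freedom to build a smaller local-memory system, which is the footprint effect described above.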
Altera_Forum
Honored Contributor II

Thanks BadOmen! That was incredibly helpful.
