num_compute_units effect on concurrent workgroups

Altera_Forum · 08-28-2017

Hello all,

I have a kernel that uses barriers, and I have been running into a problem during compilation where the compiler throws the warning

Compiler Warning: Kernel 'sync': limiting to 2 concurrent work-groups because threads might reach barrier out-of-order.

The affected area is like this:

while(tid < vert_count)
{
      status = tid;
      tid += total_threads;
}
barrier(CLK_GLOBAL_MEM_FENCE);

I suspect this has something to do with the indeterminate loop bounds, but I could be wrong and am looking for suggestions. To work around the warning, I also tried increasing the number of compute units with the num_compute_units() attribute, but that did not change the outcome. Does anyone have any insight as to why that might be? Or, more broadly, how does the num_compute_units() attribute affect work-group scheduling and concurrency?

Altera_Forum · 08-29-2017

The warning you get is almost certainly caused by the thread-id-dependent loop bound, which lets threads reach the barrier in an arbitrary order. Using num_compute_units() will not affect this issue. num_compute_units() fully replicates the pipeline, allowing the compiler to schedule multiple work-groups in parallel, each in a different compute unit. Altera recommends having at least three times as many work-groups as compute units in order to fully utilize the circuit. From what I understand, this issue limits the number of parallel work-groups per compute unit (each region between two barriers in the same compute unit can be occupied by a different work-group), not the total number of parallel work-groups in flight across different compute units, but I could be wrong.
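
To make the out-of-order point concrete, here is a minimal sketch in plain host-side C (not kernel code) that counts how many iterations each work-item would execute in the loop from the first post. The helper name trip_count is hypothetical and only for illustration; vert_count and total_threads are taken from the posted snippet.

```c
#include <assert.h>

/* Number of iterations work-item `tid` executes in
 *     while (tid < vert_count) { ...; tid += total_threads; }
 * Hypothetical helper for illustration; not part of any SDK. */
unsigned trip_count(unsigned tid, unsigned vert_count, unsigned total_threads)
{
    assert(total_threads > 0);   /* a zero stride would never terminate */
    if (tid >= vert_count)
        return 0;                /* loop body is skipped entirely */
    /* ceil((vert_count - tid) / total_threads) */
    return (vert_count - tid + total_threads - 1) / total_threads;
}
```

For example, with vert_count = 10 and total_threads = 4, work-items 0 and 1 run three iterations while work-items 2 and 3 run only two, so they leave the loop at different times. That difference in trip counts is presumably what the compiler means by threads possibly reaching the barrier out of order.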

Altera_Forum · 08-29-2017


I have actually tried this out and found that increasing the number of compute units does not raise the limit to 2*num_compute_units concurrent work-groups, as your explanation would suggest. Any insight as to why this might be the case?

I can also see that the problem lies with the thread-id-dependent branching, and I will try to address this. Is there a good way to work in the local scope like this without using the thread id to branch?

Altera_Forum · 08-30-2017

I am not sure what you mean by "allow for 2*num_compute_units". Using more compute units is supposed to increase performance by allowing more work-groups to run in parallel. For example, if you have 10 work-groups in your code and each work-group takes x seconds to process on one compute unit, your total run time will be 10x; with two compute units, assuming that you have enough off-chip memory bandwidth, your run time will be reduced to ~5x.
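
The arithmetic above can be sketched in plain C as follows. This is an idealized estimate only: it assumes every work-group takes the same time x, that memory bandwidth is not a bottleneck, and that each compute unit handles one work-group at a time (it ignores work-groups being pipelined within a compute unit). The function name rounds_needed is hypothetical.

```c
#include <assert.h>

/* Idealized number of sequential "rounds" needed to process all
 * work-groups when each compute unit handles one work-group per round.
 * Estimated run time is rounds_needed(...) * x.
 * Hypothetical helper for illustration; not part of any SDK. */
unsigned rounds_needed(unsigned work_groups, unsigned compute_units)
{
    assert(compute_units > 0);   /* at least one compute unit must exist */
    /* ceil(work_groups / compute_units) */
    return (work_groups + compute_units - 1) / compute_units;
}
```

With 10 work-groups this gives 10 rounds for one compute unit and 5 rounds for two, matching the 10x and ~5x figures. With three compute units it gives 4 rounds, because the last round leaves two compute units idle.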

You do not necessarily need to avoid thread-id-dependent branches; in many cases they are the only solution. The compiler warning is meant to be informative and is not something that you absolutely need to fix. It might not even result in much of a performance degradation.