Hello everyone,

I'm struggling to understand what the Details tab in the Loop Analysis section of the report means when it says "nd-range: thread capacity = 303". Example 2-dimensional kernel:

```c
__kernel void Kernel(__global short const * restrict input_a,
                     __global short       * restrict input_b)
{
    short aux;
    int row   = get_global_id(1);
    int col   = get_global_id(0);
    int width = get_global_size(0);
    for (int j = get_global_id(1); j < get_global_size(0); j++) {
        aux = input_a[j * width + col];      // indices lost in forum formatting
        input_b[j * width + col] += aux;
    }
}
```
Does this mean I can only run a maximum of 303 work-items? And what causes that "constraint" in the code? :confused:
--- Quote Start --- I only can run a maximum of 303 work-items? And what causes that "constraint" on the code ? :confused: --- Quote End --- No, that is not what this means. That number is just the latency of the pipeline which determines the maximum number of threads that can be simultaneously "in-flight" in the pipeline. This does not limit the total number of work-items you can run with that kernel.
--- Quote Start --- No, that is not what this means. That number is just the latency of the pipeline which determines the maximum number of threads that can be simultaneously "in-flight" in the pipeline. This does not limit the total number of work-items you can run with that kernel. --- Quote End --- Thanks HRZ! What causes this number to be 303? Does it depend on the for-loop condition, in the above example j=row+1 < height?
The latency of the pipeline depends on the operations that are carried out inside the loop, plus the loop exit condition. Each iteration of your loop performs one external memory read, one external memory write, multiple integer operations for addressing and calculation, and one final integer comparison for the loop exit condition. Based on the compiler's decisions, these operations require a minimum of 303 clock cycles to complete without stalling the pipeline.
--- Quote Start --- The latency of the pipeline depends on the operations that are carried out inside of the loop, and the loop exit condition. Each iteration of your loop performs one external memory read, one external memory write, multiple integer operations for addressing and calculation, and one final integer comparison for the loop exit condition. Based on the compiler's decision, these operations require a minimum of 303 clocks to be carried out, without stalling the pipeline. --- Quote End --- Ok, now I get it :) Last doubt: does the fact that the loop trip count (in other words, the number of iterations) is not known when the kernel is compiled somehow affect the performance/generated circuit? Because with this code, I'm assuming that certain work-items will have more iterations than others, depending on their ID in the second dimension (row = get_global_id(1)).
Thread-id-dependent branching will have a negative performance impact, because run time will be dominated by the threads that perform the most work. Furthermore, having such branches in your code will prevent you from being able to use the SIMD attribute to increase the performance of your kernel. However, the runtime scheduler will try to minimize the number of pipeline stalls/bubbles and maximize performance in any case.
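For reference, a SIMD-vectorized variant could look roughly like the sketch below. It assumes the loop bound is made uniform across work-items (here, every work-item accumulates a full column of `height` rows, with `height` passed as a kernel argument), which removes the thread-id-dependent trip count; `num_simd_work_items` and `reqd_work_group_size` are the Intel FPGA SDK for OpenCL kernel attributes.

```c
// Sketch only: uniform trip count (no thread-id-dependent branching),
// which is what makes SIMD vectorization possible for the compiler.
__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((num_simd_work_items(8)))   // work-group size must be divisible by 8
__kernel void Kernel_simd(__global short const * restrict input_a,
                          __global short       * restrict input_b,
                          int height)
{
    int col   = get_global_id(0);
    int width = get_global_size(0);
    short acc = 0;
    for (int j = 0; j < height; j++) {    // same bound for every work-item
        acc += input_a[j * width + col];
    }
    input_b[col] = acc;
}
```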
--- Quote Start --- Thread-id-dependent branching will have a negative performance impact, because run time will be dominated by threads that perform the most amount of work. Furthermore, having such branches in your code will prevent you from being able to use the SIMD attribute to increase the performance of your kernel. However, the runtime scheduler will try to minimize the number of pipeline stalls/bubbles and maximize performance in any case. --- Quote End --- Thanks again for the help, HRZ! You are a big help here in the OpenCL section :)
No problem. :)