Altera_Forum
Honored Contributor I
1,663 Views

Report Loop Analysis NDrange threads

Hello everyone 

 

I'm struggling to understand what the Details tab in the Loop Analysis section of the report means when it says "nd-range: thread capacity = 303".

 

Example two-dimensional kernel:

 

    ___Kernel(__global short const * restrict input_a,
              __global short const * restrict input_b)
    {
        short aux;
        int row = get_global_id(1);
        int col = get_global_id(0);
        int width = get_global_size(0);

        for (int j = get_global_id(1); j < get_global_size(0); j++) {
            aux = input_a;
            input_b += aux;
        }
    }

 

Can I only run a maximum of 303 work-items? And what causes that "constraint" in the code? :confused:
0 Kudos
7 Replies
Altera_Forum
Honored Contributor I
62 Views

 

--- Quote Start ---  

Can I only run a maximum of 303 work-items? And what causes that "constraint" in the code? :confused:

--- Quote End ---  

 

 

No, that is not what this means. That number is the latency of the pipeline, which determines the maximum number of threads that can be "in-flight" in the pipeline at the same time. It does not limit the total number of work-items you can run with that kernel.
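To make the difference between "in-flight" threads and total work-items concrete, here is a toy model in plain C. It assumes an ideal pipeline that accepts one work-item per clock and never stalls, which is a simplification and not the compiler's actual scheduling; the function names `max_in_flight` and `total_clocks` are made up for this illustration.

```c
#include <assert.h>

/* Toy model (an assumption, not the real scheduler): an ideal pipeline
 * accepts one work-item per clock, and every work-item spends `latency`
 * clocks inside the pipeline before it leaves. */

/* Largest number of work-items inside the pipeline at the same time. */
static long max_in_flight(long work_items, long latency)
{
    assert(work_items >= 0 && latency > 0);
    return work_items < latency ? work_items : latency;
}

/* Clocks until the last work-item has left the pipeline. */
static long total_clocks(long work_items, long latency)
{
    assert(work_items >= 0 && latency > 0);
    return work_items == 0 ? 0 : work_items + latency - 1;
}
```

With a latency of 303, `max_in_flight(1000000, 303)` is 303 while `total_clocks(1000000, 303)` is 1000302: all one million work-items run, but no more than 303 of them occupy the pipeline at any moment.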
Altera_Forum
Honored Contributor I
62 Views

 

--- Quote Start ---  

No, that is not what this means. That number is the latency of the pipeline, which determines the maximum number of threads that can be "in-flight" in the pipeline at the same time. It does not limit the total number of work-items you can run with that kernel.

--- Quote End ---  

 

 

Thanks HRZ!  

 

What causes this number to be 303? Does it depend on the for loop condition (in the above example, j = row + 1 < height)?
Altera_Forum
Honored Contributor I
62 Views

The latency of the pipeline depends on the operations that are carried out inside of the loop, and the loop exit condition. Each iteration of your loop performs one external memory read, one external memory write, multiple integer operations for addressing and calculation, and one final integer comparison for the loop exit condition. Based on the compiler's decision, these operations require a minimum of 303 clocks to be carried out, without stalling the pipeline.

Altera_Forum
Honored Contributor I
62 Views

 

--- Quote Start ---  

The latency of the pipeline depends on the operations that are carried out inside of the loop, and the loop exit condition. Each iteration of your loop performs one external memory read, one external memory write, multiple integer operations for addressing and calculation, and one final integer comparison for the loop exit condition. Based on the compiler's decision, these operations require a minimum of 303 clocks to be carried out, without stalling the pipeline. 

--- Quote End ---  

 

 

OK, now I get it :)

 

One last question: does the fact that the loop trip count (in other words, the number of iterations) is not known when the kernel is compiled affect the performance or the generated circuit? With this code, I'm assuming that some work-items will execute more iterations than others, depending on their ID in the second dimension (row = get_global_id(1)).
Altera_Forum
Honored Contributor I
62 Views

Thread-ID-dependent branching will have a negative performance impact, because run time will be dominated by the threads that perform the most work. Furthermore, having such branches in your code will prevent you from using the SIMD attribute to increase the performance of your kernel. However, the runtime scheduler will try to minimize the number of pipeline stalls/bubbles and maximize performance in any case.
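As a rough illustration of that imbalance, the sketch below counts loop iterations per work-item in plain C, assuming the loop bounds from the posted kernel (j starts at the row ID and runs up to the width). The helper names are hypothetical and the model ignores everything except the trip count.

```c
#include <assert.h>

/* Iterations executed by a work-item in `row` for a loop of the form
 *   for (j = row; j < width; j++)
 * which mirrors the loop bounds in the posted kernel. */
static long iterations_for_row(long row, long width)
{
    assert(row >= 0 && width >= 0);
    return row < width ? width - row : 0;
}

/* Iterations summed over one column of work-items (rows 0..height-1). */
static long total_iterations(long height, long width)
{
    long total = 0;
    for (long row = 0; row < height; row++)
        total += iterations_for_row(row, width);
    return total;
}

/* Longest loop in a column: the work-item in row 0 has the most to do. */
static long max_iterations(long height, long width)
{
    return height > 0 ? iterations_for_row(0, width) : 0;
}
```

For a 4x4 range, the rows execute 4, 3, 2 and 1 iterations, so `total_iterations(4, 4)` is 10 while `max_iterations(4, 4)` is 4. The work-items in row 0 do four times the work of those in row 3, and they are the ones that set the run time.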

Altera_Forum
Honored Contributor I
62 Views

 

--- Quote Start ---  

Thread-ID-dependent branching will have a negative performance impact, because run time will be dominated by the threads that perform the most work. Furthermore, having such branches in your code will prevent you from using the SIMD attribute to increase the performance of your kernel. However, the runtime scheduler will try to minimize the number of pipeline stalls/bubbles and maximize performance in any case.

--- Quote End ---  

 

 

Thanks again for the help, HRZ! You are a big help here in the OpenCL section :)
Altera_Forum
Honored Contributor I
62 Views

No problem. :)
