
Report Loop Analysis NDRange threads

Altera_Forum
Honored Contributor II

Hello everyone 

 

I'm struggling to understand what the Details tab in the Loop Analysis section of the report means when it says "nd-range: thread capacity = 303".

 

Example two-dimensional kernel:

 

// kernel name and array indices were garbled by the forum formatting; reconstructed here
__kernel void example(__global short const * restrict input_a,
                      __global short * restrict input_b)  // not const: it is written to
{
    short aux;
    int row = get_global_id(1);
    int col = get_global_id(0);
    int width = get_global_size(0);
    for (int j = get_global_id(1); j < get_global_size(0); j++) {
        aux = input_a[j * width + col];
        input_b[j * width + col] += aux;
    }
}

 

Can I only run a maximum of 303 work-items? And what causes that "constraint" in the code? :confused:
7 Replies
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

Can I only run a maximum of 303 work-items? And what causes that "constraint" in the code? :confused:

--- Quote End ---  

 

 

No, that is not what this means. That number is just the latency of the pipeline, which determines the maximum number of threads that can be simultaneously "in-flight" in the pipeline. This does not limit the total number of work-items you can run with that kernel.
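For example, you can still enqueue an NDRange that is orders of magnitude larger than 303. A minimal host-side sketch (the function, queue, and kernel names are placeholders, and the kernel arguments are assumed to be set elsewhere):

#include <CL/cl.h>

/* Placeholder launch helper: 'queue' and 'kernel' are assumed to have been
   created and the kernel arguments set elsewhere. */
void launch(cl_command_queue queue, cl_kernel kernel)
{
    size_t global_size[2] = {1024, 1024};   /* over a million work-items in total */
    clEnqueueNDRangeKernel(queue, kernel,
                           2,              /* work_dim */
                           NULL,           /* global offset */
                           global_size,    /* global work size */
                           NULL,           /* let the runtime pick the work-group size */
                           0, NULL, NULL);
    /* The pipeline simply processes the work-items in waves of up to ~303 in-flight threads. */
}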
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

No, that is not what this means. That number is just the latency of the pipeline, which determines the maximum number of threads that can be simultaneously "in-flight" in the pipeline. This does not limit the total number of work-items you can run with that kernel.

--- Quote End ---  

 

 

Thanks HRZ!  

 

What causes this number to be 303? Does it depend on the for-loop condition, in the above example j = row + 1 < height?
Altera_Forum
Honored Contributor II

The latency of the pipeline depends on the operations that are carried out inside the loop, and on the loop exit condition. Each iteration of your loop performs one external memory read, one external memory write, multiple integer operations for addressing and calculation, and one final integer comparison for the loop exit condition. Based on the compiler's scheduling decisions, these operations require a minimum of 303 clock cycles to complete without stalling the pipeline.
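To make that concrete, here is the posted loop again, annotated with the operation types that add up to the reported latency (the per-operation latencies are chosen by the compiler and depend on the board support package, so 303 is simply their sum for this particular compile; indexing as reconstructed above):

for (int j = get_global_id(1); j < get_global_size(0); j++) {  // integer compare: loop exit condition
    aux = input_a[j * width + col];      // external (global) memory read: typically the dominant latency
    input_b[j * width + col] += aux;     // external (global) memory write
    // plus integer multiply/add operations for address calculation and the j++ increment
}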

Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

The latency of the pipeline depends on the operations that are carried out inside the loop, and on the loop exit condition. Each iteration of your loop performs one external memory read, one external memory write, multiple integer operations for addressing and calculation, and one final integer comparison for the loop exit condition. Based on the compiler's scheduling decisions, these operations require a minimum of 303 clock cycles to complete without stalling the pipeline.

--- Quote End ---  

 

 

OK, now I get it :)

 

One last doubt: does the fact that the loop trip count (in other words, the number of iterations) is not known when the kernel is compiled affect the performance or the generated circuit in any way? With this code, I'm assuming that certain work-items will execute more iterations than others, depending on their ID in the second dimension (row = get_global_id(1)).
Altera_Forum
Honored Contributor II

Thread-ID-dependent branching will have a negative performance impact, because run time will be dominated by the threads that perform the most work. Furthermore, having such branches in your code will prevent you from using the SIMD attribute to increase the performance of your kernel. However, the runtime scheduler will try to minimize the number of pipeline stalls/bubbles and maximize performance in any case.
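For instance, if the trip count were uniform across the work-group, say passed in as a kernel argument instead of depending on get_global_id(1), the compiler could vectorize the kernel. A hypothetical variant (the kernel name, SIMD factor, and work-group size are placeholders):

// Hypothetical variant with a uniform loop bound so num_simd_work_items can be used.
__attribute__((reqd_work_group_size(64, 1, 1)))  // required when num_simd_work_items is specified
__attribute__((num_simd_work_items(4)))          // work-group size must be divisible by the SIMD factor
__kernel void example_simd(__global short const * restrict input_a,
                           __global short * restrict input_b,
                           int height)           // uniform trip count passed from the host
{
    int col = get_global_id(0);
    int width = get_global_size(0);
    short aux;
    for (int j = 0; j < height; j++) {           // same bound for every work-item
        aux = input_a[j * width + col];
        input_b[j * width + col] += aux;
    }
}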

Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

Thread-ID-dependent branching will have a negative performance impact, because run time will be dominated by the threads that perform the most work. Furthermore, having such branches in your code will prevent you from using the SIMD attribute to increase the performance of your kernel. However, the runtime scheduler will try to minimize the number of pipeline stalls/bubbles and maximize performance in any case.

--- Quote End ---  

 

 

Thanks again for the help, HRZ! You are a big help here in the OpenCL section :)
Altera_Forum
Honored Contributor II

No problem. :)
