Hello everyone,

I'm struggling to understand what the Details tab in the Loop Analysis section of the report means when it says "nd-range: thread capacity = 303". Example 2-dimensional kernel:

```c
__kernel void Kernel(__global short const * restrict input_a,
                     __global short       * restrict input_b)
{
    short aux;
    int row   = get_global_id(1);
    int col   = get_global_id(0);
    int width = get_global_size(0);
    for (int j = get_global_id(1); j < get_global_size(0); j++) {
        aux = input_a[j * width + col];      // indices lost in forum formatting
        input_b[j * width + col] += aux;
    }
}
```
Does this mean I can only run a maximum of 303 work-items? And what causes that "constraint" in the code? :confused:
--- Quote Start --- I only can run a maximum of 303 work-items? And what causes that "constraint" on the code ? :confused: --- Quote End --- No, that is not what this means. That number is just the latency of the pipeline which determines the maximum number of threads that can be simultaneously "in-flight" in the pipeline. This does not limit the total number of work-items you can run with that kernel.
--- Quote Start --- No, that is not what this means. That number is just the latency of the pipeline which determines the maximum number of threads that can be simultaneously "in-flight" in the pipeline. This does not limit the total number of work-items you can run with that kernel. --- Quote End --- Thanks HRZ! What causes this number to be 303? Does it depend on the for-loop condition, in the above example j=row+1 < height?
The latency of the pipeline depends on the operations that are carried out inside the loop, plus the loop exit condition. Each iteration of your loop performs one external memory read, one external memory write, multiple integer operations for addressing and calculation, and one final integer comparison for the loop exit condition. Based on the compiler's decisions, these operations require a minimum of 303 clock cycles to complete without stalling the pipeline.
--- Quote Start --- The latency of the pipeline depends on the operations that are carried out inside of the loop, and the loop exit condition. Each iteration of your loop performs one external memory read, one external memory write, multiple integer operations for addressing and calculation, and one final integer comparison for the loop exit condition. Based on the compiler's decision, these operations require a minimum of 303 clocks to be carried out, without stalling the pipeline. --- Quote End --- Ok, now I get it :) Last doubt: does the fact that the loop trip count (in other words, the number of iterations) is not known when the kernel is compiled somehow affect the performance/generated circuit? Because with this code, I'm assuming that certain work-items will have more iterations than others, depending on their ID in the second dimension (row = get_global_id(1)).
Thread-id-dependent branching will have a negative performance impact, because run time will be dominated by the threads that perform the most work. Furthermore, having such branches in your code will prevent you from being able to use the SIMD attribute to increase the performance of your kernel. However, the runtime scheduler will try to minimize the number of pipeline stalls/bubbles and maximize performance in any case.
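For reference, a SIMD-vectorized variant could look roughly like the sketch below. It assumes the loop bound is made uniform across work-items (here, every work-item accumulates a full column of `height` rows, with `height` passed as a kernel argument), which removes the thread-id-dependent trip count; `num_simd_work_items` and `reqd_work_group_size` are the Intel FPGA SDK for OpenCL kernel attributes.

```c
// Sketch only: uniform trip count (no thread-id-dependent branching),
// which is what makes SIMD vectorization possible for the compiler.
__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((num_simd_work_items(8)))   // work-group size must be divisible by 8
__kernel void Kernel_simd(__global short const * restrict input_a,
                          __global short       * restrict input_b,
                          int height)
{
    int col   = get_global_id(0);
    int width = get_global_size(0);
    short acc = 0;
    for (int j = 0; j < height; j++) {    // same bound for every work-item
        acc += input_a[j * width + col];
    }
    input_b[col] = acc;
}
```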
--- Quote Start --- Thread-id-dependent branching will have a negative performance impact, because run time will be dominated by threads that perform the most amount of work. Furthermore, having such branches in your code will prevent you from being able to use the SIMD attribute to increase the performance of your kernel. However, the runtime scheduler will try to minimize the number of pipeline stalls/bubbles and maximize performance in any case. --- Quote End --- Thanks again for the help, HRZ! You are a big help here in the OpenCL section :)
No problem. :)