Hello,

I have a very simple piece of code with a few lines, but the outer loop is not pipelined due to "Loop exit condition unresolvable at iteration initiation". The inner loop is pipelined well with II=1. Can anybody suggest an idea? Thanks.

```c
__attribute__((task))
kernel void compute_BFS0(__global const unsigned* restrict ovid_of_edge,
                         __global const unsigned* restrict start_edge,
                         __global const unsigned* restrict end_edge,
                         __global unsigned* restrict node_data)
{
    unsigned ei;
    unsigned si;
    unsigned ovid;
    for (unsigned i = 0; i < 1000; i++) // iterates over graph nodes
    {
        si = start_edge[i]; // sequential read
        ei = end_edge[i];   // sequential read
        for (unsigned j = si; j < ei; j++) // iterates over node's outgoing edges
        {
            ovid = ovid_of_edge[j];  // child node. sequential read.
            node_data[ovid] = 1000;  // random-access write
        }
    }
} // kernel
```
Your inner loop has a variable exit condition; hence, the outer loop is not pipelineable. If ei has a maximum limit, you can replace ei in the header of the inner loop with that maximum limit and add a branch inside the inner loop to discard the iterations where j >= ei. That will allow pipelining of both loops, at the cost of redundant computation, which could be significant depending on how far ei is from that maximum. If there is no maximum or it is too large, then an NDRange kernel would be more appropriate, since the thread scheduler can minimize pipeline bubbles at run-time.
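As a sketch, the transformation described above could look like the plain C below (standing in for the kernel body). MAX_DEGREE is a hypothetical compile-time bound on the number of outgoing edges per node, not something from the original code:

```c
#include <assert.h>

#define MAX_DEGREE 8 /* assumed upper bound on (ei - si); hypothetical */

/* Inner loop with a fixed trip count: it always runs MAX_DEGREE iterations
   from si, and a branch discards the iterations where j >= ei, so the
   compiler can resolve the exit condition at iteration initiation. */
void mark_children_fixed(const unsigned *ovid_of_edge,
                         const unsigned *start_edge,
                         const unsigned *end_edge,
                         unsigned *node_data,
                         unsigned num_nodes)
{
    for (unsigned i = 0; i < num_nodes; i++) {
        unsigned si = start_edge[i];
        unsigned ei = end_edge[i];
        for (unsigned j = si; j < si + MAX_DEGREE; j++) { /* fixed exit condition */
            if (j < ei)                               /* discard redundant iterations */
                node_data[ovid_of_edge[j]] = 1000;    /* random-access write */
        }
    }
}
```

The guard also keeps the access to ovid_of_edge inside the valid range, so only the loop trip count (and not the memory traffic) is padded out to MAX_DEGREE.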
If the problem is the variable loop bound of the inner loop, I wonder why the code below doesn't have this problem too; both loops are pipelined (according to the report).

In the first, problematic code (mentioned earlier), a constant value is written to the children of every parent node in the graph, but in the code below, the values of the children are read and their sum is written to their parent node. In the earlier code there is a random-access write, whereas in the second one there is a random-access read. Can this be the source of the pipelining problem?

```c
__attribute__((task))
__kernel void compute_pagerank(__global const unsigned* restrict ovid_of_edge,
                               __global const unsigned* restrict start_edge,
                               __global const unsigned* restrict end_edge,
                               __global unsigned* restrict node_data,
                               __global unsigned* restrict node_data2)
{
    unsigned acc = 0;
    unsigned ei;
    unsigned si;
    unsigned ovid;
    for (unsigned i = 0; i < 1000; i++) // iterates over graph nodes
    {
        acc = 0;
        si = start_edge[i]; // sequential read
        ei = end_edge[i];   // sequential read
        for (unsigned j = si; j < ei; j++) // iterates over node's outgoing edges
        {
            ovid = ovid_of_edge[j];  // child node. sequential read.
            acc += node_data[ovid];  // random-access read
        }
        node_data2[i] = acc; // sequential write
    }
} // kernel
```

Thanks.
I am genuinely surprised to see that the second kernel is actually pipelineable. Other than the read/write difference you mentioned, I don't see any other difference between the two kernels that could explain it. However, I would say the strange case here is the second kernel, not the first one. I checked with newer versions of the compiler: 16.1.2 and 17.0 fully pipeline the second kernel, but 17.1 and 18.0 say:

--- Quote Start ---
II >= 1
II is an approximation due to variable inner loop trip count.
--- Quote End ---

I guess the compiler is doing some corner-case optimization that allows pipelining in the second kernel. Other than that, I have no idea.

P.S. You can put your code segments in a CODE tag to preserve indentation.
I tested both of your suggestions:

1- Inner loop with a fixed limit. It is now pipelineable, but the increase in run-time due to wasted clocks in the inner loop is large, because my upper limit is huge.
2- I tested the NDRange version. That really worked well! I don't know exactly what the difference is between the implementation of pipelining inside a task and pipelining among work-items inside an NDRange; however, this is a good example for me that they are not similar.

Thank you very much for following this discussion; it was really instructive. :)
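For reference, the NDRange variant described here might look like the sketch below: one work-item per graph node, so each work-item runs only its own node's edge loop and the hardware scheduler overlaps work-items with different edge counts. The #define shims and the get_global_id stub at the top are only there so the kernel body also compiles as plain C for illustration; in a real .cl file they would be dropped:

```c
#include <assert.h>

/* Shims so this OpenCL-style kernel also compiles as host C (illustration only). */
#define __kernel
#define __global
static unsigned current_gid; /* stands in for the NDRange work-item id */
static unsigned get_global_id(unsigned dim) { (void)dim; return current_gid; }

/* NDRange sketch: each work-item handles one node's outgoing edges, so the
   variable inner trip count becomes per-work-item work that the run-time
   scheduler can overlap, with no padded iterations. */
__kernel void compute_BFS0_ndrange(__global const unsigned *ovid_of_edge,
                                   __global const unsigned *start_edge,
                                   __global const unsigned *end_edge,
                                   __global unsigned *node_data)
{
    unsigned i = get_global_id(0);             /* one node per work-item */
    for (unsigned j = start_edge[i]; j < end_edge[i]; j++)
        node_data[ovid_of_edge[j]] = 1000;     /* random-access write */
}
```

On the host side this kernel would be launched with a global size equal to the number of nodes, replacing the outer loop of the single work-item version.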
--- Quote Start ---
2- I tested the NDRange version. That really worked well! I don't know exactly what the difference is between the implementation of pipelining inside a task and pipelining among work-items inside an NDRange; however, this is a good example for me that they are not similar.
--- Quote End ---

The difference is that the run-time scheduler in the NDRange version will avoid all the redundant computation. NDRange is preferred for unpipelineable loops because of this (while single work-item is preferred for pretty much every other case).