Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)

Channels vs looped memory accesses

Altera_Forum
Honored Contributor II

With respect to channels, I'm trying to understand why the "channelized" version of simple single-stream I/O is beneficial over looped accesses to memory. If I have a single work-item kernel or work/group, wouldn't this still end up as a single stage of the pipelined kernel? So are channels really only beneficial when used by multiple simultaneous kernels?

10 Replies
Altera_Forum
Honored Contributor II

Do you have some code that you can post to put your question into context? To answer your last question: the only time I use channels is to communicate with other kernels. If you create a channel whose source and destination are the same kernel, you could end up causing scheduling problems, since the compiler will not analyze the latency through the channel. 

 

Sometimes a kernel will have dependencies that prevent it from operating optimally. For example, suppose you have a single work-item kernel that reads from memory, performs some operations on that data, then writes the results back to memory, and let's say the operations on the data are non-deterministic in terms of processing time. In a case like that you might want to improve performance by replicating the code, but it might not be beneficial to replicate all of the code responsible for reading from and writing to memory. So you could move part of the kernel into another kernel, replicate only that kernel, and use channels to push data into it and pull the results back out. Channels let you decouple the kernel providing the data from the kernel receiving the data. When to use them is highly dependent on the algorithm being accelerated (i.e. there is no general rule of thumb for when to use channels).
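 

To make the structure concrete, here is a rough sketch of that split, with all memory reads in one kernel, the compute stage in another, and all memory writes in a third. The channel names, the squaring "work", and the extension pragma are my own illustration (following the Altera SDK channel conventions), not code from the thread:

```c
#pragma OPENCL EXTENSION cl_altera_channels : enable

channel float in_ch;
channel float out_ch;

/* Feeder: all memory reads stay in one kernel */
__kernel void reader(__global const float *restrict src, int n)
{
    for (int i = 0; i < n; ++i)
        write_channel_altera(in_ch, src[i]);
}

/* Worker: only this kernel would need to be replicated to
   scale the (hypothetically slow) compute stage */
__kernel void worker(int n)
{
    for (int i = 0; i < n; ++i) {
        float x = read_channel_altera(in_ch);
        write_channel_altera(out_ch, x * x);  /* placeholder work */
    }
}

/* Drain: all memory writes stay in one kernel */
__kernel void writer(__global float *restrict dst, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = read_channel_altera(out_ch);
}
```

Because the reader and writer never touch the compute logic, replicating the worker (and fanning the channels out) leaves the memory-access code untouched.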
Altera_Forum
Honored Contributor II

In particular, I'm referring to Figures 3/4 from this paper: http://ieee-hpec.org/2013/index_htm_files/29-high-performance-settle-2876089.pdf 

 

I understand this is just an illustrative example, and I think you've answered my question. That is, channels don't add much (any) functionality for simple I/O but do allow you to replicate portions of code which may have lower throughput. 

 

Thanks!
Altera_Forum
Honored Contributor II

You are correct, there wouldn't be any benefit to using channels in that particular example. The author is just using it to show how to take a very simple NDRange kernel that copies a buffer and build the functional equivalent out of two kernels: one reads from memory and stuffs the data into a channel, while the other pulls the data out of the channel and places it into memory. Note that the code is fairly old and it contains an attribute called autorun that isn't supported. If I were to code the example in Figure 4 with the tools today, I would just have two task kernels (single work-item execution) with a loop that dictates how many times memory is read/written. 
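 

For reference, a sketch of what that two-task-kernel version might look like with today's tools. The channel name, depth, and extension pragma follow the Altera SDK conventions but are illustrative, not code from the paper:

```c
#pragma OPENCL EXTENSION cl_altera_channels : enable

channel int copy_ch __attribute__((depth(8)));  /* illustrative depth */

/* Task kernel 1: read from memory, push into the channel */
__kernel void mem_read(__global const int *restrict src, int n)
{
    for (int i = 0; i < n; ++i)
        write_channel_altera(copy_ch, src[i]);
}

/* Task kernel 2: pull from the channel, write back to memory */
__kernel void mem_write(__global int *restrict dst, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = read_channel_altera(copy_ch);
}
```

Launching both as single work-item tasks gives the same buffer copy as the original NDRange kernel, with the loop trip count `n` replacing the autorun attribute.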

 

I recommend taking a look at the design examples that mention channels in the feature column here: http://www.altera.com/support/examples/opencl/opencl.html The ones that use channels do so for good reasons, and studying them will help you see whether something similar applies to your own kernels.
Altera_Forum
Honored Contributor II

I should mention that there are a few small typos in Fig. 6 of that paper. The fifth line should read "int Sd_private[2][N];", the eleventh line "int Sd = Sd_private[0][j];", and the sixth line from the bottom "Sd_private[1][j + 1] = S;". I had copied and pasted from the channel example in Fig. 5 and forgot to account for the channel depth, which has to be managed manually using shift registers. The current preferred implementation should follow Fig. 6 rather than Fig. 7. 
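 

For readers following along, the "manually managed depth" pattern is just the classic shift-register idiom. A minimal sketch in plain C (the names and depth are mine, not from the paper):

```c
#define DEPTH 4   /* stands in for the channel depth being emulated */

/* Shift a new value in at the tail and return the value that falls
   off the head -- the same effect as a DEPTH-deep FIFO. On the FPGA
   the fully unrolled loop is inferred as a chain of registers. */
int shift_push(int reg[DEPTH], int x)
{
    int head = reg[0];
    for (int i = 0; i < DEPTH - 1; ++i)
        reg[i] = reg[i + 1];
    reg[DEPTH - 1] = x;
    return head;
}
```

A value pushed in emerges after exactly DEPTH more pushes, which is what stands in for the channel's latency.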

 

Btw, thanks BadOmen for accurately responding to this post.
Altera_Forum
Honored Contributor II

Sean, 

 

Thanks for the clarifications. That makes the "shift register" implementation much clearer. I've also been looking at the tdfir example optimization slides (http://www.altera.com/support/examples/download/exm_opencl_tdfir_optimization_guide.pdf), which explain this optimization in more detail. 

 

Thanks again BadOmen and sean.settle for your help!
Altera_Forum
Honored Contributor II

Sean, 

 

When you say the preferred implementation should follow Fig. 6 (single work-item execution with unrolled inner loop using shift registers) vs. Fig. 7 (an NDRange implementation with outer loop distributed among work-items) why is this the case? Is this a programmability consideration? Or are there performance implications?
Altera_Forum
Honored Contributor II

Hi Jack, 

 

I was told that by our compiler team, and the programming guide says there may be performance implications depending on the kernel. Both approaches are functionally identical. 

 

Do you have a preference for one approach over the other, and if so, could you explain why?
Altera_Forum
Honored Contributor II

Hello Jack and Sean, 

 

I used both the NDRange and shift register implementations for a 7x7 convolution. 

To obtain acceptable performance (my goal was 1 work-item per clock cycle), the NDRange implementation required the use of local memory, which implies copying a portion of the image plus the 3 neighboring rows above it, the 3 neighboring rows below it, and the 3 neighboring columns on each side, for every work-group. That creates a lot of redundancy and uses a lot of local memory. 

 

The shift-register implementation only needs one buffer holding 6 lines plus 7 pixels of the image, which uses fewer resources for slightly better throughput. 
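 

koper's line buffer can be illustrated in plain C. This sketch shrinks the window to 3x3 and assumes a small image width (W) to keep it short; the 7x7 version from the post just uses a (6*W + 7)-element buffer instead:

```c
#define W   16                   /* image width (illustrative) */
#define K   3                    /* window size; the post uses 7 */
#define BUF ((K - 1) * W + K)    /* 2 lines + 3 pixels for K = 3 */

/* Shift one new pixel into the sliding line buffer. On the FPGA
   this fixed-length shift is inferred as registers. */
void lb_push(int buf[BUF], int px)
{
    for (int i = 0; i < BUF - 1; ++i)
        buf[i] = buf[i + 1];
    buf[BUF - 1] = px;
}

/* Sum the K x K window currently held in the buffer; rows of the
   window are spaced W elements apart. */
int lb_window_sum(const int buf[BUF])
{
    int s = 0;
    for (int r = 0; r < K; ++r)
        for (int c = 0; c < K; ++c)
            s += buf[r * W + c];
    return s;
}
```

Each pixel is read from global memory exactly once, which is where the resource and bandwidth savings over the local-memory NDRange version come from.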

 

Hope this helps. 

Regards 

 

koper
Altera_Forum
Honored Contributor II

This is very interesting, thanks for the reply koper.  

 

I'm still gaining familiarity with the single work-item/pipelined programming model, and currently I don't have a preference between NDRange and pipelined task from a programmability point of view. I'm coming from the GPU side of things, so thinking in terms of implicit pipeline parallelism rather than explicit work-group/work-item parallelism was difficult at first. It's definitely a much cleaner way to program now that I understand how to create shift registers. Koper makes a good point, though: because OpenCL does not let work-items share private memory, if you want to take advantage of the large number of on-chip registers for inter-stage communication of tokens, you either have to use the task model or use channels.
Altera_Forum
Honored Contributor II

Hi Jack, 

 

I want to expand on your last sentence. You can use task-kernel techniques such as shift registers within an NDRange kernel, since a task kernel is just a single work-item, single work-group NDRange kernel. For example, for a moving-average filter you could divide and conquer the input domain and have each work-item in a work-group process its (overlapping) subdomain using its own shift register. You could scale your throughput easily by specifying how many work-items are in a work-group using __attribute__ ((reqd_work_group_size(X, 1, 1))) until you consume some desired portion of your device's resources (logic, registers, memory bandwidth, etc.). I recommend giving it a try to see how it works for your problems.
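 

A rough sketch of that idea, with each work-item running its own shift-register moving average over a contiguous subdomain. The kernel name, tap count, group size, and the non-overlapping chunking are my own simplifications for illustration:

```c
#define TAPS 4

__attribute__((reqd_work_group_size(8, 1, 1)))  /* 8 work-items per group */
__kernel void moving_avg(__global const float *restrict in,
                         __global float *restrict out,
                         int chunk)              /* samples per work-item */
{
    int base = get_global_id(0) * chunk;         /* this item's subdomain  */
    float reg[TAPS] = {0.0f};                    /* private shift register */

    for (int i = 0; i < chunk; ++i) {
        #pragma unroll
        for (int t = 0; t < TAPS - 1; ++t)       /* shift; inferred as registers */
            reg[t] = reg[t + 1];
        reg[TAPS - 1] = in[base + i];

        float sum = 0.0f;
        #pragma unroll
        for (int t = 0; t < TAPS; ++t)
            sum += reg[t];
        out[base + i] = sum / TAPS;
    }
}
```

Raising X in reqd_work_group_size replicates the datapath, so throughput scales with work-items until the device's logic or memory bandwidth runs out, as described above.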