Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)

Channels vs looped memory accesses

Altera_Forum
Honored Contributor II

With respect to channels, I'm trying to understand why the "channelized" version of simple single-stream I/O is beneficial over looped accesses to memory. If I have a single work-item kernel or work/group, wouldn't this still end up as a single stage of the pipelined kernel? So are channels really only beneficial when used by multiple simultaneous kernels?

10 Replies
Altera_Forum
Honored Contributor II

Do you have some code that you can post to put your question into context? To answer your last question: the only time I use channels is to communicate with other kernels. If you create a channel whose source and destination are the same kernel, you could end up causing scheduling problems, since the compiler will not analyze the latency through the channel. 

 

Sometimes a kernel will have dependencies that prevent it from operating optimally. For example, suppose you have a single work-item kernel that reads from memory, performs some operations on that data, then writes the results back to memory, and let's say the operations on the data are non-deterministic in terms of processing time. In a case like that you might want to improve performance by replicating the code, but it might not be beneficial to replicate all of the code responsible for reading from and writing to memory. So you could move part of the kernel into another kernel, replicate only that kernel, and use channels to push data into it and pull the results back out. Channels let you decouple the kernel providing the data from the kernel receiving the data. When to use them is highly dependent on the algorithm being accelerated (i.e. there is no general rule of thumb for when to use channels).
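 

To make the structure concrete, here is a rough sketch of that split, with all memory reads in one kernel, the compute stage in another, and all memory writes in a third. The channel names, the squaring "work", and the extension pragma are my own illustration (following the Altera SDK channel conventions), not code from the thread:

```c
#pragma OPENCL EXTENSION cl_altera_channels : enable

channel float in_ch;
channel float out_ch;

/* Feeder: all memory reads stay in one kernel */
__kernel void reader(__global const float *restrict src, int n)
{
    for (int i = 0; i < n; ++i)
        write_channel_altera(in_ch, src[i]);
}

/* Worker: only this kernel would need to be replicated to
   scale the (hypothetically slow) compute stage */
__kernel void worker(int n)
{
    for (int i = 0; i < n; ++i) {
        float x = read_channel_altera(in_ch);
        write_channel_altera(out_ch, x * x);  /* placeholder work */
    }
}

/* Drain: all memory writes stay in one kernel */
__kernel void writer(__global float *restrict dst, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = read_channel_altera(out_ch);
}
```

Because the reader and writer never touch the compute logic, replicating the worker (and fanning the channels out) leaves the memory-access code untouched.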
Altera_Forum
Honored Contributor II

In particular, I'm referring to Figures 3/4 from this paper: http://ieee-hpec.org/2013/index_htm_files/29-high-performance-settle-2876089.pdf 

 

I understand this is just an illustrative example, and I think you've answered my question. That is, channels don't add much (any) functionality for simple I/O but do allow you to replicate portions of code which may have lower throughput. 

 

Thanks!
Altera_Forum
Honored Contributor II

You are correct, there wouldn't be any benefit to using channels in that particular example. The author is just using it to show how to take a very simple NDRange kernel that copies a buffer and build the functional equivalent out of two kernels: one reads from memory and stuffs the data into a channel, while the other pulls the data out of the channel and places it into memory. Note that the code is fairly old and it contains an attribute called autorun that isn't supported. If I were to code the example in Figure 4 with the tools today, I would just have two task kernels (single work-item execution) with a loop that dictates how many times memory is read/written. 
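 

For reference, a sketch of what that two-task-kernel version might look like with today's tools. The channel name, depth, and extension pragma follow the Altera SDK conventions but are illustrative, not code from the paper:

```c
#pragma OPENCL EXTENSION cl_altera_channels : enable

channel int copy_ch __attribute__((depth(8)));  /* illustrative depth */

/* Task kernel 1: read from memory, push into the channel */
__kernel void mem_read(__global const int *restrict src, int n)
{
    for (int i = 0; i < n; ++i)
        write_channel_altera(copy_ch, src[i]);
}

/* Task kernel 2: pull from the channel, write back to memory */
__kernel void mem_write(__global int *restrict dst, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = read_channel_altera(copy_ch);
}
```

Launching both as single work-item tasks gives the same buffer copy as the original NDRange kernel, with the loop trip count `n` replacing the autorun attribute.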

 

I recommend taking a look at the design examples that mention channels in the feature column here: http://www.altera.com/support/examples/opencl/opencl.html The ones that use channels do so for good reasons, and studying them will help you see whether something similar applies to your own kernels.
Altera_Forum
Honored Contributor II

I should mention that there are a few small typos in Fig. 6 of that paper. The fifth line should read "int Sd_private[2][N];", the eleventh line "int Sd = Sd_private[0][j];", and the sixth line from the bottom "Sd_private[1][j + 1] = S;". I had copied and pasted from the channel example in Fig. 5 and forgot to account for the channel depth, which has to be managed manually using shift registers. The current preferred implementation should follow Fig. 6 rather than Fig. 7. 
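 

For readers following along, the "manually managed depth" pattern is just the classic shift-register idiom. A minimal sketch in plain C (the names and depth are mine, not from the paper):

```c
#define DEPTH 4   /* stands in for the channel depth being emulated */

/* Shift a new value in at the tail and return the value that falls
   off the head -- the same effect as a DEPTH-deep FIFO. On the FPGA
   the fully unrolled loop is inferred as a chain of registers. */
int shift_push(int reg[DEPTH], int x)
{
    int head = reg[0];
    for (int i = 0; i < DEPTH - 1; ++i)
        reg[i] = reg[i + 1];
    reg[DEPTH - 1] = x;
    return head;
}
```

A value pushed in emerges after exactly DEPTH more pushes, which is what stands in for the channel's latency.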

 

Btw, thanks BadOmen for accurately responding to this post.
Altera_Forum
Honored Contributor II

Sean, 

 

Thanks for the clarifications. That makes the "shift register" implementation much clearer. I've also been looking at the tdfir example optimization slides (http://www.altera.com/support/examples/download/exm_opencl_tdfir_optimization_guide.pdf), which explain this optimization in more detail. 

 

Thanks again BadOmen and sean.settle for your help!
Altera_Forum
Honored Contributor II

Sean, 

 

When you say the preferred implementation should follow Fig. 6 (single work-item execution with unrolled inner loop using shift registers) vs. Fig. 7 (an NDRange implementation with outer loop distributed among work-items) why is this the case? Is this a programmability consideration? Or are there performance implications?
Altera_Forum
Honored Contributor II

Hi Jack, 

 

I was told that by our compiler team, and the programming guide says there may be performance implications depending on the kernel. Both approaches are functionally identical. 

 

Do you have a preference for one approach over the other, and if so, could you explain why?
Altera_Forum
Honored Contributor II

Hello Jack and Sean, 

 

I used both the NDRange and shift register implementations for a 7x7 convolution. 

To obtain acceptable performance (my goal was 1 work-item per clock cycle), the NDRange implementation required the use of local memory, which implies copying a portion of the image plus the 3 neighboring rows above it, the 3 neighboring rows below it, and the 3 neighboring columns on each side, for every work-group. That creates a lot of redundancy and uses a lot of local memory. 

 

The shift-register implementation only needs one buffer holding 6 lines plus 7 pixels of the image, which uses fewer resources for slightly better throughput. 
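 

koper's line buffer can be illustrated in plain C. This sketch shrinks the window to 3x3 and assumes a small image width (W) to keep it short; the 7x7 version from the post just uses a (6*W + 7)-element buffer instead:

```c
#define W   16                   /* image width (illustrative) */
#define K   3                    /* window size; the post uses 7 */
#define BUF ((K - 1) * W + K)    /* 2 lines + 3 pixels for K = 3 */

/* Shift one new pixel into the sliding line buffer. On the FPGA
   this fixed-length shift is inferred as registers. */
void lb_push(int buf[BUF], int px)
{
    for (int i = 0; i < BUF - 1; ++i)
        buf[i] = buf[i + 1];
    buf[BUF - 1] = px;
}

/* Sum the K x K window currently held in the buffer; rows of the
   window are spaced W elements apart. */
int lb_window_sum(const int buf[BUF])
{
    int s = 0;
    for (int r = 0; r < K; ++r)
        for (int c = 0; c < K; ++c)
            s += buf[r * W + c];
    return s;
}
```

Each pixel is read from global memory exactly once, which is where the resource and bandwidth savings over the local-memory NDRange version come from.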

 

Hope this helps. 

Regards 

 

koper
Altera_Forum
Honored Contributor II

This is very interesting, thanks for the reply koper.  

 

I'm still gaining familiarity with the single work-item/pipelined programming model, and currently I don't have a preference between NDRange and pipelined task from a programmability point of view. I'm coming from the GPU side of things, so thinking in terms of implicit pipeline parallelism rather than explicit work-group/work-item parallelism was difficult at first. It's definitely a much cleaner way to program now that I understand how to create shift registers. Koper makes a good point, though: because OpenCL does not let work-items share private memory, if you want to take advantage of the large number of on-chip registers for inter-stage communication of tokens, you either have to use the task model or use channels.
Altera_Forum
Honored Contributor II

Hi Jack, 

 

I want to expand on your last sentence. You can use task-kernel techniques such as shift registers within an NDRange kernel, since a task kernel is just a single work-item, single work-group NDRange kernel. For example, for a moving-average filter you could divide and conquer the input domain and have each work-item in a work-group process its (overlapping) subdomain using its own shift register. You could scale your throughput easily by specifying how many work-items are in a work-group using __attribute__ ((reqd_work_group_size(X, 1, 1))) until you consume some desired portion of your device's resources (logic, registers, memory bandwidth, etc.). I recommend giving it a try to see how it works for your problems.
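 

A rough sketch of that idea, with each work-item running its own shift-register moving average over a contiguous subdomain. The kernel name, tap count, group size, and the non-overlapping chunking are my own simplifications for illustration:

```c
#define TAPS 4

__attribute__((reqd_work_group_size(8, 1, 1)))  /* 8 work-items per group */
__kernel void moving_avg(__global const float *restrict in,
                         __global float *restrict out,
                         int chunk)              /* samples per work-item */
{
    int base = get_global_id(0) * chunk;         /* this item's subdomain  */
    float reg[TAPS] = {0.0f};                    /* private shift register */

    for (int i = 0; i < chunk; ++i) {
        #pragma unroll
        for (int t = 0; t < TAPS - 1; ++t)       /* shift; inferred as registers */
            reg[t] = reg[t + 1];
        reg[TAPS - 1] = in[base + i];

        float sum = 0.0f;
        #pragma unroll
        for (int t = 0; t < TAPS; ++t)
            sum += reg[t];
        out[base + i] = sum / TAPS;
    }
}
```

Raising X in reqd_work_group_size replicates the datapath, so throughput scales with work-items until the device's logic or memory bandwidth runs out, as described above.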