Confusion about execution of NDRange kernel and single work-item kernel

Altera_Forum · 07-18-2017

Hello,

I'm a bit confused about the difference between how an NDRange kernel and a single work-item kernel execute on an FPGA.

How do the work-items of an NDRange kernel execute in parallel on an FPGA? Is it through pipelining? If so, then the parallelism isn't strict parallelism in the way a GPU provides it?

Would a single work-item kernel execute in a pipelined manner too?

Altera_Forum · 07-18-2017

In NDRange kernels, threads are pipelined; in single work-item kernels, loop iterations are pipelined. The former still has a scheduler and could issue threads out of order; the latter does not.

In the normal case, there will be only one pipeline in an NDRange kernel, with all the threads from all work-groups being pipelined in the order decided by the scheduler. Two threads will NEVER be issued in the same clock cycle, even though multiple threads can be active in different stages of the pipeline in each clock cycle. Still, it is also possible to achieve "thread-level parallelism" by employing the SIMD or kernel pipeline replication attributes; these two attributes allow two or more threads from the same or different work-groups to be issued onto different copies of the pipeline in the same clock cycle.
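As a rough illustration of the issue model described above, the following C sketch counts clock cycles for an idealized pipeline. The function name `pipeline_cycles`, the parameter values, and the closed-form formula are assumptions made for illustration only; they are not taken from the compiler or its documentation, and stalls of any kind are ignored.

```c
#include <assert.h>

/*
 * Idealized cycle count for pushing `threads` work-items through a
 * pipeline that is `depth` stages deep, when `copies` copies of the
 * pipeline each accept one new work-item per clock.
 *
 * copies == 1 models the plain NDRange case: one issue per clock,
 * with many work-items in flight in different stages.
 * copies > 1 models SIMD or pipeline replication: several work-items
 * are issued in the same clock, one per copy.
 *
 * Assumption: no stalls (memory, barriers, scheduler) are modeled.
 */
long pipeline_cycles(long threads, long depth, long copies)
{
    assert(threads >= 0 && depth >= 1 && copies >= 1);
    if (threads == 0) {
        return 0;
    }
    /* Number of issue slots needed: ceil(threads / copies). */
    long issue_slots = (threads + copies - 1) / copies;
    /* The first issue needs `depth` clocks to drain; each later issue adds one. */
    return depth + issue_slots - 1;
}
```

With these assumed numbers, `pipeline_cycles(1024, 100, 1)` gives 1123 clocks and `pipeline_cycles(1024, 100, 4)` gives 355 clocks, which is the difference between pipelining alone and pipelining combined with thread-level parallelism.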

Disclaimer: This is my personal understanding of how the compiler works; the reality could actually be different.

Altera_Forum · 07-19-2017

HRZ, thanks for your assistance, but I still have some more questions.

Does loop unrolling contribute to loop pipelining? As I understand it, loop unrolling just replicates the operations of one iteration. So how can a fully unrolled loop in a single work-item kernel be pipelined?

Altera_Forum · 07-19-2017

HRZ, do you work for Intel Corp.? I have read the AOCL Users Guide and the AOCL Best Practices Guide, but some AOCL concepts and the behavior of aoc still seem vague to me. Are there any other readings or training that could help me make them clear? I'm in Peking, China.

Thanks!

Altera_Forum · 07-19-2017

Loop unrolling will deepen and widen the pipeline to allow multiple iterations to be handled simultaneously; of course, this is only possible if there are no dependencies between iterations of the loop. You can only fully unroll loops whose bounds are known at compile time. The compiler will just create the logic necessary to handle all the unrolled iterations simultaneously, and that is it. If you have an outer loop in this case, the iterations of the outer loop will still be pipelined. If, however, you have just one loop in your kernel and it is fully unrolled, then the pipeline will only be traversed once and execution finishes.
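To make the idea of "widening" concrete, here is a minimal plain C sketch that writes out by hand what an unroll factor of 4 does to a loop with independent iterations. The function names, the `UNROLL` constant, and the vector-add body are hypothetical choices for illustration; in a real kernel the unroll pragma would ask the compiler to perform this transformation, and how the resulting hardware is built is up to the compiler.

```c
#include <assert.h>
#include <stddef.h>

#define UNROLL 4  /* hypothetical unroll factor */

/* Rolled form: one addition per loop iteration. */
void vec_add_rolled(const int *a, const int *b, int *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

/*
 * Hand-unrolled form: each pass through the loop body performs UNROLL
 * additions that do not depend on each other, so hardware could in
 * principle compute all of them side by side.
 *
 * Assumption: n is a multiple of UNROLL, so no remainder loop is needed.
 */
void vec_add_unrolled(const int *a, const int *b, int *c, size_t n)
{
    assert(n % UNROLL == 0);
    for (size_t i = 0; i < n; i += UNROLL) {
        c[i + 0] = a[i + 0] + b[i + 0];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
}
```

Both functions produce the same output; the unrolled one simply has a quarter as many loop iterations, each doing four times the work.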

I do not work for Intel; I have just been working with this compiler for over two years, so I have some basic idea of what is actually happening. Other than Altera's existing documents, there aren't any other solid and accurate documents. If you have access to publishers like IEEE/ACM/etc., there are a lot of papers to read on this subject, though. After two years, there is still a lot that even I don't fully understand; there isn't much else you can do when you are working with a closed-source commercial compiler.

Altera_Forum · 07-19-2017

Lots of free online training to help you out:

https://www.altera.com/support/training/catalog.html?coursetype=online&keywords=opencl

Altera_Forum · 07-20-2017

I am a graduate student at USTC (Hefei), and I'm now doing some work with the OpenCL SDK for FPGA. I'm not sure whether you want to discuss this SDK with me, but please contact me on WeChat if you like. My ID is xhsygd.

Altera_Forum · 07-20-2017

Yes, of course. But the user with the ID xhsygd does not exist. My WeChat ID is 17888827603. Looking forward to discussing this with you.

Altera_Forum · 07-20-2017

Thanks, but to me the free online training still doesn't seem to be enough to get a thorough understanding of AOCL.

Altera_Forum · 07-20-2017

Thanks, HRZ.

So, in a single work-item kernel, for a loop with dependencies between iterations, even with the unroll pragma before the loop, the compiler will not unroll it, right?

Altera_Forum · 07-20-2017

It will, but you will not get an II (Initiation Interval) of one anymore, and the II will increase as you unroll more. This way, you will just end up wasting more area without getting higher performance.
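The effect on performance can be sketched with a simple cycle-count model in plain C. Everything here is an assumption for illustration: the function names, the formula `depth + (trip - 1) * II`, the idea that the II grows linearly with the unroll factor when iterations depend on each other, and the simplification that the pipeline depth stays the same after unrolling. The actual II chosen by the compiler may differ.

```c
#include <assert.h>

/*
 * Idealized cycles for a pipelined loop: the first iteration takes
 * `depth` clocks, and each later iteration starts `ii` clocks after
 * the previous one.
 */
long loop_cycles(long trip, long depth, long ii)
{
    assert(trip >= 0 && depth >= 1 && ii >= 1);
    if (trip == 0) {
        return 0;
    }
    return depth + (trip - 1) * ii;
}

/*
 * Loop with a dependency between iterations, unrolled by `unroll`.
 * Assumption: each original iteration must wait `dep_latency` clocks
 * for the previous one, so an unrolled body chains `unroll` of them
 * and the II becomes unroll * dep_latency.
 * Assumption: trip is a multiple of unroll, and depth is unchanged.
 */
long unrolled_dependent_cycles(long trip, long depth, long unroll,
                               long dep_latency)
{
    assert(unroll >= 1 && dep_latency >= 1 && trip % unroll == 0);
    return loop_cycles(trip / unroll, depth, unroll * dep_latency);
}
```

Under these assumptions, a 1024-iteration dependent loop with depth 50 takes 1073 clocks without unrolling and 1070 clocks when unrolled by 4, so the extra area buys almost nothing. The same loop without the dependency, unrolled by 4 and still at an II of one, would take `loop_cycles(256, 50, 1)`, which is 305 clocks.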

Altera_Forum · 07-20-2017

I see. Thank you very much!