
Logic utilization of hello-world vector add

Altera_Forum
Honored Contributor II

Hello. I recently compiled the vector add hello world example that comes with the Terasic DE1 BSP (Cyclone V). The report file generated during compilation of the kernel says that 15% of my logic elements are already used: 

logic utilization: 5,058 / 32,070 ( 15 % ) 

Did I get this right? Are my options for writing OpenCL kernels quite limited on the Cyclone V if a single vector add kernel needs that many logic elements? 

Am I right that this report file generated during kernel compilation has nothing to do with the size of the vectors specified by the host program (N = 1000000)? 

Thanks :)
Altera_Forum
Honored Contributor II

Cyclone V is a very small FPGA; you shouldn't expect to be able to fit a large design on it, even with HDL. 

 

In the particular case of OpenCL, the BSP, which holds the necessary IP cores and logic for host/device communication and the OpenCL runtime, uses quite a bit of area on its own, but this amount is fixed regardless of your kernel. In your case, the BSP probably accounts for around 10 of those 15 percentage points and is actually bigger than the vector add kernel itself. 

 

The input size in host code does not affect area utilization; the only things that affect area utilization on the FPGA are the kernel code and the BSP.
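
To illustrate (a minimal host-side sketch; I'm assuming context, queue, and kernel have already been created, and the buffer and variable names are made up for this example): N only sizes the buffers and the NDRange at run time, while the kernel hardware in the .aocx is fixed at kernel compile time.

/* N is a runtime parameter only; it does not change the bitstream */
size_t N = 1000000; /* vector length chosen by the host */
cl_int err;
cl_mem buf_a = clCreateBuffer(context, CL_MEM_READ_ONLY,
                              N * sizeof(float), NULL, &err);
/* ... create buf_b and buf_c, write inputs, set kernel args ... */
size_t global_work_size = N;
/* N work-items are scheduled onto the same fixed pipeline in the
   .aocx; a larger N means more pipeline iterations, not more hardware */
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                             &global_work_size, NULL, 0, NULL, NULL);
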
Altera_Forum
Honored Contributor II

Okay, thanks. Your answer brought up a new question: 

What happens with the size of the vector in the host program?  

I thought setting global_work_size = N runs the kernel N times in parallel. But when programming the .aocx file onto the FPGA, only one instance (one adder) of the vector add is built?
Altera_Forum
Honored Contributor II

To put it simply (and perhaps somewhat inaccurately), there is no thread-level parallelism on the FPGA unless you use SIMD. In other words, as you said, there will be only one adder on the FPGA, with work-items "looping" over that adder. Of course, the adder is pipelined, and hence multiple work-items (threads) can populate different stages of the adder at the same time; speed-up is achieved using pipelining instead of thread-level parallelism. With SIMD (applied to the kernel), however, there will be multiple adders and multiple work-items actually running in parallel. I recommend reading the first section of the "Intel® FPGA SDK for OpenCL Best Practices Guide" for more info on this specific subject.
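
For reference, the NDRange vector add kernel looks roughly like this (a sketch from memory of the example, not copied from the BSP); every work-item handles one index, and all of them flow through the same single pipelined adder:

__kernel void vector_add(__global const float *x,
                         __global const float *y,
                         __global float *restrict z)
{
    int i = get_global_id(0); /* each work-item processes one element */
    z[i] = x[i] + y[i];       /* one adder, shared by all work-items via pipelining */
}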

Altera_Forum
Honored Contributor II

Thanks for pointing out what to read. 

 

So if I use the vector_add example with e.g. 1024 elements  

and put the following attributes on the kernel: 

__attribute__((num_simd_work_items(4))) 

__attribute__((reqd_work_group_size(256,1,1))) 

__kernel void vector_add ... 

 

would 4 kernels run in parallel, with 256 work-items processed in each of them? 

Does the SIMD factor * work_group_size need to equal the number of elements?

 

In the Intel video guide "Writing OpenCL Programs for Intel FPGAs", the execution model is divided into single work-item execution (with loop pipelining) and NDRange kernels: 

 

So if I want to use the data-parallel NDRange execution, I have to work with clEnqueueNDRangeKernel + SIMD? 

How can I "tell" the kernel that I want to use it as a single work-item execution? (The video says with an NDRange of (1,1,1), but I didn't manage to get it to work that way.) 

What's the difference between the single work-item execution that is set with an NDRange of (1,1,1) and the execution from the vector_add example? 

 

Sorry for asking again; I tried to figure it out the entire day but couldn't quite succeed. 

Thanks :)
Altera_Forum
Honored Contributor II

With a SIMD factor of 4, you will get one copy of the kernel (not 4), but with up to 4 work-items being issued in the same clock cycle. To have 4 kernel copies with 256 work-items running in each copy, you should replicate your kernel pipeline by using __attribute__((num_compute_units(4))). Needless to say, SIMD is faster and more area-efficient than kernel pipeline replication, and the latter should be avoided unless you cannot use SIMD or you have already used the maximum allowed SIMD value but still have area left on the FPGA. The "Intel® FPGA SDK for OpenCL Best Practices Guide", Section 1.7.3, discusses this subject. 
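
Side by side, the two options look like this (a sketch; the attribute values and the second kernel name are only for illustration):

/* SIMD: one pipeline, up to 4 work-items issued per clock.
   num_simd_work_items requires reqd_work_group_size, and the SIMD
   factor must evenly divide the work-group size. */
__attribute__((num_simd_work_items(4)))
__attribute__((reqd_work_group_size(256,1,1)))
__kernel void vector_add(__global const float *x,
                         __global const float *y,
                         __global float *restrict z)
{
    int i = get_global_id(0);
    z[i] = x[i] + y[i];
}

/* Compute units: four full copies of the pipeline, each with its own
   load/store units, so roughly 4x the area of a single copy */
__attribute__((num_compute_units(4)))
__kernel void vector_add_cu(__global const float *x,
                            __global const float *y,
                            __global float *restrict z)
{
    int i = get_global_id(0);
    z[i] = x[i] + y[i];
}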

 

For single work-item kernels: if you don't use functions like get_local_id(), get_global_id(), and get_group_id(), i.e. your kernel is work-group- and work-item-invariant, the compiler will automatically compile it as a single work-item kernel; anything else will be an NDRange kernel. You can also save a small amount of area by adding the max_global_work_dim(0) attribute to single work-item kernels to remove the work-item scheduler. 
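
As a sketch (the loop-based rewrite and the extra n parameter are my own illustration, not the shipped example), a single work-item version of vector add could look like:

__attribute__((max_global_work_dim(0))) /* optional: removes the scheduler */
__kernel void vector_add_swi(__global const float *restrict x,
                             __global const float *restrict y,
                             __global float *restrict z,
                             int n)
{
    /* no get_global_id() etc., so the compiler infers a single
       work-item kernel and pipelines the loop iterations */
    for (int i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}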

 

In NDRange kernels, threads (work-items) are scheduled onto the kernel pipeline by the runtime scheduler in a way that keeps the pipeline as busy as possible; this could also involve out-of-order execution. Obviously, no dependencies should exist between threads in this case, and data sharing between threads can only be done using local memory and barriers. In single work-item kernels, however, there is no scheduler involved anymore, and loop iterations are issued into the kernel pipeline at an initiation interval (II) that depends on the dependencies between loop iterations. 
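
On the host side (a sketch assuming the queue and the two kernel objects already exist; the variable names are made up), the two models are launched differently:

/* NDRange kernel: N work-items in work-groups of 256 */
size_t global = N, local = 256;
clEnqueueNDRangeKernel(queue, ndrange_kernel, 1, NULL,
                       &global, &local, 0, NULL, NULL);

/* Single work-item kernel: clEnqueueTask is equivalent to an
   NDRange of (1,1,1) with a work-group size of (1,1,1) */
clEnqueueTask(queue, swi_kernel, 0, NULL, NULL);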

 

I strongly recommend fully reading both the "Intel® FPGA SDK for OpenCL Programming Guide" and the "Intel® FPGA SDK for OpenCL Best Practices Guide"; most common questions are answered in those documents.