Hello everyone! A question has been puzzling me ever since I started working with OpenCL for FPGAs. As I understand it, a "work-item", one of the basic terms in the OpenCL model, corresponds to a processing element on a GPU, and a "work-group" corresponds to a compute unit. But on an FPGA, what does a "work-item" correspond to? A logic element, a DSP block, or something else? The hardware circuit generated by the OpenCL SDK is large and complex, so I can't understand how a kernel can be run as a single work-item.
FPGA hardware is not fixed and the resulting circuit can be vastly different depending on the kernel. Regardless of the kernel programming type (NDRange or single work-item), one or multiple pipelines are generated by Altera's compiler and in each case, either the work-items (for NDRange) or loop iterations (for single work-item) are considered as inputs to the pipeline(s), and are issued in some order and interval decided by the compiler. You can think of single work-item kernels as NDRange kernels that have been wrapped inside a for loop from 0 to global_size; in other words, in single work-item kernels, the loop iterations play a role similar to work-items in NDRange kernels.
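The "wrapped inside a for loop" picture above can be sketched in plain C. This is only an illustration that runs on a host CPU, not actual OpenCL kernel code: scale_work_item, run_as_ndrange and run_as_single_work_item are hypothetical names, and the gid parameter stands in for get_global_id(0).

```c
#include <assert.h>
#include <stddef.h>

/* Body of a hypothetical NDRange kernel; gid stands in for get_global_id(0). */
static void scale_work_item(size_t gid, const float *in, float *out) {
    out[gid] = 2.0f * in[gid];
}

/* NDRange view: one work-item is issued per global ID.
   The issuing is emulated here by a host-side loop. */
static void run_as_ndrange(const float *in, float *out, size_t global_size) {
    assert(in != NULL && out != NULL);
    for (size_t gid = 0; gid < global_size; ++gid) {
        scale_work_item(gid, in, out);
    }
}

/* Single work-item view: the loop is written inside the kernel itself,
   and each loop iteration plays the role of one work-item. */
static void run_as_single_work_item(const float *in, float *out,
                                    size_t global_size) {
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < global_size; ++i) {
        out[i] = 2.0f * in[i];
    }
}
```

Calling both functions on the same input gives identical output arrays. On the FPGA, the difference is only in what gets fed into the generated pipeline: work-item IDs in the first form, loop iterations in the second.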
Oh, thank you! I learned a lot. In addition, when I query the device info of the "de1-soc" board, the reported maximum number of compute units (CL_DEVICE_MAX_COMPUTE_UNITS) is 1. However, as we know, the attribute "__attribute__((num_compute_units(N)))" can be used to set the number of compute units. Resource usage does change when I increase num_compute_units, although performance decreases slightly. So I am puzzled: is there a limit on the number of compute units on an FPGA, or does the number of compute units (work-groups) depend only on the kernel code and the "__attribute__((num_compute_units(N)))" attribute?
The "compute units" of the OpenCL standard and the "compute units" of Altera's compiler are, despite sharing a name, two completely different things. I haven't tried this personally, but I believe that querying the number of compute units with the corresponding OpenCL function will always return 1 on FPGAs, since, unlike a standard GPU or CPU, an FPGA has no fixed, predefined compute units. If the number is queried after the FPGA has been programmed with the kernel, the function might report the correct number (though it probably won't). There is no restriction, other than the limited FPGA area, on using multiple compute units by adding __attribute__((num_compute_units(N))) to the kernel. Still, getting a performance improvement from this attribute requires at least two conditions to be met: 1) Altera recommends having at least three times as many work-groups as compute units in order to fully utilize the circuit; fewer work-groups will probably not yield much of an improvement. 2) Since multiple compute units result in more memory ports and significant memory contention between the units, the total memory bandwidth requirement of all units should be lower than the off-chip memory bandwidth to achieve a speed-up. For example, if one compute unit is already memory-bound on the FPGA, using more units will not only fail to improve performance but will actually decrease it.
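The two conditions above can be turned into a rough back-of-the-envelope check. The sketch below is plain C with hypothetical helper names (num_work_groups, enough_work_groups, max_speedup). The bandwidth model is my own simplifying assumption, not something Altera documents: it caps total throughput at the off-chip bandwidth and ignores contention, so it can only show that speed-up saturates, not the actual slowdown that contention may cause.

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-groups in a 1D NDRange.
   Assumes local_size evenly divides global_size. */
static size_t num_work_groups(size_t global_size, size_t local_size) {
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Condition 1: at least three times as many work-groups as compute units. */
static int enough_work_groups(size_t work_groups, size_t compute_units) {
    return work_groups >= 3 * compute_units;
}

static double min_double(double a, double b) {
    return a < b ? a : b;
}

/* Condition 2, as a crude upper bound (assumed model, contention ignored):
   each unit would like demand_per_unit of bandwidth, and the sum over all
   units is capped at the off-chip bandwidth. The result is the speed-up
   relative to a single unit on the same board. Units are arbitrary but
   must match (for example GB/s for both). */
static double max_speedup(size_t compute_units, double demand_per_unit,
                          double offchip_bandwidth) {
    assert(compute_units > 0 && demand_per_unit > 0.0 &&
           offchip_bandwidth > 0.0);
    double with_n = min_double((double)compute_units * demand_per_unit,
                               offchip_bandwidth);
    double with_one = min_double(demand_per_unit, offchip_bandwidth);
    return with_n / with_one;
}
```

In this model, a kernel whose single compute unit already needs the full off-chip bandwidth gets a bound of 1.0 no matter how many units are added, which matches the memory-bound case described above; on real hardware the extra contention can push the result below 1.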
I would encourage you to open, in a text editor, the two System Verilog files that "aoc -c" generates from your kernel. Their names are <kernel>_system.v and <kernel>.v. The former contains all compute units (or instances) of the pipelined kernel FPGA logic (I call it logic here because that is what you get in the end), and the latter is the kernel logic itself. It is worthwhile to learn the entire process of building a custom parallel computing machine for your code, that is, your OpenCL host/device code. The beauty of the FPGA is that you will never be limited to a CPU architecture that was conceived in 1945, or to the latest GPU silicon optimized only for the latest hype like CNN, and I don't mean "fake news". You can build any system to do anything with FPGAs, whereas with GPUs and CPUs you will be limited most of the time because they are fixed-logic silicon. As for bugs in the Intel/Altera API, I would recommend opening a Service Request at myAltera to point out that the API call returns 1 even when there are multiple compute units. Doing so will make OpenCL on FPGA more useful for everyone.