Re: [FPGA SDK for OpenCL] Problem with setting multiple compute units

wwood10 · 05-16-2019

I have recently been trying to compile an NDRange kernel with 4 compute units (using the num_compute_units attribute); however, when I view the report, it says that the number of compute units is 1. Also, when I program my device with this kernel and call clGetDeviceInfo with CL_DEVICE_MAX_COMPUTE_UNITS, it also returns 1. Is there something I am missing?

It seems like this is the correct way to create more compute units, and the design is taking up more FPGA resources than my previous design without the attribute, but it does not perform any differently.

I have attached a screenshot of the report which also shows the source code for the kernel. Thanks for the help!

HRZ · 05-17-2019

The number of compute units in the report for NDRange kernels is always reported as 1. That is a bug I reported to Intel a long time ago and they confirmed it. I don't think they have fixed it yet, though.

With respect to CL_DEVICE_MAX_COMPUTE_UNITS, that query reflects physical compute units on the target device, which will always be 1 in the case of an FPGA. The compute units created using num_compute_units are logical compute units.

Finally, your code does not use any work-groups (no get_local_id()/get_group_id()) and hence will not benefit from compute unit replication. This feature allows multiple work-groups to run in parallel, but your code only uses one work-group.

wwood10 · 05-17-2019

Thanks for the help.

I tried using get_local_id()/get_group_id() in a new design (I have attached an image of its report); however, it still performs the same.

One strange thing I have noticed is that CL_DEVICE_MAX_WORK_ITEM_SIZES returns (0,17,52) and CL_DEVICE_MAX_WORK_GROUP_SIZE returns 2147483647. These numbers seem a bit strange to me.

For context, I run the kernel with clEnqueueNDRangeKernel(queue_, kernel_, 1, NULL, gSize_, wgSize_, 0, NULL, NULL); where wgSize_[3] = {WORK_ITEM_SIZE, 1, 1} and gSize_[3] = {BUFFER_SIZE, 1, 1}. I assume I do not need to enqueue a command for each work-group, right?

HRZ · 05-20-2019

No, you don't need to enqueue a separate command for each work-group; everything is handled automatically. How many work-groups are you using? The guide recommends at least 3x more work-groups than compute units to see a reasonable performance benefit. Furthermore, if your application is memory-unfriendly (e.g., random memory accesses) or one compute unit already saturates the memory bandwidth, you are not going to see any performance benefit from using multiple compute units.
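To make the "3x more work-groups than compute units" check concrete, here is a minimal host-side sketch in plain C. The helper names `num_work_groups` and `enough_work_groups` are hypothetical (they are not part of the OpenCL API); the only assumption is the OpenCL 1.x rule that the global size must be a multiple of the local size.

```c
#include <stddef.h>
#include <stdbool.h>

/* Number of work-groups launched along one NDRange dimension.
 * OpenCL 1.x requires the global size to be evenly divisible by the
 * local size, so 0 is returned to flag an invalid combination. */
static size_t num_work_groups(size_t global_size, size_t local_size)
{
    if (local_size == 0 || global_size % local_size != 0)
        return 0;
    return global_size / local_size;
}

/* True if the launch provides at least 3x as many work-groups as
 * compute units (the guideline quoted above). */
static bool enough_work_groups(size_t global_size, size_t local_size,
                               size_t compute_units)
{
    size_t groups = num_work_groups(global_size, local_size);
    return groups != 0 && groups >= 3 * compute_units;
}
```

With the names used earlier in this thread, the check would be `enough_work_groups(BUFFER_SIZE, WORK_ITEM_SIZE, 4)`. If it returns false, the launch either has too few work-groups to keep 4 compute units busy or uses sizes that do not divide evenly.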

shubham10 · 04-08-2022

Hi,

Is there any way by which we can get/request more than one physical compute unit on the underlying FPGA chip (say S10PAC)?

Thanks

HRZ · 05-17-2022

What exactly are you trying to achieve by that? An FPGA design is not fixed and the underlying FPGA architecture does not have any notion of a "compute unit"; "compute unit" is simply an OpenCL term which doesn't necessarily map to anything meaningful on an FPGA.

You can always compile and synthesize multiple kernels into one bitstream and run them in parallel in different queues, if that is what you are trying to achieve. There are also ways to automatically create/duplicate compute units in both Single Work-item and NDRange kernels.

shubham10 · 05-18-2022

Hi HRZ,

Thanks for the help.

Actually, one compute unit runs one workgroup of the invoked kernel (say kernel1) at a time. Once it completes the execution of the workgroup, it picks the next workgroup of kernel1 from the command queue and then starts processing that.

The reason I am asking for multiple physical compute units is that, if we get multiple physical compute units on the FPGA, we can run multiple workgroups of the same kernel in parallel (one compute unit processing one workgroup) to get parallelism among the different workgroups of the same kernel.

Thanks

HRZ · 05-21-2022

That is exactly what the num_compute_units(X) attribute does. It will automatically replicate your compute unit "X" times and distribute work-groups across them. At the same time, even one compute unit can pipeline work-items across different work-groups, so there could be more than one work-group running inside a single compute unit at any given time. Hence, Intel recommends using 3 times more work-groups than compute units to maximize compute unit utilization. Usage of this attribute is explained here:

https://www.intel.com/content/www/us/en/docs/programmable/683846/22-1/specifying-number-of-compute-units.html

Note that this attribute will NOT change CL_DEVICE_MAX_COMPUTE_UNITS, since there is still one physical FPGA on the board.
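
For readers who cannot see the attached screenshots, here is a minimal sketch of how the attribute is applied to an NDRange kernel. The kernel name `scale`, its arguments, and the sizes 64 and 4 are hypothetical and chosen only for illustration; this is not the original poster's code.

```c
/* OpenCL C kernel sketch (illustrative names and sizes).
 * reqd_work_group_size fixes the work-group size at compile time;
 * num_compute_units(4) asks the offline compiler to replicate the
 * kernel pipeline four times. */
__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((num_compute_units(4)))
__kernel void scale(__global const float * restrict in,
                    __global float * restrict out,
                    const float factor)
{
    /* Each work-item handles one element; work-groups are handed out
     * to the replicated compute units by the hardware. */
    size_t gid = get_global_id(0);
    out[gid] = in[gid] * factor;
}
```

With this sketch, the host would have to launch with a local size of 64 to match reqd_work_group_size, and the global size should give well over 4 work-groups for the replication to pay off.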