Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)

Slowdown when increasing the number of compute units

Altera_Forum
Honored Contributor II

Hi,  

 

I've recently started experimenting with increasing the number of compute units for a particular kernel. I've previously had some good results using manual vectorisation to reduce runtime.  

 

However, when I increase the number of compute units using __attribute__((num_compute_units(2))), the kernel runtime slows down by ~30%.
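
For context, the attribute is applied at the kernel definition like this (a minimal sketch only; the kernel name and body here are placeholders, not my actual kernel):

```c
// Minimal illustrative kernel; the name and body are placeholders.
__attribute__((num_compute_units(2)))
__kernel void example_kernel(__global const float *restrict in,
                             __global float *restrict out)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid] * 2.0f;
}
```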

 

If the kernel is bottlenecked on global memory bandwidth, I was expecting performance to at worst remain the same (is this a valid assumption?). However, I don't think it is fully maxing out the global memory bandwidth, as increasing the manual vectorisation level further does still improve performance.

 

I'd therefore be grateful for any assistance in working out what might be causing this.

 

Looking at the logic utilisation of both kernels, the one that uses 2 compute units does use more logic: 54% vs 39%.

 

I also notice that when querying the device the max number of compute units (CL_DEVICE_MAX_COMPUTE_UNITS) is returned as 1. The device in question is a pcie385n_d5 from Nallatech.  

 

Does the max number of compute units refer to the same thing as "num_compute_units"? And is there any way to confirm that the compiler is actually creating more compute units?
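
For reference, this is roughly how I am querying it on the host side (a sketch only; it assumes `device` was obtained from clGetDeviceIDs, and error handling is omitted):

```c
#include <stdio.h>
#include <CL/cl.h>

/* Prints the device's reported compute unit count.
 * Assumes `device` is a valid cl_device_id; error handling omitted. */
static void print_max_compute_units(cl_device_id device)
{
    cl_uint max_cus = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(max_cus), &max_cus, NULL);
    printf("CL_DEVICE_MAX_COMPUTE_UNITS = %u\n", max_cus);
}
```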

 

Could the workgroup size that I am launching the kernel with have an effect on performance? E.g. is there a minimum size which I should specify for this sort of setup? I'm just wondering if this could affect how the 2 compute units are accessing memory, e.g. causing bank conflicts or similar. Could specifying larger workgroup sizes help?

 

Many thanks
8 Replies
Altera_Forum
Honored Contributor II

If a single compute unit is bottlenecked by global memory, putting two into the hardware can reduce memory efficiency because of the access pattern between the two compute units. When you have multiple compute units you end up duplicating the load/store units that access memory. SDRAM operates most efficiently when accessed sequentially, so if you have two compute units accessing different regions of memory, the result will be a less sequential memory access pattern. Vectorizing the kernel may improve performance simply because narrow accesses will become coalesced.

 

I have not queried the FPGA to determine things like the maximum number of compute units, so I'm not sure if returning 1 is the expected behavior when you instruct the compiler to create 2. There is no minimum workgroup size, but if you don't launch the hardware with a large number of work-items you may run into performance issues (hard to tell without seeing the kernel). The maximum work-group size by default is 256, so if you have a different maximum size (or a fixed size) there are attributes that you can set for those.
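
For example, those work-group size attributes look like this (values here are purely illustrative, not tuned for your kernel):

```c
// If the kernel is always launched with a fixed work-group size:
__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void fixed_wg(__global float *restrict data) { /* ... */ }

// If you need a maximum work-group size other than the 256 default:
__attribute__((max_work_group_size(1024)))
__kernel void large_wg(__global float *restrict data) { /* ... */ }
```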

 

Instead of manually vectorizing the kernel, have you tried using the num_simd_work_items attribute? It vectorizes the kernel for you instead of you having to manually change all your data to vector types. There will still be cases where manual vectorization is ideal, but when you are prototyping I recommend giving it a shot, since it lets you change the vector size quickly.
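
A sketch of how the attributes combine (values are illustrative; note that num_simd_work_items needs a reqd_work_group_size that is evenly divisible by the SIMD width):

```c
// Sketch: 2 compute units, each vectorized 4-wide by the compiler.
// The kernel body and sizes are placeholders, not a real design.
__attribute__((num_compute_units(2)))
__attribute__((num_simd_work_items(4)))
__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void simd_kernel(__global const float *restrict in,
                          __global float *restrict out)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid] * 2.0f;
}
```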
Altera_Forum
Honored Contributor II

Hi, for 2 compute units, the ratio of global work size to local work size should be 8 or larger, for example 1024/128.

Altera_Forum
Honored Contributor II

Hi wzk6_3_8, badomen,  

 

many thanks for the feedback.  

 

wzk6_3_8, could you possibly clarify why this is the case?
Altera_Forum
Honored Contributor II

There is no general rule of thumb for the global-to-local size ratio.

 

My recommendation for you is to evaluate how to reduce your global memory bandwidth requirements, as well as to determine whether the access pattern to global memory is efficient. For example, if you load the same data from global memory multiple times, attempt to store that data to local memory first so that you can re-use it on-chip, where there is more memory bandwidth. If your kernel doesn't access global memory sequentially, that can also cause performance problems. For example, this would be a good access pattern:

 

data = some_global_location[get_global_id(0)];

 

This would be a bad access pattern from a bandwidth perspective: 

 

data = some_global_location[some_variable % 5];
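
As a rough sketch of the local-memory re-use idea (the kernel shape and TILE size are made up for illustration, not taken from the poster's kernel):

```c
// Stage re-used global data in local memory once, then read it on-chip.
// TILE and the reduction are illustrative placeholders.
#define TILE 64
__kernel void reuse_sketch(__global const float *restrict in,
                           __global float *restrict out)
{
    __local float tile[TILE];
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);

    tile[lid] = in[gid];               // one sequential global load per item
    barrier(CLK_LOCAL_MEM_FENCE);

    float acc = 0.0f;
    for (int i = 0; i < TILE; i++)     // many on-chip reads instead of
        acc += tile[i];                // repeated global reads
    out[gid] = acc;
}
```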
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

Hi wzk6_3_8, badomen,  

 

many thanks for the feedback.  

 

wzk6_3_8, could you possibly clarify why this is the case? 

--- Quote End ---  

 

The first compute unit will be dispatched 4 work-groups before the second compute unit has a chance to get a work-group.
Altera_Forum
Honored Contributor II

The scheduler dispatches work in work-group units based on compute unit availability. The ordering of the scheduling to each compute unit is not easily predictable for users, so I wouldn't count on any particular ordering. Expect the scheduler to do a good job keeping the compute units fed with work.

Altera_Forum
Honored Contributor II

We can modify the IP file for the dispatch module to achieve a relatively fair schedule.

--- Quote Start ---  

The scheduler dispatches work in work-group units based on compute unit availability. The ordering of the scheduling to each compute unit is not easily predictable for users, so I wouldn't count on any particular ordering. Expect the scheduler to do a good job keeping the compute units fed with work.

--- Quote End ---  

Altera_Forum
Honored Contributor II

Keep in mind that if you modify the output files generated by the OpenCL kernel compiler you are on your own, since you may end up breaking the functionality of the kernel. This is no different than hacking up the RTL for any IP core, so if functionality breaks as a result and you require support from Altera, you will probably be asked to revert to a supported flow.
