Re: NDRange, work-item level parallelism vs work-group level parallelism

Altera_Forum · ‎12-12-2017

Hello,

I have a question about NDRange kernels. Suppose we have an NDRange kernel on 1 device with 4 CUs (compute units) and 1 PE inside each CU (1 PE means no SIMD). I already know that loop pipelining is disabled in NDRange kernels. Now consider the 2 recommendations below:

- Use a large enough work-group size to benefit from multi-threading many work-items over that single PE. I can guess why: the PE is probably pipelined over work-items (is that right?), so the pipeline is used efficiently only if there are many work-items.

- Use a large number of work-groups to benefit from multiple CUs. I really do not understand this one. Aren't CUs completely independent? Why does the tool recommend this when I have multiple CUs? How can there be parallelism at the work-group level?

Thanks

Altera_Forum · ‎12-13-2017

--- Quote Start ---

- Use a large enough work-group size to benefit from multi-threading many work-items over that single PE. I can guess why: the PE is probably pipelined over work-items (is that right?), so the pipeline is used efficiently only if there are many work-items.

--- Quote End ---

This is true.

--- Quote Start ---

- Use a large number of work-groups to benefit from multiple CUs. I really do not understand this one. Aren't CUs completely independent? Why does the tool recommend this when I have multiple CUs? How can there be parallelism at the work-group level?

--- Quote End ---

This is more or less the same concept as above. Let's say you have a total of six work-groups, and a CU takes X seconds to process one work-group. In this case, four work-groups will be scheduled onto the four available CUs simultaneously. When they finish, the remaining two work-groups are scheduled onto two CUs, leaving the other two CUs unused. In the end, the whole run finishes after 2X seconds. Basic arithmetic tells you that even with only three CUs, the run time would still be 2X (two rounds of three work-groups each); hence, you get no benefit from the extra CU, because you do not have enough work-groups to keep all the CUs busy all the time. However, if you have a large enough number of work-groups, four CUs will give ~33% higher throughput than three (a ratio of 4/3). Note that this is the theoretical case; in practice, performance scaling with multiple CUs also depends on external memory bandwidth and operating frequency.
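
The arithmetic above can be checked with a small sketch. It is only an idealized model, not a description of how the hardware scheduler works: it assumes every work-group takes the same time X and that each CU processes one work-group at a time. The function name `runtime_in_units_of_x` is made up for this illustration.

```python
import math

def runtime_in_units_of_x(num_work_groups: int, num_cus: int) -> int:
    """Idealized run time, in multiples of X.

    Assumes every work-group takes exactly X seconds and each CU
    handles one work-group at a time (no work-group pipelining).
    """
    return math.ceil(num_work_groups / num_cus)

# Six work-groups: the fourth CU does not help.
print(runtime_in_units_of_x(6, 4))  # 2
print(runtime_in_units_of_x(6, 3))  # 2

# Many work-groups: three CUs take 4/3 as long as four CUs.
many = 12000
ratio = runtime_in_units_of_x(many, 3) / runtime_in_units_of_x(many, 4)
print(ratio)
```

With 12000 work-groups, the three-CU run takes 4000X and the four-CU run takes 3000X, which is where the ~33% figure comes from.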

Altera_Forum · ‎12-13-2017

Then do you mean that, for the number of work-groups on CUs, divisibility is what matters? If so, why is it not recommended to use a number of work-groups divisible by the number of CUs? Assuming that all work-groups have the same run time (which may not always be true if the kernel code has group-ID-dependent control statements), running exactly 4 work-groups (1 per CU) is enough to leave no CU idle; there is no need for many work-groups per CU.

Another related question: is there any pipelining at the work-group level, meaning that the next work-group enters a CU while the previous one is still there?

Thanks for your help.

Altera_Forum · ‎12-13-2017

Maybe I simplified things a little too much. It is not just about divisibility. As you said, there is no guarantee that work-groups running in different CUs will finish at the same time, so some CUs will inevitably sit idle part of the time. However, with more work-groups, the chance of a CU sitting idle gets smaller, resulting in closer-to-linear speed-up with the number of CUs. Furthermore, at least based on what Altera's report claims, work-group pipelining is also in place, so multiple work-groups can be in flight in the same CU at the same time, and having more work-groups further helps to keep the CU busy.
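
To illustrate the effect of uneven work-group run times, here is a rough sketch. It assumes that each work-group, in order, is handed to whichever CU becomes free first; that dispatch policy is an assumption made for this example, not something documented for Altera's scheduler, and the model ignores work-group pipelining inside a CU. The run times and the names `makespan` and `utilization` are made up.

```python
import heapq

def makespan(runtimes, num_cus):
    """Finish time when each work-group goes to the CU that frees up first."""
    free_at = [0] * num_cus          # time at which each CU becomes idle
    heapq.heapify(free_at)
    for t in runtimes:
        start = heapq.heappop(free_at)       # earliest available CU
        heapq.heappush(free_at, start + t)   # busy until start + t
    return max(free_at)

def utilization(runtimes, num_cus):
    """Fraction of total CU time spent doing useful work."""
    return sum(runtimes) / (num_cus * makespan(runtimes, num_cus))

few = [1, 2, 3, 4]     # one work-group per CU, uneven run times
many = few * 100       # same mix of run times, 400 work-groups

print(utilization(few, 4))   # 0.625
print(utilization(many, 4))
```

With one work-group per CU, three of the four CUs wait for the slowest work-group. With 400 work-groups, the idle time at the end is small compared to the total run, so utilization gets close to 1.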

Altera_Forum · ‎12-13-2017

I understand.

In Altera's OpenCL, loops are pipelined for single work-item (task) kernels. But for NDRange kernels, when a work-group with many work-items runs on one PE (one PE means no SIMD), and likewise at the higher level when multiple work-groups run on a CU, how is the parallelism implemented: pipelining or multi-threading? I guess it is pipelining, but I am not sure. Do you have any detailed document about this? I could not find any details in the "Programming Guide" and "Best Practices Guide" manuals.

Altera_Forum · ‎12-14-2017

There is no explicit multi-threading per CU unless you use SIMD. In other words, without SIMD, two work-items from the same work-group will never enter the CU pipeline in the same clock cycle, simply because there is just one pipeline; the work-items are pipelined instead.

With multiple CUs, however, you can assume that you have some high-level multi-threading, since there will be work-items from different work-groups running in parallel in different CUs.
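
As a back-of-the-envelope sketch of why a pipelined CU wants many work-items in flight: in an idealized pipeline, the first work-item pays the full pipeline latency, and each following work-item completes one issue interval later. The depth of 100 stages and the names `pipeline_cycles` and `efficiency` are hypothetical values chosen for illustration; a real kernel pipeline can also stall on memory accesses, which this model ignores.

```python
def pipeline_cycles(num_work_items, depth, ii=1):
    """Cycles to push work-items through an idealized pipeline.

    depth: pipeline latency in cycles (hypothetical value).
    ii:    cycles between consecutive work-items entering the pipeline.
    """
    if num_work_items == 0:
        return 0
    return depth + (num_work_items - 1) * ii

def efficiency(num_work_items, depth, ii=1):
    """Achieved rate relative to the ideal of one work-item per ii cycles."""
    return num_work_items * ii / pipeline_cycles(num_work_items, depth, ii)

for n in (1, 256, 4096):
    print(n, pipeline_cycles(n, depth=100), efficiency(n, depth=100))
```

Under these assumptions, a single work-item occupies the pipeline for 100 cycles, while 4096 work-items need 4195 cycles in total, so the fill latency is amortized almost completely.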

There is little to no info about the inner workings of Altera's compiler and how the circuit is implemented, other than what exists in the two guides you mentioned. Anything I say here is based on my own understanding of the compiler after experimenting with many kernels over the past few years.

Altera_Forum · ‎12-14-2017

Thanks a lot for sharing your experience.