Hi,

Is there a way to limit the number of simultaneously executing threads for an NDRange kernel? My NDRange kernel has a high thread capacity (127 simultaneous threads) and uses local memory. I suspect that the high thread count is one of the reasons the local memories are being replicated several times (as the report says). Is there an "elegant" way of limiting the number of concurrent (pipelined) threads so that the compiler reduces the memory usage? Right now the compiler replicates hardware like mad (even to more than 2000%, as reported by the early estimator).

My current work-around is to introduce a barrier at the end of the outer-loop iteration. It does not reduce the "thread capacity" reported by the early estimator, but it does effectively reduce the memory replication factor.

Best Regards
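For reference, a minimal sketch of the barrier work-around described above; the kernel name, buffer sizes, and loop body are placeholders, not my actual kernel:

```c
// Hypothetical NDRange kernel; only the barrier placement reflects the
// work-around described in the post.
__kernel void outer_loop_kernel(__global const int *restrict in,
                                __global int *restrict out)
{
    __local int scratch[64];
    const int gid = get_global_id(0);
    const int lid = get_local_id(0);

    for (int i = 0; i < 16; i++) {
        scratch[lid] = in[gid * 16 + i];
        // Barrier at the end of each outer-loop iteration: it does not
        // change the reported "thread capacity", but it keeps work-items
        // in lock-step, which in practice lowered the local-memory
        // replication factor.
        barrier(CLK_LOCAL_MEM_FENCE);
        out[gid * 16 + i] = scratch[lid];
    }
}
```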
The thread capacity in the report is basically just the length of the pipeline for that specific block, and it shows the maximum number of threads that can be in flight simultaneously in that block. It doesn't mean that many threads will actually be in flight, nor that local memory buffers are replicated by that factor. The local memory replication factor depends on the number of read and write accesses to the local memory block and on the number of work-groups the compiler decides to run simultaneously. Apparently the latter is the total length of all the pipelines in the kernel divided by the work-group size, which sometimes ends up being an absurd number on the order of tens or even a hundred. If you post the information the report gives you about why and how many times your local memory buffers are replicated, it will be easier to find a way to reduce it.
ap_uint<32> lra_regs;
ap_uint<32> lra_big_regs;

I get the following feedback from the early report for lra_big_regs and lra_regs:

Requested size: 5460 bytes
Implemented size: 2080768 bytes
Number of banks: 16
Bank width: 32 bits
Bank depth: 128 words
Total replication: 254
Additional information:
- Replicated 254 times to efficiently support multiple simultaneous workgroups
- Running memory at 2x clock to support more concurrent ports

It correctly reports the requested size, but it then replicates the memories by a factor of 254, ending (in this case) with a BRAM estimate of 180%. It looks like the compiler (trying to increase performance) is not aware that it is replicating far too much hardware, and there is no way to stop it from doing so.
This is exactly what I was expecting: the compiler is being stupid and trying to pipeline 254 work-groups in the same compute unit just to keep the pipeline full, and this behavior has not changed since v15.1. I usually see a factor below 10, but I have seen cases with over 100. The compiler does actually check for Block RAM over-utilization; however, if it detects over-utilization, it starts sharing ports rather than reducing the work-group pipelining, which is not actually necessary in many cases. The v16 documentation had a note explicitly saying this factor cannot be controlled; they removed that note in later versions.

My recommendation: since Intel is clearly refusing to either fix this useless extra replication or give the user control over it, which, as you can see, causes a lot of trouble in many cases, please open a ticket with Intel, post your kernel, and complain to them so that they might eventually consider adding an attribute for it. In fact, I have been planning to open a ticket for this exact issue myself over the past few days but haven't found the time yet. The more people complain about the same thing, the higher the chance they will fix it.

As a work-around, if you don't need SIMD, using max_work_group_size instead of reqd_work_group_size will reduce the number of simultaneous work-groups to 2 or 3. ...
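To make that work-around concrete, here is a sketch with a hypothetical kernel (max_work_group_size is the Intel/Altera FPGA-specific kernel attribute; reqd_work_group_size is the standard OpenCL one):

```c
// Fixing the group size exactly lets the compiler pipeline many
// work-groups per compute unit (and replicate local memory accordingly):
__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void with_reqd(__global int *restrict data)
{
    __local int scratch[64];
    scratch[get_local_id(0)] = data[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);
    data[get_global_id(0)] = scratch[63 - get_local_id(0)];
}

// Only capping the size: in my experience the compiler then pipelines
// far fewer simultaneous work-groups (around 2 or 3). Note that SIMD
// (num_simd_work_items) requires reqd_work_group_size, so this only
// works if you don't need SIMD:
__attribute__((max_work_group_size(64)))
__kernel void with_max(__global int *restrict data)
{
    __local int scratch[64];
    scratch[get_local_id(0)] = data[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);
    data[get_global_id(0)] = scratch[63 - get_local_id(0)];
}
```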