Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)

Bandwidth limitations and the resource driven optimizer

Altera_Forum
Honored Contributor II
1,507 Views

Hi all, 

 

I'm implementing a simple vector add kernel, and the resource-driven optimizer does not seem to take advantage of the extra logic, registers, and blocks on the chip. I understand that the kernel itself is bandwidth-bound, so extra compute would not help performance, but I'm still curious how the "throughput" estimate is calculated, and why the optimizer would ever limit compute resources when logic is available. 

 

Thanks in advance.
2 Replies
Altera_Forum
Honored Contributor II
742 Views

The throughput measurement would take a while to explain fully, but it is meant to represent the best-case work-item retirement rate. If -O3 didn't fill the chip with more SIMD lanes or compute units, I suspect the throughput estimate did not increase, so the compiler did not bother adding the extra hardware. Generating more hardware for no performance gain just increases your compile time with nothing to show for it.

In the case of vector add, the kernel only adds two numbers together, so throwing more SIMD vector lanes at the problem shouldn't help: it is already limited by global memory bandwidth (and if widening the SIMD didn't help, that means the compiler had already vectorized the kernel automatically through memory coalescing).

Adding more compute units would again be memory-limited, and could actually hurt performance. Each copy of the kernel has its own load/store units, so more compute units also means more load/store units fighting over the same memory bandwidth. With one compute unit you have one load unit and one store unit accessing memory sequentially; with multiple compute units they all do the same thing, but each accesses different locations in memory, which is a far less ideal access pattern.

0 Kudos
Altera_Forum
Honored Contributor II
742 Views

Thanks! That was very helpful.
