Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)

Bandwidth limitations and the resource driven optimizer

Altera_Forum
Honored Contributor II
1,507 Views

Hi all, 

 

I'm implementing a simple vector add kernel, and the resource-driven optimizer does not seem to take advantage of the extra logic, registers, and blocks on the chip. I understand that the kernel itself is bandwidth-bound, so extra compute would not help performance, but I'm still curious how the "throughput" estimate is calculated, and why the optimizer would ever limit compute resources when logic is available. 

 

Thanks in advance.
2 Replies
Altera_Forum
Honored Contributor II
742 Views

The throughput measurement would take a while to explain fully, but it is meant to represent the best-case work-item retirement rate. If -O3 didn't fill the chip with more SIMD lanes or compute units, I suspect the throughput estimate did not increase, so the compiler did not bother adding the extra hardware. Generating more hardware for no performance gain just increases your compile time with nothing to show for it.

In the case of vector add, the kernel only adds two numbers together, so throwing more SIMD vector lanes at the problem shouldn't help: it is already limited by global memory bandwidth (and if widening the SIMD didn't help, that means the compiler had already vectorized the kernel automatically through memory coalescing).

Adding more compute units would again be memory-limited, and could actually hurt performance. Each copy of the kernel has its own load/store units, so more compute units also means more load/store units fighting over the same memory bandwidth. With one compute unit you have one load unit and one store unit accessing memory sequentially; with multiple compute units they all do the same thing, but each accesses different locations in memory, which is a far less ideal access pattern.

0 Kudos
Altera_Forum
Honored Contributor II
742 Views

Thanks! That was very helpful.
