
Question about M20K block packing

DimitrisGrn
Beginner

I am using Intel Quartus Prime 21.1, targeting the Stratix 10 MX 2100 device.

I have several read/write Avalon memory-mapped interfaces from a Load/Store unit that are connected to True Dual-Port RAMs. I am using double-buffering, so each interface is connected to two such RAMs through a simple demultiplexer interconnect that sits between the Load/Store unit and the RAMs. The RAMs are implemented using the "On-Chip Memory (RAM or ROM) IP" from Platform Designer. The ports of the RAMs are 32 bits wide and each RAM is 16384 bytes. My simple design has only 28 such RAMs for now.
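
For clarity, the demultiplexer does roughly the following (a minimal Verilog sketch of one interface and its two buffers; names like buf_sel and lsu_* are made up, only one RAM port is shown, and read-latency handling is omitted — the actual interconnect is generated by Platform Designer):

module dbuf_demux (
    input  wire        buf_sel,        // selects the active buffer
    // Avalon-MM interface from the Load/Store unit
    input  wire [11:0] lsu_address,    // 4096 words of 32 bits
    input  wire        lsu_write,
    input  wire [31:0] lsu_writedata,
    output wire [31:0] lsu_readdata,
    // Port A of the two On-Chip Memory instances
    output wire [11:0] ram0_address, ram1_address,
    output wire        ram0_write, ram1_write,
    output wire [31:0] ram0_writedata, ram1_writedata,
    input  wire [31:0] ram0_readdata, ram1_readdata
);
    // Address and write data fan out to both buffers
    assign ram0_address   = lsu_address;
    assign ram1_address   = lsu_address;
    assign ram0_writedata = lsu_writedata;
    assign ram1_writedata = lsu_writedata;
    // Writes and read data are steered by the buffer-select bit
    assign ram0_write     = lsu_write & ~buf_sel;
    assign ram1_write     = lsu_write &  buf_sel;
    assign lsu_readdata   = buf_sel ? ram1_readdata : ram0_readdata;
endmodule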

Since M20K blocks with 32-bit wide ports are configured in 512x32 mode, a total of 8 M20K blocks is needed to implement each RAM. This leads to 80% utilization of the available block memory bits, as 100% utilization requires the 512x40 operating mode.
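
For reference, the arithmetic behind these numbers:

bits per RAM   = 16384 bytes x 8 = 131072 bits (4096 words x 32 bits)
M20Ks per RAM  = 4096 words / 512 words per block (512x32 mode) = 8 blocks
bits available = 8 x 20480 = 163840 bits
utilization    = 131072 / 163840 = 80% (512x40 mode would use all 20480 bits per block)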

Nonetheless, the compiler is able to optimize the M20K packing and allocate 8 M20Ks for some of the RAMs, but fewer for the others, boosting block memory bit utilization to roughly 99%.

However, if I add a pipeline stage for the interface signals between the Load/Store unit and the RAMs (more specifically, between the Load/Store unit and the demultiplexer interconnect), the compiler uses 8 M20Ks for all RAMs, dropping block memory bit utilization back down to 80%.
My assumption is that the Fitter does this in order to improve timing.

I tried to force a synthesis setting that caps the maximum number of M20Ks at the count used before adding the pipeline stage, but it gets ignored by the Fitter.

Do you know of a way that I can control this packing and guide the compiler to always try and maximize M20K block memory bit utilization? Saving this significant number of M20Ks will greatly help me in fitting my final design.

Kenny_Tan
Moderator

Hi,


Usually, RAM inference only works when you follow the coding styles described here: https://www.intel.com/content/www/us/en/docs/programmable/683082/22-1/inserting-hdl-code-from-a-provided-template.html. If you code the memory differently, the tool will not be able to infer the RAM.
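
For example, a true dual-port, single-clock RAM along the lines of those templates looks roughly like this (a sketch only, not the exact template text; see the link above for the provided templates):

module true_dual_port_ram
#(
    parameter DATA_WIDTH = 32,
    parameter ADDR_WIDTH = 12
)
(
    input  wire                  clk,
    input  wire [ADDR_WIDTH-1:0] addr_a, addr_b,
    input  wire [DATA_WIDTH-1:0] data_a, data_b,
    input  wire                  we_a, we_b,
    output reg  [DATA_WIDTH-1:0] q_a, q_b
);
    // Inferred as block RAM (M20K) by Quartus synthesis
    reg [DATA_WIDTH-1:0] ram [0:(1<<ADDR_WIDTH)-1];

    // Port A: write-first read-during-write behavior
    always @(posedge clk) begin
        if (we_a) begin
            ram[addr_a] <= data_a;
            q_a <= data_a;
        end else
            q_a <= ram[addr_a];
    end

    // Port B: same behavior as port A
    always @(posedge clk) begin
        if (we_b) begin
            ram[addr_b] <= data_b;
            q_b <= data_b;
        end else
            q_b <= ram[addr_b];
    end
endmodule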


Have you also tried using the IP instead of writing the code?


Thanks


DimitrisGrn
Beginner

Thank you for your reply.

 

My issue is not that the tool fails to infer the RAMs properly. I am using Intel's "On-Chip Memory (RAM or ROM) IP", which is found in Platform Designer.

The issue is that, if I use pipeline stages and interconnects to connect different agents to different SRAMs, the M20K packing used to implement these SRAMs has an efficiency of 80% with 32-bit wide ports, i.e. 20% of the available M20K capacity is wasted.

I noticed that the efficiency was nearly 100% when I had no pipeline stages and only one agent connected to each SRAM.
Efficiency dropped to 80% when multiple agents connect to each SRAM (i.e. an interconnect is needed) and/or when I add pipeline stages between the agent and the SRAM ports.

Shuo_Zhang
Employee

Hi,

When the software implements the RAM instances on M20Ks, it merges several instances into one M20K when a single RAM instance does not fully occupy one M20K block. But things are different when you enable pipeline registers, because the port and port-register resources on one M20K are limited. If the number of ports and port registers on one M20K is insufficient to support the combined implementation of several RAM instances, the software will split the RAM instances into different M20K blocks.

If the M20K resources are not enough, please consider using the MLAB resources as well. When both the M20K and MLAB resources are exhausted, it would be hard to close timing; in that case, please consider reducing the scale of your design.
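
For an inferred memory, you can request MLAB cells with the ramstyle synthesis attribute. A minimal Verilog sketch (the module and signal names here are only an example; for the On-Chip Memory IP, the memory block type parameter in its parameter editor serves the same purpose):

module mlab_buffer (
    input  wire        clk,
    input  wire        we,
    input  wire [4:0]  waddr, raddr,
    input  wire [31:0] wdata,
    output reg  [31:0] rdata
);
    // ramstyle asks Quartus synthesis to place this memory in MLABs
    // instead of M20K blocks
    (* ramstyle = "MLAB" *) reg [31:0] mem [0:31];

    always @(posedge clk) begin
        if (we)
            mem[waddr] <= wdata;
        rdata <= mem[raddr];  // registered read port
    end
endmodule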


Best Regards,

Shuo

