Re:Intel HLS pipeline::lsu won't dispatch more than 8 request

AUT · ‎04-13-2024

I'm creating a design that loads data from HBM through an Avalon memory-mapped interface. Because HBM does not support typical bursting, I need to dispatch as many individual requests as possible to get the full bandwidth. Since I already use the full native data width and can't burst, I decided to use a pipelined LSU to reduce resource utilization; the burst-coalesced LSU instantiates many features I can't use, which wastes space.

lsu<style<PIPELINED>, static_coalescing<false>>;

However, whenever I run my component it dispatches only 8 read requests before stalling until it receives a response. I have checked with Signal Tap, and the incoming waitrequest signal is never asserted. This clearly indicates that the behavior is internal and caused by the HLS compiler. I cannot figure out how to change this behavior, and there doesn't seem to be anything in the documentation about it.

From what I can gather, HLS seems to be arbitrarily instantiating the LSU with a FIFO size of 8. This behavior appears to be controlled by the Verilog parameter KERNEL_SIDE_MEM_LATENCY when an LSU is created. The value is hard-coded in one of the generated files, and I don't want to change it manually every time I resynthesize my design. I also do not know whether other modules will behave undesirably if I increase this FIFO size.
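The stall described above can be reproduced with a toy host-side model. The following is a minimal sketch in plain C++, not Intel HLS code: the function name `reads_issued`, the fixed response latency, and the one-read-per-cycle issue rate are all assumptions made for illustration, and the model says nothing about how the generated LSU is actually built. It only shows that a cap on in-flight reads throttles the issue rate even when waitrequest is never asserted.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Toy model (hypothetical, not the generated LSU): try to issue one read per
// cycle, but never keep more than `max_outstanding` reads in flight. Every
// read returns exactly `latency` cycles after it was issued.
// Returns how many reads were issued during `cycles` cycles.
std::uint64_t reads_issued(std::uint64_t cycles, std::uint64_t latency,
                           std::uint64_t max_outstanding) {
    assert(latency > 0);
    std::deque<std::uint64_t> return_cycle;  // completion time of each in-flight read
    std::uint64_t issued = 0;
    for (std::uint64_t now = 0; now < cycles; ++now) {
        // Retire every read whose response has arrived by this cycle.
        while (!return_cycle.empty() && return_cycle.front() <= now) {
            return_cycle.pop_front();
        }
        // Issue a new read only if the in-flight limit allows it.
        if (return_cycle.size() < max_outstanding) {
            return_cycle.push_back(now + latency);
            ++issued;
        }
    }
    return issued;
}
```

With an assumed latency of 100 cycles, `reads_issued(10000, 100, 8)` gives 800, while `reads_issued(10000, 100, 100)` gives 10000: the shallow limit leaves the interface idle for most of each round trip.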

Is there an easy way for me to tell the LSU to make its FIFO bigger without having to modify the underlying Verilog? I know I could use bursting with a burst adapter, which would let me use a typical burst-coalesced LSU, but I would like to avoid adding unnecessary components and adapters to my design.

Below is an example of the LSU that HLS instantiates:

lsu_top #(
        .ABITS_PER_LMEM_BANK(0),
        .ADDRSPACE(1025),
        .ALIGNMENT_BYTES(64),
        .ALLOW_HIGH_SPEED_FIFO_USAGE(0),
        .ASYNC_RESET(0),
        .ATOMIC(0),
        .ATOMIC_WIDTH(3),
        .AVM_READ_DATA_LATENESS(0),
        .AVM_WRITE_DATA_LATENESS(0),
        .AWIDTH(32),
        .BURSTCOUNT_WIDTH(1),
        .ENABLE_BANKED_MEMORY(0),
        .FORCE_NOP_SUPPORT(0),
        .HIGH_FMAX(1),
        .INPUTFIFO_USEDW_MAXBITS(5),
        .KERNEL_SIDE_MEM_LATENCY(7),
        .LMEM_ADDR_PERMUTATION_STYLE(0),
        .MEMORY_SIDE_MEM_LATENCY(0),
        .MWIDTH_BYTES(64),
        .NUMBER_BANKS(1),
        .PROFILE_ADDR_TOGGLE(0),
        .READ(1),
        .STALLFREE(0),
        .STYLE("PIPELINED"),
        .SYNCHRONIZE_RESET(0),
        .USECACHING(0),
        .USEINPUTFIFO(0),
        .USEOUTPUTFIFO(1),
        .USE_BYTE_EN(0),
        .USE_STALL_LATENCY(0),
        .USE_WRITE_ACK(0),
        .WIDE_DATA_SLICING(0),
        .WIDTH_BYTES(64),
        .WRITEDATAWIDTH_BYTES(64)
    ) thei_llvm_fpga_mem_a1_all_buff_sroa_0_0_copyload1_ld_unit5121 (

AUT · ‎04-18-2024

Some follow-up information: I tried increasing KERNEL_SIDE_MEM_LATENCY to 63, and the number of dispatched requests did increase. However, it increased to 41 rather than 64. Again, this is internal to the HLS module, as there is no incoming waitrequest signal. This really confuses me, as I figured any internal limit would be a power of 2. Additionally, this doesn't seem consistent: someone I am working with reported that they could only get 10 requests to dispatch when modifying a slightly different design with a pipelined LSU.

I would really appreciate some help with getting the pipelined LSU to work, as using a burst interface increases usage by ~2-4x for most resources and by ~40x for M20K blocks. As the number of channels scales to saturate all the HBM channels, this will begin to waste non-negligible amounts of resources, impacting our final design performance. Given the Stratix's already limited amount of M20K, that waste is really making things difficult.

BoonBengT_Altera · ‎04-22-2024

Hi @AUT,

Thank you for posting in the Intel community forum. I hope all is well, and apologies for the delay in response.

May I know which Intel HLS version you are working with, and which devices are involved?

Also, regarding the instantiation example you mentioned, which example design are you referring to?

Hope to hear from you soon.

Best Wishes

BB

AUT · ‎04-22-2024

Hi @BoonBengT_Altera

I am using HLS 21.1 with a Stratix 10 MX developer kit. Sorry if I misspoke: it isn't an example design, it is an example of the LSU that HLS generates for my design. The module I posted is from the Verilog that HLS generates. It is generated from a standard mm_master interface with pipelined LSU transfers specified.

BoonBengT_Altera · ‎04-29-2024

Hi @AUT,

Noted on the version used and device involved, thanks for the explanation.

Based on the explanation and request, I would recommend perhaps looking at another type of LSU, the burst-coalesced LSU.

It is a larger and more robust unit intended to utilize the memory bandwidth more efficiently. You may refer to a fuller explanation of the coalesced LSU type at the link below:

- https://www.intel.com/content/www/us/en/docs/programmable/683152/24-1/control-lsus-for-your-variable-latency.html

There are also best practices and sample code demonstrating LSU control that come with the HLS installation. You may find them under the following path:

- <quartus install directory>\hls\examples\tutorials\best_practices\lsu_control

Hope that clarifies.

Best Wishes

BB

AUT · ‎04-29-2024

Having to instantiate a significantly more complex LSU isn't a helpful solution. Doing the math, we won't be able to complete our design with HLS if we have to use burst-coalesced LSUs. Our design would use none of the features of the burst-coalesced unit. It seems like there should be a way to tell HLS that I want a deeper pipelined LSU without wasting copious amounts of resources.

We are starting to switch to Verilog at this point since there doesn't seem to be an answer to this question. If anyone has a solution to this we would greatly appreciate it as having to rewrite our design in Verilog is significantly affecting our timeline.

justin-rosner · ‎05-09-2024

Hi @AUT ,

Unfortunately at this time, without specifically modifying the generated Verilog (i.e. updating the KERNEL_SIDE_MEM_LATENCY so that the instantiated FIFO is larger), there is no way to increase the capacity of the FIFO associated with the pipelined LSU. What is the desired number of dispatch requests that you are trying to achieve?

AUT · ‎05-09-2024

Hi @justin-rosner

Thank you for letting me know; we will continue development of our Verilog-based design. Given the high latency of HBM, I currently need around 128 outstanding requests. The best I can get, even after raising KERNEL_SIDE_MEM_LATENCY, is 41 requests, so I think other, more involved changes to the generated Verilog would be needed to exploit the full depth.
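To put the 128-versus-41 gap in perspective, here is a rough back-of-the-envelope sketch in plain C++ (not Intel HLS code). It assumes a steady state with a fixed round-trip latency, so that the sustained request rate scales linearly with the number of reads allowed in flight until the needed depth is reached. The function name `sustained_fraction` is hypothetical, and real HBM latency varies, so treat the result as an estimate only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Estimated fraction of the peak request rate that can be sustained when at
// most `max_outstanding` reads may be in flight and `needed_outstanding`
// reads are required to fully hide the memory latency (steady-state
// assumption, fixed latency).
double sustained_fraction(std::uint32_t max_outstanding,
                          std::uint32_t needed_outstanding) {
    assert(needed_outstanding > 0);
    const double ratio =
        static_cast<double>(max_outstanding) / static_cast<double>(needed_outstanding);
    return std::min(1.0, ratio);  // cannot exceed the peak rate
}
```

Using the numbers from this thread, `sustained_fraction(8, 128)` is 0.0625 and `sustained_fraction(41, 128)` is roughly 0.32, so under these assumptions even the modified Verilog would reach only about a third of the target request rate.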

If possible, in a future release I think a feature similar to Xilinx's num_read_outstanding would be useful, especially for HBM/DDR designs.

Best,

Austin

BoonBengT_Altera · ‎05-07-2024

Hi @AUT,

Noted that the coalesced LSU type may not be an option for your design.

Also, as previously mentioned, you are currently using the pipelined LSU, which is described under the following unit type:

- https://www.intel.com/content/www/us/en/docs/programmable/683349/24-1/load-store-unit-types.html#cut1573320429345__neverstall-pipelined

Unfortunately, at this point in time there is no option to increase the size of the pipelined LSU.

Best Wishes

BB

BoonBengT_Altera · ‎05-12-2024

Hi @AUT,

Noted with thanks on the additional feature mentioned; we will take that into consideration for our latest product. As there has been no further clarification on this thread, it will be transitioned to community support for further help. Please log in to ‘https://supporttickets.intel.com’, view the details of the desired request, and post a feed/response within the next 15 days to allow me to continue to support you. After 15 days, this thread will be transitioned to community support.

Thank you for the questions; as always, it is a pleasure having you here.

Best Wishes

BB
