Programmable Devices
CPLDs, FPGAs, SoC FPGAs, Configuration, and Transceivers

SGDMA data transfer question.

Altera_Forum
Honored Contributor II
2,419 Views

Hi,

I ran a memory-to-memory transfer test from on-chip RAM to SDRAM using the SGDMA, measuring the speed with the performance counter core. It reports a transfer rate of up to 370 MB/s with all content copied correctly, which seems too good to be true!

When I used the same code for a RAM-to-SRAM transfer, the rate only reached 50 MB/s!

Why? I thought SRAM should be faster than SDRAM. I was testing on a DE2-115; is the SRAM on the DE2-115 really slower than the SDRAM?

PS: Both cases were running at 100 MHz (the same clock as the CPU).

Regards,

Michael
13 Replies
Altera_Forum
Honored Contributor II

If that's asynchronous SRAM (i.e. not SSRAM) then you will not be able to read/write to the SRAM every clock cycle. Look at the SRAM datasheet timing to see what I mean.

Altera_Forum
Honored Contributor II

What about my SDRAM speed? Is 370 MB/s even possible? That's practically the same transfer rate I get for RAM-to-RAM.

Altera_Forum
Honored Contributor II

This kind of speed is possible with SDRAM if you are using bursts or consecutive accesses, which is usually the case when transferring from a DMA. 

If you had random transfers, then asynchronous SRAM could become faster than SDRAM in some cases.
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

This kind of speed is possible with SDRAM if you are using bursts or consecutive accesses, which is usually the case when transferring from a DMA. 

If you had random transfers, then asynchronous SRAM could become faster than SDRAM in some cases. 

--- Quote End ---  

 

 

 

By random transfers, do you mean data being accessed one word at a time (non-burst)?

 

 

Michael
Altera_Forum
Honored Contributor II

By random he means jumping around in the address space of the SRAM. SDRAMs are good for sequential, bursting traffic (high throughput), and SRAM is good for low-latency random accesses. SSRAMs are somewhere in between, handling reasonably high throughput as well as low-latency random accesses.

 

With a long sequential DMA transfer you should be able to hit 400MB/s at 100MHz on a 32-bit local interface.
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

 

With a long sequential DMA transfer you should be able to hit 400MB/s at 100MHz on a 32-bit local interface. 

--- Quote End ---  

 

 

Hi BadOmen, will I get this throughput if I do a transfer from BRAM to BRAM? Once my application is ready, I will replace one of the BRAMs with DDR memory.

Hi Michael, are you using the HAL APIs for the SGDMA transfer from RAM to SDRAM? The numbers are very good.

 

Regards
Altera_Forum
Honored Contributor II

I assume by BRAM you mean on-chip memory. On-chip RAM is both high speed and low latency, so you won't find a more efficient memory-to-memory transfer in an FPGA than between two on-chip memories. For example, if I moved 50000 words of data between two independent on-chip memories, and assuming the DMA is the only thing accessing them, you can expect a transfer time of approximately:

 

2 cycles (on-chip RAM latency) + ~4 cycles (latency through the DMA) + 50000 cycles (one word per cycle)

 

With an SDRAM that equation becomes (assuming two SDRAMs are involved): 

 

12+ cycles (SDRAM latency) + ~4 cycles (latency through the DMA) + 50000 / ~0.95 cycles (SDRAM addressing and low-level command overhead cost roughly 5% of the cycles)

 

So if you replace one of the on-chip memories with SDRAM, it's the SDRAM that will be the limiting factor.
Altera_Forum
Honored Contributor II

I have a 32-bit data path between the SGDMA and the on-chip memories, and all components in my system run at 100 MHz. With a burst count signal width of 4, I was getting a throughput of around 380 MB/s. Hoping to increase it further, I increased the burst count signal width to 16, and my throughput dropped to 200 MB/s.

Can the burst count signal width have a negative impact on throughput?
Altera_Forum
Honored Contributor II

SDRAM is a synchronous burst device, so with this SGDMA method the transfer runs at the speed of the SDRAM controller's driving clock. Divide your 370 MB/s figure by 4 bytes per word and you get approximately 92.5 MHz, slightly less than 100 MHz; the difference comes from the row and column address latch cycles and the read latency cycles. The SOPC system can also perform multi-page transfers (256 reads or writes per page), so I recommend enabling the page mode of the SGDMA to get the highest efficiency.

As for the SRAM speed: PLD devices are poorly suited to asynchronous memory interfaces. The SOPC system implements random access to the SRAM at roughly 1/3 of the SRAM controller's clock, so with your 100 MHz clock only about 33 MHz is available, and apparently you are only achieving about 25 MHz. I think that explains your result well (you can confirm it with the SignalTap II tool).

In fact, you could get a maximum random-access rate of up to 100 MHz if the asynchronous SRAM is a 10 ns device, or at least up to 80 MHz. It is really a waste to put asynchronous SRAM on an SOPC system.
Altera_Forum
Honored Contributor II

I recommend you look at pipelined or flow-through SSRAM (Cypress, ISSI) to achieve higher transfer efficiency, but those devices are a bit expensive.

Altera_Forum
Honored Contributor II

On-chip memory ought to be able to read/write a location every clock, although read data isn't available until the following clock cycle. That certainly happens for the Nios 'tightly coupled data memory'. It would require Avalon burst accesses.

Were you measuring SDRAM reads or writes? They behave differently.

My experiments suggest that writes are 'posted' (i.e. acked immediately) unless the logic is busy. The first write is then actioned; subsequent writes are held in a 'line buffer' (probably 32 bytes, maybe 64) provided they address adjacent locations. When the underlying write completes, the contents (if any) of the line buffer are written out. So writes only stall if they address a location that can't be buffered.

Reads fetch an entire line buffer and then return the requested location. Further reads of nearby addresses return data from the buffer. Fully random reads take about 16 clocks.
Altera_Forum
Honored Contributor II

Bursting can potentially have two negative effects:

1) In SOPC Builder, burst adaptation was inefficient, especially with small burst sizes. The burst adapter always inserted a single dead cycle at the start of a burst, so with a burst count of 2 you get at best 3 clock cycles per burst. Qsys doesn't have this limitation; it's just an SOPC Builder thing.

2) With DMAs (not sure about the one you are using), the DMA engine typically waits for a full burst to be read and buffered before issuing the burst write. The bigger the burst length, the longer the write master has to wait before issuing the burst.

Assuming you are using a modern SDRAM controller from Altera, you can change the local burst length of the slave port. I typically turn that down to 1 and let the controller combine sequential accesses into a single off-chip burst. Then you don't need to worry about enabling bursting in the master logic or any burst adapters that might be created as a result.
Altera_Forum
Honored Contributor II

Could you share your code so I can learn about the SGDMA?
