
SGDMA to SDRAM speed?

Altera_Forum
Honored Contributor II
4,098 Views

[EDIT] SOLVED

Hello, 

I'm trying to develop an accelerator for a computer vision algorithm. The host program runs on the Intel Atom on the DE2i-150 board and sends images to the FPGA for processing.

The problem is that I can't get more than 3 to 10 MB/s of throughput, which is quite low, since PCI Express should be able to deliver about 250 MB/s and the SDRAM should be even faster.

My Qsys system is pretty simple; you can see it in the attachment.

I would appreciate it if someone could help me with this. Is this a common issue? Is it possible to get more speed, something like 90 MB/s? If so, how could I accomplish that?

 

Many thanks.
31 Replies
Altera_Forum
Honored Contributor II
1,105 Views

The qsys diagram is really small. Can you make a better image?

Altera_Forum
Honored Contributor II
1,105 Views

Hello, 

I don't know why the forum keeps resizing my image. And if I post a link, a moderator must approve it, which I think will take a very long time.

If you don't mind, I've uploaded the image to the cloud. I hope there is no problem with posting it here like this:

ht tp://1drv .ms/1OVVaw3 

 

Just copy and paste it and delete the spaces.
Altera_Forum
Honored Contributor II
1,105 Views

At a quick glance, it looks like the DMA and the SDRAM are on different clocks. It may be best to place a dual-clock FIFO between the SGDMA and the SDRAM.

Altera_Forum
Honored Contributor II
1,105 Views

Hello, 

Thanks for your answer. 

Yes, they do have different clocks. I looked into the solution you proposed, but the dual-clock FIFO uses Avalon-ST, so its input is a streaming sink and its output is a streaming source. I'm still a newbie at this, so I'm not sure how I should connect it between my SGDMA and the SDRAM. Can you give me some tips?

I've now uploaded a better screenshot, along with my Qsys design in case you or anybody else would like to take a closer look. I don't intend to delete it, so if we solve this, I will upload the corrected version for other people to download.

 

(just paste and delete the spaces again if you're willing to take a look at this) 

ht tp://1dr v.ms/1E91SZd
Altera_Forum
Honored Contributor II
1,105 Views

What you can do is change your SGDMA to memory-to-stream, feed it into a DC FIFO, and then use a second DMA for stream-to-memory. This is easier with the Modular SGDMA, which breaks out the descriptor control, write control, and read control into separate blocks. I think you can then get away with one descriptor controller, versus two if you use Altera's standard SGDMA.

Altera_Forum
Honored Contributor II
1,105 Views

The Modular SGDMA should be included in Qsys now. Otherwise, here is the link: http://www.alterawiki.com/wiki/modular_sgdma 

In the diagram at that link, the Read Master and Write Master are connected through Avalon-ST. Place the DC FIFO in that path.
Altera_Forum
Honored Contributor II
1,105 Views

Hello, 

That's nice and looks promising, thank you, I will try this out as soon as possible. Hope this will solve my throughput problem.
Altera_Forum
Honored Contributor II
1,105 Views

Put the DMA descriptors into on-FPGA memory.

Make sure the DMA controller is doing Avalon burst transfers (probably 128 bytes, preferably 64 bits wide) into the PCIe txs port.

You probably don't want burst transfers into the SDRAM, just pipelined ones.

Beware of large FIFOs in the DMA controller; you don't need them.

The PCIe txs block seems to complete write transfers quickly. I'm seeing reads take 128 clocks (of the 62.5 MHz application clock) plus a few clocks for the transfer size.

The same is true of host-initiated transfers: writes are 'posted' and happen more or less back to back, but there is a 128-clock delay between reads.

The only way to speed up DMA reads from host memory (once you are generating Avalon bursts and thus long PCIe TLPs) is to generate concurrent read requests from multiple Avalon masters.
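
As a rough worked example of the numbers above, here is a small Python sketch. It assumes each read request costs the full 128 clocks at 62.5 MHz, ignores the few extra clocks for the transfer itself, and treats concurrent requests as overlapping perfectly. The function name and the 4-byte and 128-byte request sizes are illustrative assumptions, not measurements.

```python
APP_CLOCK_HZ = 62.5e6       # PCIe application clock quoted in this post
READ_LATENCY_CLOCKS = 128   # observed clocks per read request

def read_throughput_mb_s(bytes_per_request, concurrent_requests=1):
    """Rough upper bound on read throughput in MB/s (1 MB = 1e6 bytes).

    Assumes every request pays the full read latency and that concurrent
    requests from separate masters overlap perfectly.
    """
    latency_s = READ_LATENCY_CLOCKS / APP_CLOCK_HZ
    return bytes_per_request * concurrent_requests / latency_s / 1e6

single_word = read_throughput_mb_s(4)        # one 32-bit word per request
burst = read_throughput_mb_s(128)            # one 128-byte burst per request
two_masters = read_throughput_mb_s(128, 2)   # two masters reading at once
```

Under these assumptions, single-word reads come out at roughly 1.95 MB/s, a 128-byte burst at about 62.5 MB/s, and two concurrent 128-byte bursts at about 125 MB/s, which is why burst length and concurrency matter so much for reads.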
Altera_Forum
Honored Contributor II
1,105 Views

krasner, 

I've tried doing what you said, using the mSGDMA. I think I've done something very wrong, because my board actually stopped working until I reset the BIOS (I even thought I had lost it :( ). Before adding the FIFOs, I built the basic scheme, using the dispatcher and the read and write masters.

Here is the qsys scheme I've done for this: 

ht tp://1dr v.ms/1IpfnYH 

Notice that I've removed the old PLL, and the DMA and SDRAM now use the pcie_core_clk, which provides 125 MHz. I did that because TimeQuest reported that the 150 MHz I wanted could not be met.

I've also uploaded the new Qsys design to the same folder as before, in case someone wants to take a look:

ht tp://1dr v.ms/1E91SZd  

 

dsl, 

Where should I put the DMA descriptors? I'm very new to this, so sorry if that's a stupid question.

Note that I'm not interested in using the on-chip memory, only in speeding up the SDRAM. Also, I've tried using bursts, but I don't understand how to use them. Every time I enable burst transfers on the DMA, my host application can no longer communicate with the FPGA.

Just to clarify, I'm using the Jungo Windows driver provided with the board. It works fine with the DMA scheme I've been using, but I'm getting the slow transfers I've described. Is there any chance that the problem is actually in the driver?

 

Hope you guys still have some patience to help me. 

 

Many thanks.
Altera_Forum
Honored Contributor II
1,105 Views

I think this example is very close to what you want: http://www.alterawiki.com/wiki/pci_express_in_qsys_example_designs. I'm wondering whether you set up the dma_read_master and dma_write_master options correctly.

 

See if this example works for you.
Altera_Forum
Honored Contributor II
1,105 Views

I came across this example when searching for information about the mSGDMA. I basically downloaded the Qsys design provided on that page and adapted it a bit before trying to add the DC FIFO. There were indeed some settings I didn't know how to configure. For example:

Descriptor FIFO Depth on the mSGDMA 

Length Width on read master and write master 

Transfer Types under Memory Access Options on read master and write master
Altera_Forum
Honored Contributor II
1,105 Views

Can you try configuring your DMA settings as in the example? Let us know what happens.

Altera_Forum
Honored Contributor II
1,105 Views

Yes, that was my first try. After that I tried rebuilding everything with some features, such as bursts, disabled, and it kept happening.

I'm starting to give up, but it doesn't make any sense, since I think the PCIe link should be fast enough even if badly configured. The same applies to the SDRAM. That's why I'm still trying.
Altera_Forum
Honored Contributor II
1,105 Views

Don't give up. dsl was talking about descriptors. Do you understand what they do? Maybe that is your bottleneck. If you aren't setting them up properly, the DMA won't process continuously.

Altera_Forum
Honored Contributor II
1,105 Views

Actually, I don't understand much about this. I'm starting out because of the Intel Embedded Systems Competition (http://sbesc.lisha.ufsc.br/sbesc2015/intel+embedded+systems+competition). I've done a LOT of research, but the documentation for these IP modules (PCIe, DMA, and many others) seems to be sparse. The best documentation I've found is the "Video and Image Processing Suite User Guide", which is pretty good and covers Avalon-ST in depth but takes a long time to get to the point.

 

So, any help would be welcome when dealing with this. Long story short, I don't know how descriptors work. :(
Altera_Forum
Honored Contributor II
1,105 Views

This PCIe communication is a huge problem here: everyone I know who is competing has it. Solving it would make a big difference for us, and it seems that I'm the one digging deepest into it.

Altera_Forum
Honored Contributor II
1,105 Views

A descriptor is a set of instructions that the master (in your case the PC) gives the DMA, describing where to access data and how much data to transfer. In a memory-to-memory application like yours, the descriptor should contain the address of the source data (PCIe), the address of the sink for the data (SDRAM), or vice versa, and the number of bytes to transfer in one pass or burst. With the mSGDMA you can write many descriptors into the descriptor buffer (see the dispatcher), and you can command the dispatcher to chain them, so once one descriptor is complete the DMA automatically begins working on the next one. This way you can transfer data continuously.
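
To make the idea concrete, here is a minimal Python sketch that only models the data structure. The field names, the `build_chain` helper, and the example addresses are hypothetical; they do not reflect the actual mSGDMA register layout or any driver API.

```python
from dataclasses import dataclass

@dataclass
class Descriptor:
    read_address: int    # where the DMA reads from (e.g. the PCIe-side buffer)
    write_address: int   # where the DMA writes to (e.g. the SDRAM)
    length: int          # number of bytes to move for this descriptor

def build_chain(src, dst, total_bytes, max_bytes_per_descriptor):
    """Split one large transfer into a list of back-to-back descriptors."""
    if max_bytes_per_descriptor <= 0:
        raise ValueError("max_bytes_per_descriptor must be positive")
    chain = []
    offset = 0
    while offset < total_bytes:
        # The last descriptor may be shorter than the others.
        length = min(max_bytes_per_descriptor, total_bytes - offset)
        chain.append(Descriptor(src + offset, dst + offset, length))
        offset += length
    return chain

# Hypothetical example: move 10,000 bytes in chunks of at most 4,096 bytes.
chain = build_chain(src=0x1000, dst=0x0, total_bytes=10_000,
                    max_bytes_per_descriptor=4096)
```

In this model the host would queue every entry of `chain` before starting the transfer, so the DMA always has a next descriptor ready and never sits idle waiting for the host.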

Altera_Forum
Honored Contributor II
1,105 Views

Maybe this is a good overview: http://www.academia.edu/1741560/pcie_express 

 

This is the same as the document on the alterawiki link
Altera_Forum
Honored Contributor II
1,093 Views

BTW, what board are you using? I assume it's the Cyclone IV GX FPGA development board. This board has DDR2 SDRAM. If so, use this example: https://www.altera.com/support/support-resources/software/download/refdesigns/ip/interface/dnl-pciexpress-ddr3-sdram.html 

 

Go to the link named: Qsys PCI Express to External Memory reference design for Cyclone IV GX 

 

This is probably exactly what you want.