Valued Contributor III
1,511 Views

PCIe bus master random access to all host memory

I'm not sure if the PCI Express MegaCore hard IP can do what I want. I'll describe what I'm trying to achieve. 

 

I'd like to set up a bus-master scatter-gather DMA between on-device memory and a host memory buffer. The host memory buffer is contiguous in virtual address space, but when locked down into 4K-sized physical memory pages, the physical addresses of those pages are randomly scattered all over physical memory - a classic use case for scatter-gather DMA. 

 

I'm not sure how this fits in with the Avalon-MM-to-PCI Express address translation table. Would I have to set up a translation table entry for each 4K page, or could I set up a few translation table entries to cover the whole host memory physical address space (or at least the bottom 3 GB)? I'm more familiar with "PCI to local bus" bridges (e.g. the ones made by PLX Technologies) that have completely separate address spaces on each side of the bridge. 

 

The only example I've got to go on is the WinDriver code for the "PCI Express in Qsys Example" design, but that's nothing like what I want as it allocates a contiguous area of physical memory in the bottom 16 MiB of physical memory.
14 Replies

Valued Contributor III

Hmmm..... I'd have thought that the DMA controller would be able to pass a full 32-bit PCIe-space address, rather than an Avalon-MM slave address of the PCIe controller. 

This is particularly important because you need the PCIe controller to issue single PCIe requests for large blocks (say 128 bytes minimum) in order to get reasonable throughput - which is likely to mean that you need to use a DMA controller embedded in the PCIe interface. 

(That is what we had to do on the pcc when the fpga was a slave...)
Valued Contributor III

To be honest, I'm not entirely sure what the address translation tables are actually translating, as each entry only contains one 64-bit address rather than, say, a PCI address and a matching local address. Since the areas mapped by each table entry are all the same size, I'm guessing that consecutive table entries map to consecutive regions of local address space starting at some local base address. 

 

(EDIT: To avoid any confusion, my device is normally a PCI target except during bus-master DMA operations. I.e. it's like a PCI plug-in peripheral card for a PC.)
Valued Contributor III

 

--- Quote Start ---  

 

 

I'd like to set up a bus-master scatter-gather DMA between on-device memory and a host memory buffer. The host memory buffer is contiguous in virtual address space,  

<...> 

 

I'm not sure how this fits in with the Avalon-MM-to-PCI Express address translation table.<...> 

 

--- Quote End ---  

 

 

That's perfectly possible.  

 

You do not need any (BAR) translation table on the FPGA, because the FPGA is doing SGDMA directly to/from host memory, so there is no BAR access involved. 

 

For a ready-to-go SGDMA solution with drivers, see http://www.lancero-pcie.com/ 

 

Best regards, 

 

Leon Woestenberg
Valued Contributor III

Our design is basically a plug-in PCIe card that is normally a PCI target, but which will support PCI bus-master DMA transfers. We'll be using the PCIe hard-IP on Cyclone IV with the Qsys design flow and the Avalon-MM interface. 

 

We were thinking of using Altera's Modular SGDMA (mSGDMA) controller. As I understand it, that only works with 32-bit Avalon addresses and wouldn't know anything about PCI bus addresses, so some mapping between PCI addresses and Avalon addresses would be required. 

 

As I understand it, the Altera PCIe hard-IP provides the standard PCI BAR registers to map certain regions of the Avalon address space into the PCI bus address space. The Avalon base address and size of each BAR region is assigned in the FPGA design, and the assignment of the PCI bus base address of each BAR region is assigned by the PCI host (on a PC, firstly by the PCI-BIOS code, and later by the host operating system). 

 

Things like the mSGDMA controller only know about Avalon addresses, so in order for it to access regions in the PCI bus address space, the PCI bus addresses need to be mapped into the Avalon address space. That's what the "Avalon-MM-to-PCI Express Address Translation Table" is for. This table is either fixed or dynamic and maps one or more "pages" of PCI bus address space into the Avalon-MM address space. Up to 16 fixed table entries or up to 512 dynamic table entries can be defined and the page size can be configured to any value from 2^12 (4096 bytes) to 2^32 (4 gibibytes). 

 

There are three main possibilities I can think of: 

 

 

  1. Use a single, very large fixed page to map PCI bus address 0. Using a page size of 2^30 or 2^31 would allow access to PCI bus addresses in the range 0 to 2^30 - 1 (0 to 1 GB) or 0 to 2^31 - 1 (0 to 2 GB). (I don't think a page size of 2^32 could be used, as that would use up all the Avalon address space.) 

  2. Use a single or a few fairly large dynamic pages. This would allow access to one or a few fairly large, contiguous regions of PCI bus address space, dynamically configurable by the host PC software. 

  3. Use a large number of small (4096-byte) dynamic pages. This would allow a large number of 4096-byte pages to be simultaneously mapped from PCI bus address space into Avalon address space. The page size of 4096 would match the physical memory page size of the host system (at least for most host architectures). If using mSGDMA controllers, you'd need one translation table entry for each committed scatter-gather descriptor, up to a maximum of 512, so the maximum amount of PCI memory that could be mapped simultaneously would be 2 MiB (512 x 4096 bytes). 

 

 

 

(The above is just my understanding and may be partly or completely wrong. Corrections of my misunderstanding are appreciated!) 

 

I don't think we have the budget to use third-party IP in this project, but I'm interested in how the Lancero solution compares to the Altera offerings.
Valued Contributor III

To get any performance at all out of PCIe, you have to use PCIe burst transfers (these are transaction layer packets (TLPs) that carry multiple words of data). 

 

So if you are doing a PCIe read, the PCIe interface logic has to know how many words of data you are going to require before it requests any of the data. This can only really be done if the DMA controller is integrated into the PCIe master logic. 

 

Although there are pipelined Avalon cycles, I suspect they are not capable of being converted into PCIe read bursts (they might be translatable into PCIe write bursts as that is easier). 

 

But I've not used the Altera PCIe master (just the slave).
Valued Contributor III

Quoting last two posts (different authors): 

 

 

--- Quote Start ---  

The above is just my understanding and may be partly or completely wrong.  

 

--- Quote End ---  

 

I think you are quite right there. 

 

 

--- Quote Start ---  

This can only really be done if the DMA controller is integrated into the PCIe master logic. 

--- Quote End ---  

 

There are a lot of optimizations and removal of limitations if the SGDMA controller is integrated in the PCIe master logic. Arbitrary access to 64-bit host memory is just one thing. 

 

 

--- Quote Start ---  

... I'm interested in how the Lancero solution compares to the Altera offerings. 

--- Quote End ---  

 

I don't think this forum is meant to discuss Altera Partner IP, so I suggest looking at its web page and asking there.
Valued Contributor III

We have started a new design using Qsys targeting a Stratix V and have arrived at the same dilemma. In fact, I had come up with almost the same possibilities you listed to work with the Avalon-MM Stratix V Hard IP for PCI Express. I’ve begun considering the unpleasant option of trying to use the Avalon-ST version of the PCIe Hard IP and developing our own TLP encoders and decoders, DMA controllers, etc. I’m very curious which solution you ended up using, ijabbott. Cheers!

Valued Contributor III

IP Compiler for PCI Express generates the Avalon slave 'txs' port with txs_burstcount[6..0] - this suggests bursts of up to 64 cycles x 64-bit data width = 512 bytes (128 DW), which is a reasonable size for a burst TLP. With this port the TLP headers can be correctly set ahead of payload transmission. 

 

Since the core uses a fixed hardware buffer for retransmission, with fixed 'pages' for each TLP (rather than a circular buffer), there is a payload limit for packets of either 256 bytes or 128 bytes. So certain values of txs_burstcount imply packets too big to buffer; with a 256-byte limit the maximum would be 64 DW, or 32 burst cycles. 

 

Given the Altera DMA controller uses this same HW port to do its DMA writes, you should be able to get the same performance writing to that port yourself without resorting to the streaming interface. Using the streaming interface I doubt you can exceed 64 DW, as you need retransmission support.
Valued Contributor III

Shuckc is correct about the bursting properties of the bridge -- if you create a burst transaction on the Avalon side of the bus, it will pack all the data it can (based on other PCIe rules) into one TLP for maximum throughput.  

 

However, I think the question you're asking is more basic. There are two modes you can use the PCIe IP in: completer-only or requester/completer. If you're going to use a DMA in the FPGA, you need to use requester/completer mode. The DMA will write data to the "txs" (transmit slave) port on the IP (this port is in addition to one or more BARs, which are Avalon master ports). Then you'll use an address translation table to convert Avalon addresses to PCIe (64-bit) addresses. For the txs port, the translation table can contain up to 512 entries of a given size (so you could have 512 x 1MB pages, or 512 x 2MB pages, or fewer but larger entries (e.g. 4 x 256MB pages), or smaller ones (e.g. 2 x 4kB), etc.). These can be static or dynamically allocated (through the CRA port). Have a look at the IP Compiler for PCI Express User Guide section called "PCI Express Avalon-MM Bridge", and the "Address Translation" section. Basically, on the Avalon side, the MS bits of your master address (going to the txs slave) will be translated to (a greater number of) the MS bits of the 64-bit PCI address space, based on the table.  

 

This is separate from the address translation table for BARs, which operates a lot differently, but won't get you the throughput you're looking for. 

 

So if your txs port was set up with a translation table page size of 1KB (to make the math easy) and you had 2 entries, and you set up the txs port to start at Avalon location 0x0, you'd have an address range of 0x0 to 0x7ff for the txs port. Now your translation table would look like: 

0x0-0x3ff -> some 64-bit address range that's 1k in size 

0x400-0x7ff -> another (possibly totally different) 64-bit address range that's 1k in size.
Valued Contributor III

 

--- Quote Start ---  

Shuckc is correct about the bursting properties of the bridge -- if you create a burst transaction on the Avalon side of the bus, it will pack all the data it can (based on other PCIe rules) into one TLP for maximum throughput. 

<...> 

--- Quote End ---  

 

What about the PCIe Qsys design flow using the Avalon-MM interface? How does user logic feed data to, and receive data from, the PCIe core inside Qsys?
Valued Contributor III

You need to use one of the dma controllers to do the burst writes into the Avalon slave interface of the PCIe block. 

Before requesting the transfer you'll need to set the address translation tables so that the correct physical address bits are used. 

 

Last time I looked none of the DMA controllers supported 64-bit addressing on one port, so it isn't possible to avoid the address translation tables in the PCIe block. 

I also remember having difficulty configuring 32-bit address transparency. 

 

I can't imagine that you'd want to link the PCIe Avalon slave to a normal master (like a Nios CPU) - since you really don't want to stall while the transfer takes place. Better would be a 'single transfer (degenerate) DMA controller' to which you write the physical address and data, and then poll for completion. 

 

Another useful item would be a memory block that is dual ported as an Avalon slave and to 'PCIe dma logic'. You could then arrange for the data to be in this special memory block and directly request a PCIe transfer to/from it. This would save resources and reduce latency. 

 

Unfortunately Altera don't seem to be making this easy to use.
Valued Contributor III

 

--- Quote Start ---  

You need to use one of the dma controllers to do the burst writes into the Avalon slave interface of the PCIe block. 

<...> 

--- Quote End ---  

 

 

Hi dsl 

 

Thanks for your reply! 

Yes, Altera doesn't seem to make this easy. When my data is ready inside the dual-port RAM, how can I initiate (request or start) a PCIe transfer? Is there any status or flag signal for the application software to poll for this data transfer?
Valued Contributor III

I'm now trying to get this to work, and of course it doesn't. 

I don't need high throughput but do need long TLPs (read and write) and asynchronous operation (controlled by a nios cpu). 

The 'simple' DMA controller ought to work, but when I request a transfer all that happens is the 'BUSY' bit in the status register is set. 

Unfortunately it is a bit difficult to connect the JTAG port to the board I'm using, making signaltap unusable. 

 

It might just be that I've configured the DMA controller incorrectly (in sopc). 

I haven't found any info on the required parameters for PCIe DMA. 

I decided that the 'best bet' was to 'enable burst transfers' and I set the 'maximum burst size' to 128. 

There is some strange comment in the documentation that the maximum transfer length must be less than the maximum burst length - is this really true? 

The longest transfer I need to do is 0x120 bytes, most will be shorter. 

 

Possibly I should be enabling (and using) 64-bit transfers? 

 

Any ideas what I've done wrong? 

 

The SGDMA is overcomplex for what I need.
Valued Contributor III

Hi all, 

 

- I already have a design in which a NIOS II communicates with an x86 processor through shared memory over the PCIe interface. 

- In the new design, my aim is to move the shared memory from the FPGA SSRAM onto the x86 processor's DDR memory. 

- I am aware that this would require some complex address translation logic to be included in the fabric. 

- I understand that the "txs" port of the PCIe IP core can access host memory, but I want to know how it can reach the DDR memory, or some other internal memory, of the x86 processor. 

- I would also like to know how the DDR memory on the x86 processor is accessed. 

- I am interested to know if someone has already achieved something similar, and whether it is possible to get hold of a reference design for this or a related configuration to start with.