OK, I've been staring at the PCIe User Guide for a couple of days, and I don't really know where to start with this thing. We have a design that requires the following setup: PC (currently Windows, later Linux) <-- PCIe x1 --> Arria2GX190 16k Shared RAM <-- Control Interface --> ARM chip. Basically the Arria II just acts as a bridge between the ARM and the PC. I don't really know where to start with the PCIe core, and I have very little experience with PCI. Does anyone know of a design to get me going? Some nice demos of the core?
The PCI spec probably makes slightly easier bedtime reading! Once you've loosely understood a bit about how the bus works (logically, that is), you can then think a bit about PCIe. PCIe is basically PCI using HDLC-style frames on a serial link instead of a parallel bus. The Altera PCIe slave isn't fast: when fed through the PCIe->Avalon interface, single cycles are positively lethargic (think ISA bus speeds, then slow it down some). Fortunately a PCIe transfer can carry a lot of bytes (probably 128) and the per-byte cost is small. Unfortunately you (almost always) need to start with a DMA engine in order to do multi-word cycles.
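To put a rough number on "the per-byte cost is small", here is a minimal sketch of the wire efficiency of one memory write packet. The overhead figures are my assumptions for illustration (Gen1/Gen2 link, 3-DW header, no ECRC), not something taken from this thread, so check them against the spec for your link. It also ignores acknowledge and flow-control traffic, and it says nothing about round-trip latency, which is what really hurts single cycles.

```python
# Rough wire efficiency of a PCIe memory write as a function of payload size.
# Assumed per-packet overhead (Gen1/Gen2 link, 32-bit addressing, no ECRC):
FRAMING_BYTES = 2    # start and end framing symbols
SEQUENCE_BYTES = 2   # data link layer sequence number
HEADER_BYTES = 12    # 3-DW transaction layer header
LCRC_BYTES = 4       # data link layer CRC
OVERHEAD_BYTES = FRAMING_BYTES + SEQUENCE_BYTES + HEADER_BYTES + LCRC_BYTES


def tlp_efficiency(payload_bytes: int) -> float:
    """Return the fraction of bytes on the wire that are payload."""
    if payload_bytes <= 0:
        raise ValueError("payload must be at least one byte")
    return payload_bytes / (payload_bytes + OVERHEAD_BYTES)


# Compare a single 32-bit word with larger bursts.
for payload in (4, 32, 128):
    print(f"{payload:4d} byte payload: {tlp_efficiency(payload):.1%} of the wire is data")
```

Under these assumptions a single 32-bit word spends most of the packet on overhead, while a 128-byte burst spends most of it on data.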
Hi Tricky,
--- Quote Start --- Dont really know where to start with the PCIe core. Very little experience with PCI. Anyone know some design to get me going. Some nice demos of the core? --- Quote End ---
I'm going to be looking at PCIe early next year. I'd be interested in hearing your experiences, and I'll feed back mine. The first thing I'd suggest is getting an example design to work, without necessarily understanding the details, just to have something working to start with. For example, I have Stratix IV GX development kits. For PCIe testing, I have a laptop with an ExpressCard that implements a PCIe-to-cabled PCIe bridge, which then connects to a motherboard containing a PCIe switch. This hardware is from OneStopSystems:
http://www.onestopsystems.com/
http://www.onestopsystems.com/pcie_over_cable_z6.php
http://www.onestopsystems.com/pcie_atx_bp.php
I prefer this to a standard motherboard, since I can test PCIe without having to deal with an OS on the PCIe motherboard, and 'unknown' hardware on the motherboard. This hardware is basically a PCIe switch. Here's my plan of attack:
1) Test the PCIe examples that come with the kit.
* I did this, and some of the examples appeared to work, while others did not. I'll start with a working example when I move onto the next step.
2) Simulate one of the working designs using Modelsim with a PCIe BFM.
* Based on (I think) your earlier post, it appears Altera dropped BFM support circa Quartus 11.0. I'd install the last version that still ships them, just to get the BFMs.
3) Make sure I can access the hardware from Linux.
* Accessing PCI and PCIe boards under Linux is very easy. No device driver development is required until you want to use interrupts. Even when you need to write Linux drivers, they're pretty straightforward.
4) Once I can access the PCIe interface from Linux, generate transactions to the core and use SignalTap II to capture transaction traces. Compare those traces to the simulations in (2). Adjust the BFM stimulus to match the hardware.
At this point, I'll have a working hardware/simulation setup, so then I can start on a specific design. Basically I'll just write a PCIe-to-Avalon-MM master (or test whatever Altera provides). I'm interested in seeing how well Linux deals with PCIe hot-swap. Cheers, Dave
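Step (3) in the plan above can be sketched without writing a driver: Linux exposes each BAR of a PCI device as a `resourceN` file in sysfs, and root can memory-map it. The following is only a sketch; the class name is made up, and the device address in the usage note is hypothetical (use `lspci` to find the real one). One caveat I am not certain about: Python does not guarantee that `struct` touches the mapping with a single 32-bit access, so for registers that care about access width a small C program is the safer tool.

```python
import mmap
import struct


class BarWindow:
    """Memory-map a PCI BAR file (or any ordinary file) and access 32-bit words."""

    def __init__(self, path: str, length: int):
        # Unbuffered read/write access; the mapping is shared with the device.
        self._file = open(path, "r+b", buffering=0)
        self._map = mmap.mmap(self._file.fileno(), length)

    def read32(self, offset: int) -> int:
        """Read one little-endian 32-bit word at a byte offset."""
        return struct.unpack_from("<I", self._map, offset)[0]

    def write32(self, offset: int, value: int) -> None:
        """Write one little-endian 32-bit word at a byte offset."""
        struct.pack_into("<I", self._map, offset, value & 0xFFFFFFFF)

    def close(self) -> None:
        self._map.close()
        self._file.close()
```

Usage, assuming the 16k shared RAM sits behind BAR0 of a device at the hypothetical address 0000:01:00.0: `bar = BarWindow("/sys/bus/pci/devices/0000:01:00.0/resource0", 16 * 1024)`, then `bar.write32(0, 0x12345678)` and `bar.read32(0)`. The same class works on an ordinary file, which is handy for trying the code without hardware.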
--- Quote Start --- The PCI spec probably makes slightly easier bedtime reading! --- Quote End --- The PCIe spec is $3500. http://www.pcisig.com/specifications/order_form I doubt it is worth reading though. It'll be mostly about the electrical and PHY layers, which are taken care of by the Altera IP. The specifications worth reading are the ones related to the physical form factor you are putting the card into, e.g., Compact PCIe (or whatever they call it), AdvancedTCA, AdvancedMC, etc. Those specifications are about $100 to $400 each from PICMG. https://www.picmg.org/v2internal/specorderformsec-nonmember.htm But you can also become an affiliate member for $1000 and you get everything (a stack of specifications about a foot high). That's my recommended bedtime reading :) Cheers, Dave
I don’t agree. You need both the PCI spec and the PCIe spec – you won’t make it with only one of them, let alone with neither. The PCI spec will teach you all about bus enumeration, config spaces and such, while the PCIe spec contains only the (added) information relevant for PCIe, like the transactions, transaction ordering, etc. Remember: You need to be a PCI SIG member to get your own vendor ID assigned and reserved. And you must remain a member while you are using this ID (for new products at least).
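As a taste of what "config spaces and such" means in practice, here is a small sketch that decodes the first fields of a standard (type 0) configuration header. The field offsets are the standard ones; the function and type names are mine. On Linux the raw bytes can be read from the `config` file next to the `resourceN` files in sysfs.

```python
import struct
from typing import NamedTuple, Tuple


class ConfigHeader(NamedTuple):
    vendor_id: int
    device_id: int
    command: int
    status: int
    revision_id: int
    class_code: int          # 24 bits: base class, sub-class, programming interface
    bars: Tuple[int, ...]    # the six raw base address registers


def parse_config_header(raw: bytes) -> ConfigHeader:
    """Decode the start of a type 0 PCI configuration header."""
    if len(raw) < 0x28:
        raise ValueError("need at least 40 bytes of configuration space")
    vendor_id, device_id, command, status = struct.unpack_from("<HHHH", raw, 0x00)
    rev_and_class = struct.unpack_from("<I", raw, 0x08)[0]
    bars = struct.unpack_from("<6I", raw, 0x10)
    return ConfigHeader(vendor_id, device_id, command, status,
                        rev_and_class & 0xFF, rev_and_class >> 8, bars)


def bar_is_prefetchable(bar: int) -> bool:
    """True for a memory BAR (bit 0 clear) with the prefetchable bit (bit 3) set."""
    return (bar & 0x1) == 0 and bool(bar & 0x8)
```

The vendor/device ID pair decoded here is what the OS uses to match a driver to the board, which is why the ID assignment mentioned above matters.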
--- Quote Start --- I don’t agree. You need both the PCI spec and the PCIe spec – you won’t make it with either or none of both. The PCI spec will teach you all about the bus enumeration, config spaces and such, while the PCIe spec contains only the (added) information relevant for PCIe like the transactions, transaction ordering, etc. Remember: You need to be PCI SIG member to get your own vendor ID assigned and reserved. And you must remain member while you are using this ID (for new products at least). --- Quote End --- Ok, so help convince us why a PCI/PCIe developer would need to spend $1000 on the PCI spec and another $3500 on the PCIe specification. Remember, we're developing with existing PCI/PCIe cores, not trying to develop our own. If I was going to develop a core, then by all means I would buy the specifications. If the specifications were both $1000 each, I'd probably just buy them too. I have the PCI specification, and sure, while it does contain the 'official' wording, I didn't find it added much over the PLX documentation for PCI devices, or the Altera/Xilinx/Lattice/etc PCI documentation. With a PCI bus BFM and then hardware and a PCI bus analyzer, development was straight-forward. Because I was developing for CompactPCI, I definitely needed that specification, but for its mechanical details, not so much of how it use PCI (other than pin assignments). Now, diving into PCIe, the transaction discussion should be fairly well described in Altera's PCI documentation, though, I wouldn't be surprised if it was hazy in places :), but any haze should be cleared up using a well-written BFM. Ok, so Altera has stopped distributing their BFM. So, would I be better off spending $3500 on a PCIe BFM? Thanks for the feedback! Cheers, Dave
Apparently the BFM should be back for Q12, but for now I can just use 10.1 (that has the BFM too). The main interface is just an Avalon-ST or Avalon-MM interface, so I'm hoping I can just squirt a load of data in and let it get on with it!
I think the two specs can be ordered for $3100 total. Consider becoming a PCI SIG member for a year ($3000), then each spec is $50. I would not recommend visiting some darker areas of the internet that distribute prereleases of the specs. While they convey the main ideas of PCI/PCIe, those documents differ from the released specs in subtle ways and might lead to more effort in finding out what’s wrong. If you go for an MM interface then, yes, most of the work is already done by Altera. Just ‘choose’ a valid/unique vendor/device ID pair and play with the other parameters a little bit. But if you go for AST or a similar PCIe transaction-level interface, I’m pretty sure you will have a hard time understanding all about credits and transaction completion as well as handling the error interface, MSI/MSI-X, transaction priority, ordering and timeout. These things are all described in the PCIe spec, and Altera just describes the relevant parts of their IP, not the underlying techniques and ideas.
Hi Matthias,--- Quote Start --- I think the two specs can be ordered for $3100 total. Consider becoming PCI SIG member for a year ($3000), then each spec is $50. I would not recommend visiting some darker areas of the internet that distribute prereleases of the specs. While transporting the main ideas of PCI/PCIe, those documents are different in subtle ways and might lead to more effort in finding out what’s wrong. If you go for a MM interface then, yes, most of the work is already done by Altera. Just ‘choose’ a valid/unique vendor/device ID pair and play with the other parameters a little bit. But if you go for AST or a similar PCIe transaction-level interface, I’m pretty sure you will have a hard time understanding all about credits and transaction completion as well as handling the error interface, MSI/MSI-X, transaction priority, ordering and timeout. These things are all described in the PCIe spec, and Altera just describes the relevant parts of their IP, not the underlying techniques and ideas. --- Quote End --- Excellent advice, thanks! Any other good references on PCIe? How's your experience with the Altera core? Any warnings? Any suggestions for alternative BFMs? Cheers, Dave
Hi Dave, I don’t know which interface you’re gonna use. If you are going for Avalon-ST or another transaction-level interface with DMA, you sure will need in-depth knowledge of inbound completion credit calculation. There is a well-written document from Xilinx which covers some possible algorithms in perfect detail, see the Virtex-6 PCIe user guide (http://www.alteraforum.com/forum/www.xilinx.com/support/documentation/user.../v6_pcie_ug517.pdf), Appendix E. There are at least two pitfalls I noticed when developing with Altera on the transaction-level AST interface. The first is that there is a signal rx_st_mask<n> used to indicate that your logic is not capable of receiving any more non-posted requests, like PIO read requests from the CPU. There are two sad things about this specific operation: You must accept up to 14 (AST 64 bit) or even 26 (AST 128 bit) more non-posted requests after this signal has been asserted. And, together with the requirement to not hold up incoming completions and posted requests just because of a busy read completion operation, you cannot simply de-assert rx_st_ready<n> – remember, there are transaction ordering rules in PCIe. You are asking for trouble in the form of deadlocks if you refuse to receive incoming transactions just because you have an outbound completion (for a read request) blocking your RX port. End result: You need a dedicated FIFO on the RX port capable of holding at least 14 non-posted requests (64 bit interface assumed) – even better, make it 16 or 20 so that you don’t trigger rx_st_mask<n> right away when the first non-posted request is received. This is different from, e.g., Xilinx, where this part of the buffering and transaction reordering is done by the IP (see Table 2–13, signal trn_rnp_ok_n). De-assert rx_st_ready<n> only for those times when your internal processing (not PCIe TX related) doesn’t allow any more data, like a full received completion data buffer or a full received PIO posted write data buffer.
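The FIFO sizing argument above can be checked with a small behavioural model. This is only a sketch of the bookkeeping, not of the Altera interface itself: the class is hypothetical, and the only number taken from the text is the allowance of 14 further requests for the 64-bit interface. The mask is raised as soon as the free space is down to that allowance, so nothing can overflow.

```python
from collections import deque

# Requests that may still arrive after the mask is raised (64-bit AST figure quoted above).
MAX_REQUESTS_AFTER_MASK = 14


class NonPostedFifo:
    """Behavioural model of a dedicated FIFO for incoming non-posted requests."""

    def __init__(self, depth: int):
        if depth < MAX_REQUESTS_AFTER_MASK:
            raise ValueError("depth cannot cover the requests arriving after the mask")
        self.depth = depth
        self._entries = deque()

    def __len__(self) -> int:
        return len(self._entries)

    @property
    def mask(self) -> bool:
        """Model of rx_st_mask: raised once only the in-flight allowance is free."""
        return self.depth - len(self._entries) <= MAX_REQUESTS_AFTER_MASK

    def push(self, request) -> None:
        """Store an incoming request; overflow means the sizing was wrong."""
        if len(self._entries) >= self.depth:
            raise OverflowError("non-posted request lost")
        self._entries.append(request)

    def pop(self):
        """Hand the oldest request to the completion logic."""
        return self._entries.popleft()
```

With a depth of exactly 14 the mask is raised from the start, which is the "trigger right away" situation described above; with 16 entries, two requests can be queued before the mask goes up.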
The second topic you have to keep in mind when designing for Altera PCIe: Any outbound transaction must be maintained at line rate and you have to be prepared to stream the whole transaction to PCIe at once. While there is tx_st_valid<n>, which suggests (from the Avalon-ST spec) that you can insert wait states into the data stream at will, the signal must stay asserted between tx_st_sop<n> and tx_st_eop<n> (while the IP is ready by asserting tx_st_ready<n>); you are not allowed to de-assert it just because you cannot supply the data at full rate and have to wait for it. Again, this is different from Xilinx, where you can choose to use such a streamed mode of operation (trn_tstr_n='0') or use an IP-level buffer (see Table 2–12 in the above mentioned document). Bottom line: Either design your data source to supply data at full rate, or add an explicit transaction FIFO that starts to transmit transactions to the IP only when they are completely written to the FIFO. Side note: This comparison with the competitor is not meant as an advertisement or as a list of all differences between the different IP core interfaces – there are significantly more – but to point out the major pitfalls where the designer’s assumption about the IP core might not match the actual implementation, and where the wording of the Altera UG for PCIe might be interpreted wrongly at first reading. One thing that Altera still does not document correctly is the completion timeout mechanism. PCIe requires the application to perform the completion timeout. This means that for any outbound non-posted request – i.e. a DMA read request issued by the application – which does not receive any or enough completion data within 50 μs to 50 ms (PCIe suggests not timing out quicker than 10 ms), the application must abort or retry the operation and indicate a fatal or non-fatal completion timeout error on the corresponding cpl_err bit.
If you wonder how this is done in the Chaining DMA design example – stop wondering, it is actually not implemented :(. Even more, the IP core claims that it handles unexpected completions properly, especially if “[…] The completion packet has a tag that does not match an outstanding request.” (ref: Table 12–4, page 12–4 of the current UG). I would like to ask Altera how they think the IP knows which transactions are outstanding if the application is responsible for invalidating requests based on the timeout mechanism. At the end of the day, the application has to perform completion filtering by itself rendering this IP automatism useless or even wrong. – Matthias
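The application-side bookkeeping argued for above (track outstanding reads, time them out, and filter completions yourself) might look like the following behavioural sketch. Everything here is an assumption for illustration: the class and method names are made up, the deadline is fixed at issue time rather than restarted per partial completion, and a real design would probably quarantine a timed-out tag for a while before reusing it.

```python
class OutstandingReads:
    """Tag table for outbound read requests with an application-level timeout."""

    def __init__(self, timeout_s: float = 0.010, num_tags: int = 32):
        self.timeout_s = timeout_s
        self._free_tags = list(range(num_tags))
        self._pending = {}  # tag -> [deadline in seconds, bytes still expected]

    def issue(self, length: int, now: float) -> int:
        """Allocate a tag for a read of `length` bytes issued at time `now`."""
        if not self._free_tags:
            raise RuntimeError("no free tag, stall the DMA engine")
        tag = self._free_tags.pop(0)
        self._pending[tag] = [now + self.timeout_s, length]
        return tag

    def completion(self, tag: int, length: int) -> bool:
        """Return True if the completion is expected, False if it must be dropped."""
        entry = self._pending.get(tag)
        if entry is None:
            return False  # never issued, already complete, or already timed out
        entry[1] -= length
        if entry[1] <= 0:
            del self._pending[tag]
            self._free_tags.append(tag)  # reused last, which delays tag aliasing
        return True

    def expire(self, now: float) -> list:
        """Retire every request whose deadline has passed and return their tags."""
        expired = [tag for tag, (deadline, _) in self._pending.items() if now >= deadline]
        for tag in expired:
            del self._pending[tag]
            self._free_tags.append(tag)
        return expired
```

Each tag returned by `expire` is a request to abort or retry and to report as a completion timeout; a completion that arrives for it afterwards is rejected by `completion`, which is exactly the filtering the IP cannot do on its own.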
Hi Matthias, Thanks for the warnings! I'm sure we'll be having more discussions when I get started on this in a few weeks. --- Quote Start --- I don’t know which interface you’re gonna use. --- Quote End --- I'm not sure yet either. I need to transfer relatively small volumes of data between multiple FPGA-based boards and a host CPU. I'll be using DMA, but have still to investigate what I'll need internal to the FPGA. I'll ask for your advice when I take a look at the existing infrastructure. Thanks again! Cheers, Dave
Hi, I'd recommend you start with the Altera PCIe reference design "PCI Express to DDR2 SDRAM Reference Design". Read the User Guide for this reference design and also the Altera PCIe Compiler User Guide to get started. You don't have to bother with the PCI/PCIe spec for now. See the link below for more details. http://sites.google.com/site/ednalabs/project/fpga/tips/pcie-design-with-altera --- Quote Start --- Ok - Ive been staring at the PCIe Userguide for a couple of days - and I dont really know where to start with this thing. We have a design that requires the following setup: PC (currently windows, later Linux) <-- PCIe x1 --> Arria2GX190 16k Shared ram <-- Control Interface --> Arm Chip Basically the Arria 2 just acts as a bridge between the Arm and the PC. Dont really know where to start with the PCIe core. Very little experience with PCI. Anyone know some design to get me going. Some nice demos of the core? --- Quote End ---
I'll add my input and hope it doesn't put you off. As background, I've been doing FPGA design for about 15 years now, mostly with Altera devices with Quartus and Modelsim. I don't know the tools intimately but I have a fair idea how to drive them. Before you start I would certainly read around the PCI interface. You don't have to know anything about the physical layer or the transactions, but the configuration, BAR setup etc. is useful/required background. I designed my own PCI slave core so bought the PCI spec (it was $400-odd from memory), but some judicious Googling will probably get you a PDF. I had a design I had done for a client with a Cyclone III on a PCI board I had designed. This used my logic with the Altera PCI master core memory-mapped interface. This worked well: DMA to memory, interrupts etc. all working nicely. My client wanted the same thing on a PCIe board. I designed a new board with a Cyclone IV, intending to use the hard IP with an MM interface. That was the easy bit. This is off the top of my head so a few of the small details might be slightly wrong... As you say, when it comes to the PCIe core the documentation isn't good. There are a few example builds about but there's very little explanation about what's going on if you want to start from scratch with a 'clean' build. The example build in the Wiki for the Cyclone IV with a single lane is (/was) broken. There's a reconfiguration module that has to be included with the PCIe hard core. This drives the transceiver optimisation at power up (amongst other options). In the example Verilog project the signals to and from this to the core aren't declared in the code. I only found out after some googling that undeclared Verilog wires are implicitly 1 bit wide, so the build fails with a critical error that this module is unconnected. (It works though.) This example has many virtual pin and other unexplained assignments in the qsf. The qsf is ~26 KB.
The example design in the Quartus installation directory is _very_ similar to the wiki example (and not broken), but it's not a complete project. It doesn't include any of the virtual pin assignments in the wiki example. From memory the qsf is ~1 KB. With the Altera PCI core a script was generated to apply all the necessary constraints. With the PCIe core the same script constrains _1_ clock. There is little/no explanation as to what constraints are required for the hard IP implementation. The multitude of assignments in the Wiki example really muddies the water here. I used Qsys to generate the hard IP core with an MM interface. When the core is generated there are a lot of 'spurious' ports added; these pertain to the PIPE and test outputs. There's no explanation of what to do with them. In the Wiki example above many of these are left unconnected (even inputs). In my VHDL instance I have brought them to the top level and assigned 'Z' to inputs, leaving outputs 'open'. Also be prepared to be disappointed if you want to simulate in VHDL. From what I understand the support for VHDL simulation has been dropped, although I wonder if that's a side effect of the PCIe BFMs having been dropped. It'll be interesting to see what's included with QII V12.0. I'm sure I have had other problems, but as I said this is a quick rant off the top of my head. I am really unimpressed with the quality of the documentation, the quality control of the example builds etc.; it is unacceptable. I am in the position now where it is all working apart from our interrupts failing after a period, but I'm unsure if that's a HW/FPGA or SW/driver problem yet. Good luck. Nial
I have been promised the BFM should return in Q12 (so for now, I'm pumping out anything I want to look at in Modelsim with 10.1). There is VHDL support, and the core can be generated fully (with working examples) in either language. But afaik it was all originally created in Verilog, and the VHDL isn't much more than a "port"/auto code conversion. It works though. I have also raised issues with one reference design being supplied with a SignalTap file with all the signal references being wrong (I assume they updated the ref design without updating the ST file - whoops) and with subsystem and subsystem vendor IDs not matching those expected by the supplied PC driver (so even though Windows thinks it's not working, the demo GUI works fine!). Luckily, I don't think I will need anything too complicated. I have the attached architecture (see attached image) to implement. With it only being a 16k shared RAM, I'm hoping I can just use legacy interrupts, service mem read/write requests, and send back the completions. So - any help appreciated :)
Shared memory is easy. Just responding to posted and non-posted requests from the host CPU is not that much of a burden for the FPGA designer. But shared memory is typically slow, especially for host read requests, and depending on the BAR settings it may be hard to design it properly and consistently. Remember: CPU transactions to PCIe devices, especially non-posted requests, are (very!) costly in terms of CPU performance. Depending on your motherboard and system I/O load, a PCIe read transaction typically takes 0.5 to 2 µs. During that period of time, the CPU is completely on hold – read: 100% CPU load for each request – and in multi-core systems this typically affects all cores (!) due to memory access transaction ordering constraints. If the BAR is marked as prefetchable, these numbers do improve, at the cost of spending additional effort on getting the consistency right. Shared memory approaches are good if the advantages of a completely generic one-catches-all approach outweigh its disadvantages, most notably the (read) performance. If you want to gain transfer speed and CPU performance, especially if data has to travel from the PCIe device to the CPU, you will find no easy way around DMA. The CPU+chipset are way faster at accessing the data from main memory than at fetching the data word by word from the PCIe device. The best approach is to think in transactions: If the CPU has to push out one or multiple messages, it should write them to main memory, update the related descriptor list – think: a FIFO holding message pointers – and notify the PCIe device of this change. The device will then read the descriptor and issue more DMA read requests to fetch the actual data from main memory, finally updating the status and indicating termination of activity to the CPU with an interrupt. In the meantime, the CPU was not stopped in any way and could spend its time servicing other tasks, probably generating new messages to the PCIe device.
The same approach is used in the other direction: If the PCIe device has to notify the CPU of the arrival of new data or a state change, it uses another descriptor list entry, pointing to a free main memory location, and writes the message there via DMA, followed by an update of the descriptor (typically about the status of the reception and the message length), finally indicating an interrupt to the CPU. The CPU can then fetch the message quickly and efficiently from main memory. – Matthias
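The descriptor list described above ("a FIFO holding message pointers") is usually a ring with a producer index and a consumer index. Here is a minimal software model for the CPU-to-device direction, under my own assumptions: a descriptor is just an (address, length) pair, "main memory" is a bytearray, and the DMA read is a slice. None of these names come from the thread or from Altera.

```python
class DescriptorRing:
    """Ring of (address, length) descriptors shared by a producer and a consumer."""

    def __init__(self, size: int):
        self.size = size
        self.slots = [None] * size
        self.head = 0  # next slot the consumer (device) will read
        self.tail = 0  # next slot the producer (driver) will write

    def is_empty(self) -> bool:
        return self.head == self.tail

    def is_full(self) -> bool:
        # One slot stays unused so that full and empty can be told apart.
        return (self.tail + 1) % self.size == self.head

    def post(self, address: int, length: int) -> bool:
        """Driver side: queue one message; False means the ring is full."""
        if self.is_full():
            return False
        self.slots[self.tail] = (address, length)
        self.tail = (self.tail + 1) % self.size  # the "notify the device" step
        return True

    def consume(self):
        """Device side: take the oldest descriptor, or None if there is none."""
        if self.is_empty():
            return None
        descriptor = self.slots[self.head]
        self.head = (self.head + 1) % self.size
        return descriptor


def device_service(ring: DescriptorRing, host_memory: bytearray) -> list:
    """Device side: fetch every queued message from host memory by 'DMA'."""
    fetched = []
    descriptor = ring.consume()
    while descriptor is not None:
        address, length = descriptor
        fetched.append(bytes(host_memory[address:address + length]))
        descriptor = ring.consume()
    return fetched
```

The device-to-CPU direction uses the same structure with the roles swapped: the driver posts empty buffers, the device fills them and advances its index, and the driver consumes the completed descriptors after the interrupt.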
Some CPUs have DMA engines tightly coupled with the PCIe interface, which means that they can do a multi-word transaction. Since this typically takes just as long as a single-word transaction, it can make sense to spin the CPU waiting for the DMA to finish (the interrupt entry/exit times would probably be longer than the cycle time).
Well, modern top/bottom-half driver architectures manage their interrupt load based on the current CPU/packet load. At a low packet rate or CPU load, each packet might issue a call to the interrupt handler (the top half), while at a higher packet rate, interrupt generation is only requested by the driver once all pending work is done. So, at a high load, the interrupt entry/exit overhead would vanish, while your suggested spins would not only be more frequent at a higher packet rate but also last longer if the system is loaded with I/O requests. And, of course, the spins eat precious CPU cycles that should be spent handling the high load. – Matthias
I'm some of the way through a PCIe design with a Cyclone IV. So far I've learned:
A. Start with a working design from a dev kit or eval kit. That way you start with known working hardware, FPGA design and Windows driver. If you don't start from there you will never know if your problem is with the FPGA, hardware or Windows driver.
B. Altera's eval kits break easily. The FPGA design only seems to get tested on the version of Quartus that was current when the design was done. Later versions of Quartus may break it! The reference designs from MEV are much better.
C. Using an Avalon-MM based solution and SOPC Builder works. Unfortunately, as others have warned, it's really slow!
D. Don't bother with getting the PCIe specs until you know what you don't know.
Does anyone know an easy route to a DMA solution?
Oliver, I agree on A, had no issues with B, never tried C for the reasons you mentioned, and went directly to a self-written design based on the Avalon-ST interface. Sad thing is, I have to disagree with your statement D: you have to know way more about PCIe transactions than is documented in the Altera PCIe UG, unless you want to debug all your issues out of the design at high development cost – late in the design, with some large-scale architectural changes, and with other team members starving during system integration without stable hardware and/or drivers. If one wants to follow my trail, I would advise taking at least two steps: first write a simple PIO-based design with simple hardware on Avalon-ST and an inefficient driver that already mimics an application interface close to the final design. Then replace more and more of the hardware+driver with an improved design, finally based on DMA. I have learned quite a lot about DMA-capable hardware and proper driver interfacing by reading and understanding some well-written, high-performance Linux drivers together with the public data sheets of the devices. In addition, when designing for Linux or reviewing Linux device drivers, LDD3 (http://lwn.net/kernel/ldd3/) is a valuable resource for understanding how the driver interacts with the operating system. Furthermore, one learns which interface optimizations have been applied by the kernel designers to let high-performance DMA-capable hardware do as much as possible to offload the CPU, so the design can benefit from implementing one or the other optimization right in the DMA engine instead of forcing the driver or the user code to do that work, especially avoiding data copying (zero copy (http://en.wikipedia.org/wiki/zero-copy)). Things are slightly different with Windows, as it supports different application interfaces and there is less driver code to take as an example.
But the basic operation modes are similar to the ones with Linux, and there is very rare cause to optimize hardware just for the needs of a specific OS. Most important at this stage of development is to know the order of operations and which PCIe transactions and semantic packet transactions can overlap, i.e. accesses to data, the descriptor queues and hardware registers by device and driver. Avoid race conditions by design, avoid interrupt oversampling. In my case it was highly beneficial that I was the one to implement the PCIe application in hardware as well as the one to write the Linux driver, so the development loop was very short and any bad architectural decision on one side was quickly and with no emotions revised on the other side. Another important thing are the no-snoop and relaxed ordering transaction bits. Making use of them properly not only gains performance but indicate that the hardware designer has a good understanding of semantic transactions. For example, when sending data from the device to main memory via DMA, I have three different transactions to handle: First, I write the data with no-snoop (the memory is assigned to the hardware at cache line boundaries at this time) and relaxed ordering, then I update the descriptor with just relaxed ordering on, and finally I update the queue head pointer in main memory without any of these attributes, pushing out all outstanding requests to main memory. This ensures that the driver can only see the updated head pointer when all frame and descriptor data was updated, but their updates can be re-ordered before that with other transactions for improved performance. Furthermore, inside the hardware design, updates to the descriptor queue can be combined for multiple descriptor entries, and updates to the head pointer can be deferred until some descriptors are filled or there is no further data pending. 
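The effect of the three-step write sequence above can be illustrated with a deliberately simplified model of the posted-write ordering rule: a later write may overtake an earlier one only if the later write has the relaxed ordering bit set. This is my reading of the rule reduced to one line; the real ordering table in the PCIe spec has more cases, and no-snoop is left out because it concerns cache coherency rather than ordering.

```python
from itertools import permutations

# (name, relaxed_ordering) in the order the device issues the writes.
WRITES = [("data", True), ("descriptor", True), ("head_pointer", False)]


def arrival_allowed(order, writes=WRITES) -> bool:
    """Check one arrival order (a tuple of issue indices) against the rule."""
    position = {index: pos for pos, index in enumerate(order)}
    for earlier in range(len(writes)):
        for later in range(earlier + 1, len(writes)):
            overtook = position[later] < position[earlier]
            if overtook and not writes[later][1]:
                return False  # a strictly ordered write passed an earlier one
    return True


def allowed_arrival_orders(writes=WRITES) -> list:
    """List every arrival order the rule permits, by write name."""
    return [tuple(writes[i][0] for i in order)
            for order in permutations(range(len(writes)))
            if arrival_allowed(order, writes)]


for order in allowed_arrival_orders():
    print(" -> ".join(order))
```

In this model the data and the descriptor may swap places, but the head pointer always arrives last, so the driver never sees a new head pointer before the frame and descriptor it refers to.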
In my design, the high-level driver states are handled with kind of a virtual token: While the driver owns the token, the hardware is not allowed to issue an interrupt. At this time the driver handles as much work as possible in its bottom-half, perhaps scheduled for multiple execution runs by the kernel. When the driver is done processing the queue – it is empty for RX or full for TX – it stops the bottom-half process, indicates to the hardware what was the final task it could do, and the hardware takes the token. As soon as the hardware gets aware of new tasks to do by the driver – even tasks that were already written to the descriptor queue at the time of receiving the token – the hardware interrupts the driver, handing back the token to the driver. Remember: The token is precious, and there shall only be one token, so take care not to lose or duplicate it. Another step, which depends on the target performance in terms of packets per second as well as CPU load, is the attempt to minimize PIO read operations from the CPU to the device. I currently have just one read request inside the ISR that checks for the cause of the interrupt by looking at a status register, and with higher load, the bottom-half of the driver is going into polling mode, so the ISR is called less frequently down to zero times at full load. The bottom-half does not contain any CPU read requests from device memory or registers, it just handles data, descriptors and pointers in main memory except for head/tail pointer updates written to device registers. – Matthias
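The token scheme above can be written down as a small state model. This is a sketch of the idea only, with made-up names: "work" is a running count of items the hardware has made available, and the driver reports how far it got when it returns the token. The important case is work that arrives between the driver's last look at the queue and the release of the token; the hardware catches it when it receives the token.

```python
class TokenHandshake:
    """Single interrupt token passed between driver and hardware."""

    def __init__(self):
        self.owner = "hardware"
        self.produced = 0    # items the hardware has made available so far
        self.consumed = 0    # position the driver reported when it last released
        self.interrupts = 0

    def _interrupt_if_needed(self) -> None:
        # Only the token holder may act; an interrupt hands the token to the driver.
        if self.owner == "hardware" and self.produced != self.consumed:
            self.interrupts += 1
            self.owner = "driver"

    def hardware_new_work(self, count: int = 1) -> None:
        """Hardware side: new items arrived."""
        self.produced += count
        self._interrupt_if_needed()

    def driver_poll(self) -> int:
        """Bottom half: process everything visible and return the position reached."""
        if self.owner != "driver":
            raise RuntimeError("driver runs without owning the token")
        return self.produced

    def driver_release(self, done_up_to: int) -> None:
        """Driver is done: report the final position and hand the token back."""
        if self.owner != "driver":
            raise RuntimeError("driver cannot release a token it does not own")
        self.consumed = done_up_to
        self.owner = "hardware"
        self._interrupt_if_needed()  # covers work that raced with the release
```

While the driver holds the token, any amount of new work costs no interrupt at all, which is the load-dependent behaviour described earlier in the thread.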
Matthias, I've just reached the point where it's obvious that our design must use DMA. This is a significant setback, as writing a custom DMA driver is going to impact our timescales quite badly. My knowledge of PCIe transactions is negligible, so we are looking for additional expertise in this area. It's a shame that there is no generic solution from Altera and that every user must build their own solution from scratch. Oliver