
Linux driver for Altera PCIe HIP, or a simple method to access PCIe

Altera_Forum
Honored Contributor II
4,508 Views

Hi everyone! 

 

I know it's an old problem, but I don't see it solved yet. 

I'm currently working on an FPGA design on the Arria II GX Development Kit. I made an I2C core which I now have to feed with some bits and bytes (device address, data and so on; I think PIOs are a good choice). I connected my core through the PIOs to the PCIe Compiler in the design example "a2gx_qsys_pcie_gen1x4" (from http://www.alterawiki.com/wiki/pci_express_in_qsys_example_designs) and removed the on-chip memory because I placed my PIOs at its base addresses. 

My host PC has to run Debian 6.0. 

Now I just want to access the base addresses of my PIOs to check whether my core works, just like the "Simple version of software source code", also found here: http://www.alterawiki.com/uploads/b/b4/alt_pcie_qsys_simple_sw.zip 

I tested it long ago on Windows, but it uses a Jungo driver which only works for 30 days. 

I only need a very simple driver. I just want to read and write at the base addresses. 

My biggest problem is that I have absolutely no knowledge of Linux drivers. 

Is there someone out there who can help me a bit with getting started? 

Or has Altera finally written a useful but simple driver? 

 

I've already tried these: 

http://trac.assembla.com/altpciechdma/browser/  

: seems incomplete and hasn't been accessed for four years now. 

 

ftp://ftp.altera.com/up/pub/altera_material/12.0/tutorials/using_pcie_on_de4_design_files.zip 

ftp://ftp.altera.com/up/pub/altera_material/12.0/tutorials/  

: looks good, at least I THINK I installed the module but then something crashed. 

 

Please Help. 

 

Thank you.
13 Replies
Altera_Forum
Honored Contributor II
1,498 Views

You don't need a driver to access PCIe devices under Linux; you can use the sysfs nodes that Linux automatically sets up for you. 

 

Download the examples on this thread 

 

http://www.alteraforum.com/forum/showthread.php?t=35678 

 

Build and run the pci_debug tool. 
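
For anyone wondering what the sysfs approach looks like in code, here is a minimal sketch in C. It is not the pci_debug source, just an illustration: it maps a BAR through the resourceN file that Linux creates under /sys/bus/pci/devices/ and does 32-bit reads and writes through the mapping. The device address in the comment and the helper names (bar_map, reg_read32, reg_write32, bar_unmap) are made up for this example; use lspci to find your own device, and expect to need root.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `len` bytes of a PCI BAR exposed through sysfs.
 * `path` would be something like
 * /sys/bus/pci/devices/0000:01:00.0/resource0 (hypothetical address).
 * Returns NULL on failure. */
volatile uint32_t *bar_map(const char *path, size_t len)
{
    int fd = open(path, O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : (volatile uint32_t *)p;
}

/* Write one 32-bit register at a byte offset inside the mapped BAR. */
void reg_write32(volatile uint32_t *bar, size_t byte_off, uint32_t value)
{
    assert(byte_off % 4 == 0); /* keep accesses 32-bit aligned */
    bar[byte_off / 4] = value;
}

/* Read one 32-bit register at a byte offset inside the mapped BAR. */
uint32_t reg_read32(volatile uint32_t *bar, size_t byte_off)
{
    assert(byte_off % 4 == 0);
    return bar[byte_off / 4];
}

/* Release a mapping obtained from bar_map(). */
void bar_unmap(volatile uint32_t *bar, size_t len)
{
    munmap((void *)bar, len);
}
```

A call such as bar_map("/sys/bus/pci/devices/0000:01:00.0/resource0", 4096) followed by reg_read32(bar, 0x40) would read the register at offset 0x40 of BAR0, assuming that is where your PIO sits.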

 

Cheers, 

Dave
Altera_Forum
Honored Contributor II
1,498 Views

WHAAAAAAAAA!!!!! 

 

IT WORKS!!! 

 

Thank you very much. This was exactly what I needed. Somehow I knew it must be easy but I didn't find the way. 

 

Again thank you.
Altera_Forum
Honored Contributor II
1,498 Views

Glad to hear it helped :) 

 

Cheers, 

Dave
Altera_Forum
Honored Contributor II
1,498 Views

Now I have a little speed problem: 

I made a design in qsys where a 32bit fifo is connected to the avalon bus on both sides (in and out). It has a depth of 8192. 

With a little addition I used the pci_debug program to measure the time for write and read operations with the following code: 

 

    errorcounter = 0;

    /* Time 8190 writes to the FIFO input */
    writedata = 0x00000010;
    clock_gettime(CLOCK_REALTIME, &start);
    for (i = 0; i < 8190; i++) {
        *(volatile unsigned int *)(dev->addr + 0x09400000) = (writedata * i);
    }
    clock_gettime(CLOCK_REALTIME, &end);
    clock1 = (unsigned long)(end.tv_nsec - start.tv_nsec);
    clock1sec = (unsigned long)(end.tv_sec - start.tv_sec);

    /* Time 8190 reads from the FIFO output */
    readdata = 0x00000010;
    clock_gettime(CLOCK_REALTIME, &start);
    for (i = 0; i < 8190; i++) {
        readdata = *(volatile unsigned int *)(dev->addr + 0x09400040);
    }
    clock_gettime(CLOCK_REALTIME, &end);
    clock2 = (unsigned long)(end.tv_nsec - start.tv_nsec);
    clock2sec = (unsigned long)(end.tv_sec - start.tv_sec);

    printf("Summary 32Bit Fifo:\n");
    printf("8190 Cells written in %.2ld s and %.9ld ns.\n", clock1sec, clock1);
    printf("8190 Cells read in %.2ld s and %.9ld ns.\n", clock2sec, clock2);

 

To my surprise, the 8190 writes took 792 us (about 41 MB/s) and the reads took 11869 us (about 2.8 MB/s). 

Why is it so slow? 
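
A side note on the measurement itself: subtracting tv_sec and tv_nsec separately gives a negative nanosecond part whenever the interval crosses a second boundary, and CLOCK_MONOTONIC is usually the better clock for intervals because it is not adjusted while you measure. The sketch below is one way to fold both fields into a single nanosecond count and turn it into MB/s; the helper names (elapsed_ns, mb_per_s) are made up for this example, and 1 MB is taken as 10^6 bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Elapsed nanoseconds between two timespecs. Combining both fields
 * handles the case where end.tv_nsec is smaller than start.tv_nsec. */
int64_t elapsed_ns(struct timespec start, struct timespec end)
{
    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
         + (int64_t)(end.tv_nsec - start.tv_nsec);
}

/* Throughput in MB/s (1 MB = 10^6 bytes): bytes per ns times 1000. */
double mb_per_s(uint64_t bytes, int64_t ns)
{
    assert(ns > 0);
    return (double)bytes * 1000.0 / (double)ns;
}
```

With 8190 transfers of 4 bytes each (32760 bytes), mb_per_s(32760, 792000) comes out at roughly 41 MB/s and mb_per_s(32760, 11869000) at roughly 2.8 MB/s, which matches the figures above.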

 

To my design: 

I used the "QSYS PCIe to ext memory" reference design and reduced it to PCIe 1x and the clock sources. Then I added the fifo. 

The Design works but it seems that it is slow as hell. Is there something I've done wrong or how else can I speed up the design? 

 

https://www.alteraforum.com/forum/attachment.php?attachmentid=6816  

 

I want to use a schematic file for my top-level design-file so I tried to translate the original file to a BDF-File: 

 

https://www.alteraforum.com/forum/attachment.php?attachmentid=6817  

 

Have I forgotten something?
Altera_Forum
Honored Contributor II
1,498 Views

The underlying problem is that PCIe isn't a 'bus' protocol but a packet-based comms protocol, much like HDLC. So a read/write is two packets, one carrying the request and the other the response (plus the ones that generate credit). Each request can transfer a reasonable number of bytes (probably 128 or 256), so while the achievable throughput is high, so is the latency. 

With PIO requests the reads are synchronous, so you'll almost certainly have a separate request (TLP) for every 32bit transfer, and the transfers won't overlap. 

The writes fare a lot better: the requests can be performed asynchronously, so they will overlap. 

 

The only way to get reasonable throughput is to generate TLPs that request larger data blocks. Typically this requires that you use a dma engine that is tightly coupled with the PCIe master logic. 
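
To put rough numbers on this, here is a very simple model in C: if every read has to complete before the next one starts, throughput is just payload size divided by round-trip time. The function names (ns_per_access, serialized_mb_per_s) are made up for this sketch, and it assumes the round-trip time does not grow with the payload, which is only approximately true.

```c
#include <assert.h>

/* Average time per access in nanoseconds, from a total in microseconds. */
double ns_per_access(double total_us, unsigned accesses)
{
    assert(accesses > 0);
    return total_us * 1000.0 / accesses;
}

/* MB/s (1 MB = 10^6 bytes) when each request must finish before the
 * next one is issued, i.e. requests never overlap. */
double serialized_mb_per_s(unsigned payload_bytes, double round_trip_ns)
{
    assert(round_trip_ns > 0.0);
    return payload_bytes * 1000.0 / round_trip_ns;
}
```

Plugging in the numbers from this thread: 11869 us for 8190 reads is roughly 1.45 us per read, which gives about 2.8 MB/s for 4-byte reads. If a 128-byte read completed in the same time, the same formula would give roughly 88 MB/s; in practice a larger payload takes somewhat longer, so treat that as an upper bound for this simple model.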

 

Since your test repeatedly accesses the same location, you are forcing small transfers to be used, even if the master is capable of merging the requests (it might for writes, but that might require use of write-combining instructions). 

 

I had to write a driver (well code to drive!) for the PCIe dma engine on the little ppc we run linux on. 

 

Also, if you need to access a FIFO (rather than a memory block) you probably want to alias the fifo to a few kb of address space so that long TLPs can be used to access it.
Altera_Forum
Honored Contributor II
1,498 Views

OK, just to understand this: 

A write-request doesn't wait for an "ack" and the next write-request can start immediately. The "ack" can be transferred while the next write is already in progress. 

A read-request has to wait for the result of the read-request before a new transaction can start, correct? 

And this is why a read-request takes about 15 times more time. It just has to wait until every transaction is finished. 

 

Did I get this right? 

 

And this gives me the next question: (I'm sorry, I'm more into hardware, which is also why I prefer to draw BDF files) 

How do I alias the fifo? Is this done in software or in the design? I don't see any option for the fifo that is called "alias". 

My main problem is that I have to fill and empty the fifo very quickly (about 1 Gbit/s), but not the whole time. I have a lot of latency in my communication; only about 50% of the time is used for communication. 

 

Do you have experience in the PCIe hip? There are so many connections I don't use. Is my wiring correct for simple purposes?
Altera_Forum
Honored Contributor II
1,498 Views

About right - PCIe does allow multiple outstanding reads, but a processor is unlikely to generate them for normal memory accesses. 

 

It is worth noting that PCIe models a 64bit data bus; on the fpga a 32bit request ends up going through a bus width adapter and generating two 32bit cycles, one of which has no asserted byte enables. So making the PCIe side of the fifo 64bit will remove a couple of clocks. 

 

If your linux host is an x86, you might find that SSE2 transfers generate a single TLP for 128 bits (and AVX ones for 256 bits, if supported by the cpu and the linux version you are using). 

It is also possible that unrolling the loop might generate multiple concurrent read TLPs. 

 

Neither qsys nor sopc seems to let you alias addresses. I think it can be done by feeding all the Avalon signals through a 'conduit' and replacing the relevant address bits with zero. Being able to alias internal memory would be useful for software cyclic buffers; 9, 18, or 27 bit wide memory might also be useful for some purposes (eg lookup tables). 
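
The address trick itself is simple. Modelled in C purely for illustration (the real thing would be logic in the FPGA, not software), forcing the low address bits to zero folds a whole power-of-two window onto one location. The function name alias_to_base and the 4 KB window size are assumptions made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Fold every address inside a power-of-two window onto the window base
 * by clearing the low address bits. */
uint32_t alias_to_base(uint32_t addr, uint32_t window_bytes)
{
    /* the window size must be a non-zero power of two */
    assert(window_bytes != 0 && (window_bytes & (window_bytes - 1u)) == 0);
    return addr & ~(window_bytes - 1u);
}
```

With a 4 KB window at 0x09400000, any access from 0x09400000 up to 0x09400FFF ends up at 0x09400000, so a burst over consecutive addresses all lands on the same fifo port.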

 

I (mostly) do software, though I know the PCIe stuff caused a certain amount of grief. There are some random, spurious constraints on the way PCIe slave windows get assigned to Avalon addresses. 

 

I can do reads at about 21ns/byte (using large TLPs and overlapped requests, with a system call overhead for each transfer). That still isn't 1GBit/s. That is from a small ppc, NFI how to generate long TLPs from any x86 cpu.
Altera_Forum
Honored Contributor II
1,498 Views

Hi Steffen, 

 

PCI and PCIe performance is terrible for individual (non-DMA) read/write accesses. 

 

http://www.ovro.caltech.edu/~dwh/correlator/pdf/pci_performance.pdf 

 

Once you've got basic accesses working using the pci_debug tool, you can use that tool to manually program a DMA controller. The DMA controller typically resides on the PCIe peripheral board, i.e., your hardware. 

 

If you have two boards, then you can DMA between them using the addresses provided by lspci. If you want to DMA from the board to host memory, then you need to create a basic host-side driver that allows you to allocate a page of memory, and provides the physical address of that page. You can then program that info into the DMA controller. 

 

There's some driver example code that shows how to do this on this page: 

 

http://www.ovro.caltech.edu/~dwh/correlator/cobra_docs.html 

 

It's either in the "COBRA device driver" or the ESC 2006 files. 

 

Cheers, 

Dave
Altera_Forum
Honored Contributor II
1,498 Views

Hi Steffen, 

 

The Altera University Program has a nice write-up of a hardware design and Linux driver they have created for the DE4 board: 

 

ftp://ftp.altera.com/up/pub/altera_material/12.1/tutorials/using_pcie_on_de4.pdf 

ftp://ftp.altera.com/up/pub/altera_material/12.1/tutorials/using_pcie_on_de4_design_files.zip 

 

I haven't used it, but it looks like it would suit your needs. 

 

Note that they have 'cheated' in that the hardware design creates a 2GB Avalon-MM slave window onto the host PC address space, and then the Linux driver is forced to only allocate addresses between 0 and 2GB. This limitation is due to the fact that the Altera Qsys DMA controller is not really designed for use with PCIe. A true PCIe DMA controller would bridge between PCIe and Avalon-MM and would have scatter-gather entries with both Avalon-MM and PCIe addresses. 

 

Cheers, 

Dave
Altera_Forum
Honored Contributor II
1,498 Views

Wow, this is a lot of information. Give me some time to check this. 

 

I'll be back! 

 

Thank you 

 

Steffen
Altera_Forum
Honored Contributor II
1,498 Views

I'd steal the PCI code, but write a character driver. 

You can then use preadv() and pwritev() in the application, and a 'first cut' driver just does copy_to/from_user() directly between the user buffer and the ioremap()ed PCIe addresses. 

 

If you remember to update the file offset, you can use hexdump as a test program (might do lots of 16 byte reads). 
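
A rough sketch of what the application side could look like, assuming a character driver of that kind exists. The device node name in the comment and the wrapper names (dev_open, dev_read, dev_write) are hypothetical; only open(), preadv() and pwritev() are real library calls. The offset argument would select the position inside the mapped PCIe window.

```c
#define _DEFAULT_SOURCE /* for preadv()/pwritev() on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Open the device node, e.g. "/dev/fpga_bar0" (hypothetical name). */
int dev_open(const char *path)
{
    return open(path, O_RDWR);
}

/* Read `len` bytes starting at `offset` within the device window. */
ssize_t dev_read(int fd, void *buf, size_t len, off_t offset)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    return preadv(fd, &iov, 1, offset);
}

/* Write `len` bytes starting at `offset` within the device window. */
ssize_t dev_write(int fd, const void *buf, size_t len, off_t offset)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    return pwritev(fd, &iov, 1, offset);
}
```

Passing a larger buffer in one call is what gives the driver the chance to move a whole block per system call instead of one word at a time.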

 

Getting dma working involves finding the correct dma engine! Given that a 1k transfer probably doesn't take significantly longer than a single word you probably want to spin waiting for 'dma done', not an interrupt terminated dma. You also want a dma that is cheap to setup. 
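
The spin-wait could be sketched like this in C. The status register and its 'done' bit are hypothetical, since they depend entirely on which dma engine you end up using, and the function name dma_wait_done is made up; the spin limit is there so that a stuck transfer cannot hang the caller forever.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Poll a (hypothetical) dma status register until any bit in done_mask
 * is set. Returns 0 on completion, -1 if max_spins polls pass first. */
int dma_wait_done(volatile const uint32_t *status, uint32_t done_mask,
                  unsigned long max_spins)
{
    assert(status != NULL);
    while (max_spins--) {
        if (*status & done_mask)
            return 0;
    }
    return -1;
}
```

Something like dma_wait_done(status_reg, 0x1, 100000) would return 0 as soon as bit 0 is set, and -1 if it never shows up within the limit.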

 

For the ppc we use, linux doesn't have a driver at all for the pcie dma block, so I could write a very simple one.
Altera_Forum
Honored Contributor II
1,498 Views

Dear Experts,  

 

What do you think about this driver? It's already available on the Altera Wiki. 

http://www.alterawiki.com/wiki/linux_pcie_driver 

 

Thanks.