Nios® V/II Embedded Design Suite (EDS)
Support for Embedded Development Tools, Processors (SoCs and Nios® V/II processor), Embedded Development Suites (EDSs), Boot and Configuration, Operating Systems, C and C++

NIOS II 32-bit data master -> Avalon-MM PCIe 64-bit TXS slave

Altera_Forum

I had a system working where a 64-bit DMA controller was reading and writing through the TXS port of the PCIe IP.

I now want the 32-bit NIOS II data master to read and write the TXS port of the PCIe IP so it can reach system memory over the PCIe link.

Writes seem to go nowhere and reads hang.

 

I believe the fabric should take care of the 32-to-64-bit data width mismatch, but I am not sure.

I am also not sure exactly how the NIOS II should issue the reads and writes. I have an int pointer and I simply dereference it (*ptr), where ptr holds the Qsys address of the TXS port.
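To make this concrete, here is roughly the kind of access I mean, as a sketch only (TXS_BASE is a placeholder for the Qsys base address of the TXS slave, normally taken from the BSP's generated system.h):

#include <stdint.h>
#include <io.h>        /* Nios II HAL IORD_32DIRECT / IOWR_32DIRECT macros */

#define TXS_BASE 0x10000000u   /* placeholder: TXS slave base address from system.h */

void txs_test(void)
{
    /* Plain pointer dereference, as described above */
    volatile uint32_t *ptr = (volatile uint32_t *)TXS_BASE;
    uint32_t value = *ptr;     /* should become a PCIe read request through the bridge */
    *ptr = value + 1;          /* should become a PCIe posted write */

    /* Equivalent accesses via the HAL macros, which use ldwio/stwio and so
       bypass the data cache regardless of address bit 31 */
    uint32_t v2 = IORD_32DIRECT(TXS_BASE, 0);
    IOWR_32DIRECT(TXS_BASE, 0, v2 + 1);
}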

The TXS port has two address mappings of 4 KB each.

I could try having the NIOS set up a DMA transfer to see if that works better, but I really want the NIOS II to make direct references to system memory over the PCIe link.

 

Also, Qsys says that address bit 31 must be '0' for the NIOS master. Is there a reason for that?

 

Thanks in advance, Bob.
Altera_Forum

OK, I have fixed half of this issue ...

 

The Linux device driver I have written (based on other examples) enables the PCIe endpoint from the probe function, which leaves the device command register at 0x00000142. I was not getting any requester traffic whether the NIOS or the DMA initiated the read or write. I needed to add a call to pci_set_master(dev) to enable bus mastering, so the endpoint can act as a requester in addition to a completer.
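For reference, the relevant part of the probe now looks roughly like this (sketch only; the function name is illustrative and error handling is trimmed):

#include <linux/pci.h>

static int my_pcie_probe(struct pci_dev *dev, const struct pci_device_id *id)
{
    int rc;

    rc = pci_enable_device(dev);   /* enables the device but leaves Bus Master Enable clear */
    if (rc)
        return rc;

    pci_set_master(dev);           /* sets Bus Master Enable so the endpoint can issue
                                      its own (requester) read/write TLPs */

    /* ... BAR mapping, IRQ setup, etc. ... */
    return 0;
}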

 

I am now able to run the system as the "producer" and the NIOS as the "consumer": the producer sets the data-complete flag in system memory; the consumer polls that flag, reads the data from NIOS memory, and sets a status in NIOS memory indicating the data has been consumed; the producer polls that status and the cycle repeats.

 

The second issue relates to setting up the PCIe core's Avalon-MM-to-PCIe address translation. This is a table with two entries, each covering a 4 KB page, and I am still tracking the problem down. The CRA register space is accessed via BAR1, and the translation entry should hold the physical address of the system DMA buffer. For some reason, that BAR1 write ends up at a different location, in the IMEM just below the CRA space. I suspect it has something to do with the BAR address-matching scheme and how it maps onto the NIOS address map. The IMEM is at 0x00010000-0x0001ffff and the CRA slave is at 0x00020000-0x00023fff; the write to the translation table at offset 0x00021000 in the CRA space seems to end up at 0x00011000 in the IMEM.
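For reference, what the host is trying to do is essentially this (sketch only; 'cra' is assumed to be the pci_iomap'ed pointer to the BAR region that decodes to the CRA slave, dma_handle is the bus address returned by dma_alloc_coherent(), and 0x1000 is the offset of the Avalon-MM-to-PCIe address translation table within the CRA space, matching the 0x00021000 figure above):

#include <linux/kernel.h>
#include <linux/pci.h>
#include <linux/io.h>

/* Program translation-table entry 0 with the bus address of the host DMA buffer. */
static void set_txs_translation(void __iomem *cra, dma_addr_t dma_handle)
{
    iowrite32(lower_32_bits(dma_handle), cra + 0x1000);  /* entry 0: lower address bits,
                                                            low-order control bits left at 0 */
    iowrite32(upper_32_bits(dma_handle), cra + 0x1004);  /* entry 0: upper address bits
                                                            (0 for 32-bit addressing) */
}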

 

Thanks, Bob.
Altera_Forum

It is certainly much easier if you use a single BAR to access all Avalon slaves, and make the base of the addressed area include Avalon address zero. 

Basically the PCIe slave block removes the high address bits from the PCIe address and then (effectively) substitutes a different (fixed for each BAR) set of high address bits. The SOPC Builder code had some strange restrictions on where BARs could point; Qsys may have similar ones. 

I suspect they've tried to make it 'simple' and only succeeded in making it confusing! 

 

The Nios CPU (without an MMU) uses the high address bit to mean 'cache bypass', so it can only generate 31-bit addresses (including the address bits that convert to byte enables). 
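As an illustration of that convention (a sketch assuming the Nios II HAL and no MMU): alt_remap_uncached() just returns an alias of a pointer with bit 31 set, so accesses through it bypass the data cache while still decoding to the same Avalon address.

#include <stdint.h>
#include <sys/alt_cache.h>   /* Nios II HAL: alt_remap_uncached() */

static uint32_t shared_flag;   /* ordinary variable in on-chip memory */

/* Returns an alias of shared_flag with address bit 31 set; loads/stores
 * through it bypass the data cache but reach the same Avalon address. */
volatile uint32_t *uncached_flag(void)
{
    return (volatile uint32_t *)alt_remap_uncached(&shared_flag, sizeof(shared_flag));
}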

 

The PCIe master interface is rather horrid - the requirements don't really match those of an Avalon slave. 

For single-cycle PIO requests I'd be tempted to write an Avalon slave that can latch the required 64-bit PCIe address and data and then be told to perform a single master transfer, then poll the slave interface for when the request finishes. 

(A bit like a very degenerate DMA controller.) 

For longer transfers, a DMA controller that can read Avalon data and then burst-write to the PCIe slave (or burst-read the PCIe slave and write to Avalon addresses) would be useful, with the Nios CPU polling for completions and managing any request queue. 

But I can't immediately see how to use any of the existing DMA controllers for that purpose.
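Driving a latch-and-go slave like that from the Nios could look something like this sketch; the register map below is entirely made up to illustrate the idea, since the block would be custom RTL:

#include <stdint.h>
#include <io.h>

/* Hypothetical register map for the 'latch and go' slave described above:
 *   0x00 PCIe address low, 0x04 PCIe address high, 0x08 write data,
 *   0x0C control/status (bit0 = start write, bit31 = busy), 0x10 read data */
#define PIO_BRIDGE_BASE 0x00030000u   /* placeholder Qsys address */

static void pcie_pio_write32(uint64_t pcie_addr, uint32_t data)
{
    IOWR_32DIRECT(PIO_BRIDGE_BASE, 0x00, (uint32_t)pcie_addr);
    IOWR_32DIRECT(PIO_BRIDGE_BASE, 0x04, (uint32_t)(pcie_addr >> 32));
    IOWR_32DIRECT(PIO_BRIDGE_BASE, 0x08, data);
    IOWR_32DIRECT(PIO_BRIDGE_BASE, 0x0C, 0x1);                      /* start the write */
    while (IORD_32DIRECT(PIO_BRIDGE_BASE, 0x0C) & 0x80000000u)      /* poll until done */
        ;
}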
Altera_Forum

 

--- Quote Start ---
It is certainly much easier if you use a single BAR to access all Avalon slaves, and make the base of the addressed area include Avalon address zero. [...]
--- Quote End ---

 

 

OK, thanks DSL. Since I set BAR1 up to start at 0x00010000 (the scratchpad IMEM), if offsets from the primary decode of BAR1 are used then I may need to adjust the PCIe references to be BAR1 + 0 to reach the first scratchpad IMEM location, rather than using 0x00010000, which is the address the NIOS master sees.

 

One other thing: when the NIOS reads from the RC (system memory) over the Gen1 x1 link, the time from the read request on the link to the read completion is a whopping ~900 ns. Since the ARM system (the RC) runs its DDR at 400 MHz, I'm trying to figure out where those ~360 clock cycles went.

 

There are some theories:
1) The link is going into L0s or some other low-power state - but I don't see gaps of more than a few microseconds, so I doubt this.
2) The read completion is queued behind posted writes - maybe, but I see the same kind of latency when there are no posted writes.
3) The read completion is waiting for some credit/resource at the FPGA PCIe endpoint?

 

Any other ideas ? Thanks Bob.
Altera_Forum

900ns wouldn't surprise me. 

Single transfers into the FPGA are similarly slow (IIRC 600-700 ns when running the FPGA at 100 MHz). 

PCIe is a high-throughput high-latency 'bus' and isn't really suitable for PIO accesses. 

I've not timed transfers into other PCIe slaves to see how slow they are, but they won't be fast. 

The elapsed times are slow because PCIe is a comms protocol, not a bus protocol. 

 

So even for relatively low throughput you need a DMA controller to request single PCIe transfers of up to 128 bytes (typically the limit for a single transfer).
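As a sketch of that (the device name "/dev/dma_0" and TXS_BASE are placeholders for whatever the BSP/Qsys generate), the HAL DMA channel API for the plain DMA controller could be used along these lines:

#include <stdint.h>
#include <sys/alt_dma.h>   /* Nios II HAL DMA channel API (altera_avalon_dma) */

#define TXS_BASE 0x10000000u          /* placeholder: TXS slave base from system.h */

static volatile int dma_done;
static void dma_done_cb(void *handle, void *data) { dma_done = 1; }

/* Copy 'len' bytes (e.g. up to the 128-byte payload limit mentioned above) from
 * an on-chip buffer into the TXS window, so the bridge emits PCIe memory writes. */
int dma_to_host(const void *src, uint32_t txs_offset, uint32_t len)
{
    alt_dma_txchan tx = alt_dma_txchan_open("/dev/dma_0");   /* placeholder device name */
    alt_dma_rxchan rx = alt_dma_rxchan_open("/dev/dma_0");
    if (!tx || !rx)
        return -1;

    dma_done = 0;
    alt_dma_txchan_send(tx, src, len, NULL, NULL);                      /* read (source) side */
    alt_dma_rxchan_prepare(rx, (void *)(TXS_BASE + txs_offset), len,    /* write (dest) side  */
                           dma_done_cb, NULL);
    while (!dma_done)
        ;                                                               /* wait for completion */
    return 0;
}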
Altera_Forum

 

--- Quote Start ---
900ns wouldn't surprise me. [...]
--- Quote End ---

 

 

DSL, sounds like I'm in trouble ...  

I tried a 32-bit word read from the other direction (the RC reading FPGA memory) and got a similar result: around 900 ns from when the read request is received to the start of the read completion. 

 

In addition, our Linux device-driver guy suggested timing an RC config read. I timed a config read of the Command register and it came in at roughly 840 ns. In that case there is no interaction with the FPGA fabric or memory at all, just a register read. I still can't explain this, since the FPGA and the ARM SoC are running different IP (though perhaps the same ref_clk) and the results are about the same. Why am I in trouble? I am trying to test for a race condition, and for that I need the window between the read and its completion to be small enough to expose a race involving PCIe ordering.

 

I will need to see if there is a work around. 

 

Best Regards, Bob
Altera_Forum

Possibly stall/rerun the PCIe transfer for long enough to set up the race condition? 

 

PCIe will never be low latency. A PCIe request is a framed packet (TLP) containing the address, length, etc. and any data; this has to be decoded and verified before being actioned, and then the completion packet has to be generated. All of this takes time. 

 

To efficiently use PCIe the whole logical interface has to be arranged to use DMA wherever possible and to minimise the number of PIO reads.
Altera_Forum

 

--- Quote Start ---
Possibly stall/rerun the PCIe transfer for long enough to set up the race condition? [...]
--- Quote End ---

 

 

Thanks, DSL. 

 

I'm not really interested in throughput on the PCIe link. I am interested in exercising a bridge inside an SoC that has PCIe data passing through it. The producer/consumer test is used to stress the PCIe ordering rules in the bridge; in the case I am looking at, posted writes to my FPGA endpoint memory must stay ahead of read completions, where the read was initiated by the FPGA and a returned value of '1' indicates that the data in FPGA memory is valid. Since I see a ~900 ns read-to-completion delay, the only way my test will mean anything is if the posted writes are backed up in the bridge when the read whose completion I am watching arrives at the bridge.

 

So I assume I can't do much about the ~900 ns read latency. I will look at the PCIe compiler settings to minimize the flow-control credits for the write buffer; I hope that will put back-pressure on the bridge I am testing and cover the ~900 ns of read latency.

 

Thanks for your input. 

 

(BTW: I was at Hemel Hempstead for 18 months with Marconi/GE and really liked it.)
Altera_Forum

After some discussion, I don't believe read latency is an issue as long as the read requests arrive at the RC at random times. That means the read of the flag will, at some point, occur just after the system core has set the flag. I still need to get posted writes into the bridge at the same time as the read completion to check for PCIe ordering-rule violations. The device-driver guy said I need to add a wmb() between the posted writes to the endpoint and the setting of the flag in system memory.
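The producer side would then look something like this sketch (illustrative only; 'ep_mem' is the ioremap'ed BAR into FPGA memory, and 'flag' lives in the coherent DMA buffer in system memory that the FPGA polls):

#include <linux/io.h>
#include <linux/types.h>

/* Producer (host) side: fill the endpoint memory, then publish the flag. */
static void produce(void __iomem *ep_mem, u32 *flag, const u32 *payload, int words)
{
    int i;

    for (i = 0; i < words; i++)
        iowrite32(payload[i], ep_mem + 4 * i);   /* posted writes to FPGA memory */

    wmb();        /* keep the flag store ordered after the posted data writes */
    *flag = 1;    /* data-complete flag the FPGA consumer is polling */
}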
Altera_Forum

Hi all, 

 

- I already have a design in which the NIOS II communicates with an x86 processor using shared memory over the PCIe interface. 

- In the new design my aim is to move the shared memory to the x86 processor's DDR memory instead of the FPGA SSRAM. 

- I am aware this will require some address translation logic in the fabric. 

- I have learned that the "txs" port of the PCIe IP core is used to access host memory, but I want to know how that port reaches the DDR memory or other internal memory of the x86 processor. 

- I would also like to know how the DDR memory in the x86 processor is used. 

- I am interested to know whether someone has already achieved something similar, and whether it is possible to get hold of a reference design for this or a related configuration to start with.
Altera_Forum

 

--- Quote Start ---
Hi all, I already have a design in which the NIOS II communicates with an x86 processor using shared memory over the PCIe interface. [...]
--- Quote End ---

 

 

Hi Varun, 

 

This link will take you to various Altera reference designs for PCIe: //www.altera.com/products/reference-designs/ip/interface/m-pci-express-refdesigns.html 

 

The TXS port is an Avalon-MM slave port used to generate outbound PCIe read and write requests from the endpoint to the RC. 

 

I am not familiar with "DDR memory in the x86 processor", but the host system will have a memory map of the memory/register space that is accessible through the PCIe RC. I assume you want a DMA operation from the EP to the host DDR memory.

 

Regarding translations, the PCIe IP / Avalon-MM bridge has local translation tables that can either be fixed (static translation) or configured at run time by the NIOS II core. The translations can map a 32-bit Avalon-MM address to either a 32-bit or a 64-bit PCIe address.

 

There are other reference designs, such as the Gen3 x8 design with DMA, that use a 64-bit Avalon-MM address mapped directly to a 64-bit PCIe address without any translation.

 

Best Regards, Bob
Altera_Forum

Hi Bob, 

 

Thanks for your reply. 

 

I will check this and get back to you. 

 

Thanks & Regards, 

Varun
Altera_Forum

Hi Bob, 

 

I have started to create the design that accesses the x86 (Atom) processor's DDR3 memory from the Nios II through the PCIe interface using the "txs" port, but I need some help from your side. 

- I went through the Altera PCIe IP core user guide, and the internal operation of the "txs" port is not very clearly documented. Can you point me to a specific document on how the PCIe IP "txs" port works? 

- What physical address should be assigned to the "txs" slave port in the PCIe hard IP? 

- As I understand it, the hardware (Qsys) side connection is only from the Nios II to the PCIe hard IP "txs" port; everything else is done on the software side, that is: 

- How is the x86 (Atom) processor's DDR3 memory accessed in software?

 

Is it possible to share a software reference for accessing the x86 (Atom) processor memory using the "txs" port of the PCIe hard IP? 

 

Thanks & Regards, 

Varun
Altera_Forum

Hi Bob, 

 

I have created a hardware design with the Nios II connected to the x86 via the PCIe hard IP to access the x86 memory. 

I would like to know how to access the txs port in software. 

Could you please share a software example for the Nios II SDK if you have one?

 

Thanks & Regards, 

Varun