Re: Ethernet roundtrip time with NicheStack & Nios-II

Altera_Forum · ‎05-25-2012

Hi,

We are using the TSE IP with NicheStack and MicroC OS-II on Nios-II. The FPGA runs a simple UDP host that echoes every packet it gets from the client. The client is a PC running a real-time OS, connected to the FPGA board via a direct crossover Ethernet cable less than 5 ft long. Auto-negotiation sets the speed to 100 Mb/s. The UDP payload is only 128 bytes. Nios-II is running at 50 MHz on a Cyclone IV. The SOPC system has an SGDMA.

Neither side has any other processing load. The board is running in tethered mode (level 1 debugging on Nios-II), as we are using OpenCore licenses.

We are seeing rather long roundtrip times (tens of ms) for each packet going PC-->FPGA-->PC. I am looking at AN440 and "Ethernet Acceleration Design", but was wondering if there is something I may be missing here.

What is a typical roundtrip time for a point-to-point, real-time (both sides), closed Ethernet environment like ours?

-swguy

Altera_Forum · ‎05-29-2012

Tens of ms seems like a lot. A NicheStack/Nios II system isn't very fast, but it should be more than capable of sending more than 10 packets a second. What Ethernet hardware are you using? The driver could be using a polling mode instead of interrupts, which could explain the long delay before it gets a packet.

You could also use a profiler to see where the CPU is spending some time.

Other things to look into that can affect speed are the RAM interface, CPU cache (or lack of) and optimizations used during compiling (going from -O0 to -O2 has a significant impact).

Altera_Forum · ‎05-29-2012

I run NicheStack with MicroC OS-II on a 100 MHz Nios-II/f with cache on a Cyclone III FPGA and I get these figures:

- UDP (ping) : less than 1ms between request and reply, measured on the wire

- TCP : less than 5ms between request and reply, mainly due to application; actually the delay due to TCP Stack and OS is limited to about 2ms

As Daixiwen pointed out, you get major improvements with -O2 build optimization, fast Nios core with cache and higher cpu frequency.

Anyway, even with a basic system like yours (50 MHz Nios, possibly no cache, debug build configuration without optimization) IIRC you should achieve a UDP round trip time far below 50 ms.
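For scale, here is a back-of-envelope sketch of how long a UDP datagram spends on the wire. It assumes plain IPv4 with no options, no VLAN tag, and counts the preamble but not the inter-frame gap; the function name `udp_wire_time_ns` is made up for this illustration.

```c
#include <stdint.h>

/* Per-frame overhead around the UDP payload, in bytes. */
enum {
    PREAMBLE_SFD  = 8,   /* preamble + start-of-frame delimiter */
    ETH_HEADER    = 14,  /* destination MAC, source MAC, EtherType */
    IPV4_HEADER   = 20,  /* assumes no IP options */
    UDP_HEADER    = 8,
    ETH_FCS       = 4,
    ETH_MIN_FRAME = 64   /* minimum frame size, header through FCS */
};

/* Nanoseconds needed to serialize one UDP datagram at link_mbps. */
uint32_t udp_wire_time_ns(uint32_t payload_bytes, uint32_t link_mbps)
{
    uint32_t frame = ETH_HEADER + IPV4_HEADER + UDP_HEADER
                   + payload_bytes + ETH_FCS;
    if (frame < ETH_MIN_FRAME)
        frame = ETH_MIN_FRAME;            /* short frames are padded */
    uint32_t bits = (frame + PREAMBLE_SFD) * 8u;
    return bits * 1000u / link_mbps;      /* 1 Mb/s is 1 bit per microsecond */
}
```

With a 128-byte payload at 100 Mb/s this gives about 14.6 us per direction, so the cable itself accounts for only a tiny fraction of a round trip measured in milliseconds; the rest is spent in the stack, driver and application on both ends.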

With UDP do you mean a simple "ping" command or are you implementing a UDP server in your application?

Altera_Forum · ‎05-29-2012

Thanks Cris72 and Daixiwen.

The UDP setup: the PC with an RTOS is a very simple UDP client which sends UDP packets with a 128-byte payload. The Nios-II app is a modified UDP server based on the "Simple Socket Server" TCP template. The current round-trip time, as measured on the PC client side (using RTOS APIs), is about 30-40 ms. The board is the "Cyclone IV GX Transceiver Starter Kit" with the original SOPC.

SOPC screenshot is attached. One of the tutorials mentioned using "sdram" for "linker section names". I noticed that the board does not have "sdram"; it has "onchip_ram" and "ssram", to which the CPU instruction/data masters are connected. Could this be a memory speed issue caused by this arrangement?

Will set -O2 optimization. Will also increase CPU clk and look into CPU profiling.

Any idea how to check if the TSE IP driver is interrupt-driven or is polling?

We are trying to get round-trip time in sub-millisecond range.

-swguy

Altera_Forum · ‎05-30-2012

If you are using the TSE driver with ucOS and the Interniche TCP/IP stack (the default with the Nios IDS) then it is using interrupts.

Are you measuring the delay for several UDP packets or just the first one? The round-trip time for the first UDP packet may be longer because you also need an ARP request and reply. You could also use a sniffer such as Wireshark to see what is happening on the Ethernet level, but I think profiling should give you also a lot of information.

Altera_Forum · ‎05-30-2012

Sending 10,000 packets for the test. The first 500 packets are not used in the stats. We are using another RTOS on the PC, and the RTOS uses an Ethernet card which Windows does not "see". Wireshark's media-specific capturing does not list the RTOS we use, but will try.

Incidentally, the ref. design binaries for "Accelerating Ethernet Performance", when run on the Stratix-II Nios-II development board, show a roundtrip time of under a millisecond. The design does have a number of optimizations, including checksum in hardware. The document is old as the board is not supported anymore, but the pointers do help nevertheless.
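A minimal sketch of that kind of measurement bookkeeping, assuming the round-trip times have already been collected into an array of microsecond values. The type `rtt_stats` and the function `rtt_summarize` are hypothetical names for illustration, not part of any RTOS or NicheStack API.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t min_us;
    uint32_t max_us;
    uint32_t mean_us;
    size_t   count;    /* number of samples actually used */
} rtt_stats;

/* Summarize rtt_us[0..n-1], ignoring the first 'warmup' samples
 * (ARP resolution, cold caches and similar start-up effects). */
rtt_stats rtt_summarize(const uint32_t *rtt_us, size_t n, size_t warmup)
{
    rtt_stats s = {0, 0, 0, 0};
    uint64_t sum = 0;

    for (size_t i = warmup; i < n; i++) {
        uint32_t v = rtt_us[i];
        if (s.count == 0 || v < s.min_us) s.min_us = v;
        if (s.count == 0 || v > s.max_us) s.max_us = v;
        sum += v;
        s.count++;
    }
    if (s.count > 0)
        s.mean_us = (uint32_t)(sum / s.count);
    return s;
}
```

Keeping the minimum and maximum alongside the mean makes it easier to tell a uniformly slow path from occasional long outliers.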

thanks,

-swguy

Altera_Forum · ‎05-31-2012

If you have a manageable switch somewhere and can configure one of its port as a mirror, then you can plug it between your PC and the FPGA board, and monitor that port with a second PC and Wireshark. It can seem overkill but Ethernet sniffers are great debugging tools.

Checksums on 128 bytes payload shouldn't take that long, and besides it is optional with UDP. If possible, you could try and see if you can configure the RTOS on the PC so that it doesn't generate the checksum and check if it changes anything on the round trip time.

Altera_Forum · ‎06-06-2012

I could bring down the round-trip time significantly (~60-70%) by doing a few things specified in AN440 and from pointers by Daixiwen and Chris72. Thanks folks.

The round-trip time is still a few milliseconds. The Cyclone IV device and the eval board have limited resources, so I could not implement "Fast packet memory" and "Checksum in hardware". Surprisingly, with this board, anything higher than 50 MHz did not work either.

Now I have moved to the Nios-II eval board, Stratix-II edition. I think Altera no longer supports it, but I could find the source files for AN440 targeting this board. Managed to generate the SOF and ELF. Download worked as well. The Ethernet link gets established at 100 Mbps, full-duplex. No errors. However, the Nios-II console displays an IP addr of 0.0.0.0 during startup.

TCP/IP configuration is correct on the host system and I am using crossover cable for point-to-point ethernet. DHCP is off in BSP.

Any idea why the IP addr would get stuck at 0.0.0.0?

thanks,

-swguy

Altera_Forum · ‎06-07-2012

BTW, I am using MoreThanIP Marvell 10/100/1000 Ethernet daughter card on PROTO2 expansion connector of the board.

-swguy

Altera_Forum · ‎06-07-2012

A Cyclone IV is definitely able to make a Nios II system with Ethernet run at 100 MHz. Did you specify correct timing requirements for TimeQuest? You can use the timing advisor to check your project for configuration changes to optimize timing.

I also noticed that adding an Avalon MM Pipeline before the Nios CPU JTAG debug interface helps timing.

As for the 0.0.0.0 IP address, if you disabled the DHCP then you need to specify a static IP address. It is done in one of the source files, in a function called something like get_my_ip_address(). I don't remember the exact name, I haven't used the Interniche stack in a while.

Altera_Forum · ‎06-07-2012

A simple custom instruction (add the two 16-bit halves of 'a' onto 'b') will significantly improve the checksum time (because the Nios has no carry flag, so it can't do the 32-bit 'add with carry' usually used for a software checksum).

You probably also want one for byteswap (16bit and 32bit).
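As a rough software model of what such instructions would compute (a sketch only: `csum_add_halves`, `swap16`, `swap32` and `inet_checksum32` are illustrative names, and on real hardware the helper bodies would be replaced by the generated custom-instruction macros):

```c
#include <stddef.h>
#include <stdint.h>

/* Model of the custom instruction: add both 16-bit halves of a onto b. */
static inline uint32_t csum_add_halves(uint32_t a, uint32_t b)
{
    return b + (a & 0xFFFFu) + (a >> 16);
}

/* Models of the byte-swap instructions. */
static inline uint16_t swap16(uint16_t x)
{
    return (uint16_t)((x << 8) | (x >> 8));
}

static inline uint32_t swap32(uint32_t x)
{
    return ((x & 0x000000FFu) << 24) | ((x & 0x0000FF00u) << 8) |
           ((x >> 8) & 0x0000FF00u)  | (x >> 24);
}

/* Ones' complement checksum over 'words' 32-bit words.
 * Each step adds at most 0x1FFFE, so the 32-bit accumulator
 * cannot overflow for any Ethernet-sized buffer. */
uint16_t inet_checksum32(const uint32_t *data, size_t words)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < words; i++)
        sum = csum_add_halves(data[i], sum);
    while (sum >> 16)                    /* fold carries back in */
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)~sum;
}
```

Because the carries are accumulated in the upper half of the 32-bit sum and folded once at the end, the inner loop needs no carry flag at all, which is the point of the suggestion. This sketch assumes the buffer is 32-bit aligned and a multiple of four bytes long; odd tails would need separate handling.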

Altera_Forum · ‎06-07-2012

I have provided static IP addr in one of the .h files. Will also try hardcoding in the get_ip_addr() call.

I am using the ref design "Nios-II Ethernet Acceleration" on the Nios-II eval, Stratix-II edition.

The ref. document for the Marvell 10/100/1000 PHY daughter card (Santa Cruz connector) says there is a layout issue which causes the PHY and the DDR clks to be tied together. The PHY clk is derived from the SOPC PLL @ 125 MHz, and the document says the DDR controller must be disabled if you want to run the Ethernet because of this clk issue. It did not make sense to me, as the ref design uses the DDR and the pre-canned binaries seem to work.

Has anyone dealt with this issue?

Also trying to use C2H for the checksum. The C2H documentation is old and the 10.1 sp2 EDS does not have C2H, so I think I must use a legacy EDS. Would C2H for the checksum be a better choice than a custom instruction?

thanks,

-swguy

Altera_Forum · ‎06-12-2012

Daixiwen,

Fixed the 0.0.0.0 IP issue. It required -DTSE_MY_SYSTEM and a .c file defining it. Also got the round-trip time under a millisecond by performing a few optimizations such as fast packet mem, deeper cache and the /f core at a higher clock frequency.

The link is auto-negotiating @ 100 Mbps. Any idea why it would not go to gigabit? Both sides are capable of gigabit and the proper cable is used.

thanks,

-swguy

Altera_Forum · ‎06-12-2012

I don't see any reason why the PHY chip wouldn't negotiate a gigabit speed if the other side advertises it. The autonegotiation is independent from the PHY-TSE interface so it should go to gigabit whatever happens on the FPGA side, except if the PHY MDIO registers controlling the gigabit function are modified by the driver. AFAIK the Altera driver doesn't modify those registers.

Did you try to connect the board to something other than your PC with the embedded OS? Just to see if it isn't an incompatibility between this PC and the Marvell PHY?

The only other case where I saw a PHY negotiating a lower speed than what it was supposed to was when it was receiving a clock that was slightly out of spec, but I don't think it's the case here.

Altera_Forum · ‎06-12-2012

I tried the board with Ref design with a PC and it does establish gigabit link. The ref design has Altera's GUI for board test interface.

The ref design was modified for various optimizations and the Nios code was changed for the UDP server. Maybe I missed something in the BSP, or the TSE driver needs some tweaking?

-swguy

Altera_Forum · ‎06-13-2012

When you say "PC" is it a different PC than the one with embedded OS that you tried with your system? If yes can you try this other PC with your design? That will tell if the incompatibility is in your design or something on the PC with the embedded OS.

If you still see a difference between the two FPGA designs on the same PC, then look for differences between them, especially on the clock and reset signals to the PHY chip. The autonegotiation phase is done automatically by the PHY chip and shouldn't depend on much more from the TSE (except if you are disabling gigabit on purpose from the MDIO registers, but it shouldn't be the case if you are using an unmodified Altera TSE driver).

Altera_Forum · ‎06-13-2012

Yes, the difference is between the two FPGA designs on the same PC running the RTOS (same binary on the PC). So the PHY/board is capable of a gigabit link, but somehow cannot establish one with my design.

Good points for clk and reset signals on PHY - will look into it.

I am not manipulating MDIO at all, but that raises a good point: Can I deliberately tweak MDIO to go for gigabit, just in case? Is there a HAL API for this, if we know TSE's parameters from system.h?

-swguy

Altera_Forum · ‎06-14-2012

It's been a while since I experimented with the Marvell PHY. I remember it can be rather picky about its reset signal. It needs to be low for at least 10 clock cycles (as seen on its XTAL1 pin).

You can manipulate MDIO through the TSE registers. The best way to do it is to first let the driver initialize itself, and check in the Nios terminal that the PHY was properly detected. Then you know the MDIO interface is working and the correct PHY address has been set in the TSE.

Then you can use those two macros from <triple_speed_ethernet_regs.h> to read/write the MDIO registers:

IORD_ALTERA_TSEMAC_MDIO(base, mdio, reg_num)
IOWR_ALTERA_TSEMAC_MDIO(base, mdio, reg_num, data)

The Altera driver is using the MDIO1 bank in the TSE, so you need to set the mdio parameter to 1. base is the TSE base address, as defined in <system.h>.

The biggest problem with the Marvell PHY is that the datasheet isn't publicly available, and the MDIO registers for Gigabit control aren't standardized. You can nevertheless have a look at MDIO register 09, which is the gigabit control register. If bit 8 is 1 then the PHY will advertise 1000BASE-T half duplex, and if bit 9 is 1 then the PHY will advertise 1000BASE-T full duplex. The MDIO register 0A is the gigabit status register. These bits may be of interest to you:

- 15: 1 = MASTER/SLAVE configuration fault detected
- 14: 1 = local PHY is master, 0 = local PHY is slave
- 13: 1 = local receiver OK
- 12: 1 = remote receiver OK
- 11: 1 = link partner is capable of 1000BASE-T full duplex
- 10: 1 = link partner is capable of 1000BASE-T half duplex
- 7-0: idle error count since last read (no roll-over, stops at 0xff)
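A small sketch of how those bit positions could be decoded in C, assuming the raw 16-bit register values have already been read with the MDIO macros mentioned above. The struct, function and constant names here are made up for illustration and are not part of the Altera driver.

```c
#include <stdbool.h>
#include <stdint.h>

#define PHY_REG_GIG_CTRL    0x09   /* gigabit control register */
#define PHY_REG_GIG_STATUS  0x0A   /* gigabit status register */

#define GIG_CTRL_ADV_HALF   (1u << 8)   /* advertise 1000BASE-T half duplex */
#define GIG_CTRL_ADV_FULL   (1u << 9)   /* advertise 1000BASE-T full duplex */

typedef struct {
    bool    ms_config_fault;    /* bit 15 */
    bool    local_is_master;    /* bit 14 */
    bool    local_rx_ok;        /* bit 13 */
    bool    remote_rx_ok;       /* bit 12 */
    bool    partner_1000_full;  /* bit 11 */
    bool    partner_1000_half;  /* bit 10 */
    uint8_t idle_errors;        /* bits 7-0 */
} gig_status;

/* Split a raw register 0x0A value into its fields. */
gig_status decode_gig_status(uint16_t reg)
{
    gig_status s;
    s.ms_config_fault   = (reg >> 15) & 1u;
    s.local_is_master   = (reg >> 14) & 1u;
    s.local_rx_ok       = (reg >> 13) & 1u;
    s.remote_rx_ok      = (reg >> 12) & 1u;
    s.partner_1000_full = (reg >> 11) & 1u;
    s.partner_1000_half = (reg >> 10) & 1u;
    s.idle_errors       = (uint8_t)(reg & 0xFFu);
    return s;
}

/* Return a register 0x09 value with both gigabit modes advertised. */
uint16_t advertise_gigabit(uint16_t reg09)
{
    return (uint16_t)(reg09 | GIG_CTRL_ADV_HALF | GIG_CTRL_ADV_FULL);
}
```

On the target, the input would come from something like IORD_ALTERA_TSEMAC_MDIO(base, 1, 0x0A), and the result of advertise_gigabit() would be written back with IOWR_ALTERA_TSEMAC_MDIO(base, 1, 0x09, value). A changed advertisement normally only takes effect once auto-negotiation is restarted; whether this particular PHY needs anything beyond that is not something this sketch can confirm.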

Altera_Forum · ‎06-14-2012

thanks Daixiwen!