Altera TSE driver and example program for lwIP (1.3.2) - Page 3

Altera_Forum · ‎06-21-2010

After many requests and complaints about the lack of support and documentation for lwIP on the Altera TSE, I have developed a drop-in TSE driver and example program and made them available to the NIOS II community. This was done for NIOS II 8.1 SP0.01. I don't expect difficulty with version 9.x.

It targets the latest version of lwIP as of this post and includes a minimal program and an HTTP server based on the HTTP server in the lwIP contrib folder. The lwIP TSE driver uses the altera_avalon_tse driver and SGDMA as-is. There is a complete (as in 41-step) set of instructions for creating the project and example program. More information and the link to the driver are available here:

http://lwip.wikia.com/wiki/available_device_drivers#lwip_1.3.2

Please direct any questions, changes for NIOS II 9.1, or comments to this thread.

12-16-2010 update: This example works with NIOS Version 10.0 with some tweaks to the procedure for creating the project. Also, an lwIP 1.4 release candidate has been out for a while and it drops into this example (in place of 1.3) without changes.

Bill

Altera_Forum · ‎09-14-2010

Rapa,

Can you ping the board you're running lwIP on?

Bill

Altera_Forum · ‎09-14-2010

No, I cannot.

Altera_Forum · ‎09-14-2010

Run the Simple Socket Server example, setting a MAC address as it instructs you to, and try to ping the board.

Altera_Forum · ‎09-16-2010

Rapa!

You don't provide enough information to let someone help you. Please clarify:

1) Is your PHY detected correctly? Grab the debug output from the TSE driver (an example is shown in BillA's original post) and post it here.

2) Have you tried at 100M? 10M?

3) Do you see any packets from your board in the network (tcpdump is handy)? Any activity on TXD lines?

4) Are you receiving?

In any case, this is not an lwIP-specific problem, as Bill said. Both the lwIP example and Simple Socket Server should run as-is if the hardware is set up correctly. Probably something is wrong with the SOPC or top-level FPGA design (for example, I spent a lot of time trying to run my board at 1000M until I got 2 lines for the sdc file from bertronicom).

I would suggest using the Simple Socket Server template as a standard reference and, once you have it running (ping, DHCP and telnet server), switching to lwIP for better performance and a smaller footprint.

Igor

Altera_Forum · ‎09-16-2010

Many thanks for your comments, Igor!!!

Altera_Forum · ‎09-20-2010

Ims, thank you for your reply.

Here are my answers:

1. The PHY is detected correctly, exactly as shown in BillA's example.

2. At this stage I am always working at 100Mb. The board is connected to a 10/100 Mb switch.

3. I do not see any packets from the board on the network (using a TCP/IP sniffer), and there is no activity on the TxD line. But there is something strange. When I try to connect from the PC to the board (on the sniffer I see it as a broadcast), I receive the packet as a broadcast, decode it as a broadcast, and send the answer to broadcast, but this answer does not go out to the network. It happens at the ARP level.

4. Yes, I am receiving. I see it in the debug output and also on the Rx LED.

Altera_Forum · ‎09-21-2010

You could check your timing assignments for the Tx lines.

Altera_Forum · ‎09-21-2010

I don't get you. What do you mean?

Altera_Forum · ‎09-21-2010

The data signals and the clock signal must arrive at deterministic times relative to each other; if your data arrives out of time, it could be lost. You can set your timing assignments in Quartus.

Altera_Forum · ‎09-22-2010

I have downloaded the latest version of lwIP, v1.4 rc1. It compiled OK with no modifications from the original version.

I am sending raw images to the host by UDP packets.

Quartus II v8.0 sp1

Cyclone III 3c40

Small MAC 100 Mbps

The maximum throughput achieved when sending raw images (using -O3 optimizations) is 32 Mbps.

I haven't seen a speed increase with version v1.4rc1.

Bill: How can I assign these functions to run in RAM?

Igor: Where can I find LWIP_INLINE_IP_CHKSUM ?

Best regards and many thanks for your great help!!!

Altera_Forum · ‎09-22-2010

Alberto!

1) LWIP_INLINE_IP_CHKSUM is defined by default in lwIP/src/core/ipv4/ip.c, line 64. I guess the correct way to turn it on is to put #define LWIP_INLINE_IP_CHKSUM 1 in "lwipopts.h".

2) The patches I had to make to compile original lwIP_Nios_II_Example with lwIP_1.4.0rc1 are the following:

lwipopts.h: +#define NO_SYS_NO_TIMERS 1

main.c: +#include "lwip/tcp_impl.h"

3) To switch off UDP checksum:

lwipopts.h: #define CHECKSUM_GEN_UDP 0

4) Please note also that the benchmarks I posted relate to the benchmarking program I attached. It is intentionally very simplistic: it continuously sends the same preallocated pbuf without any processing, to give an upper-bound estimate for the stack itself.

5) Verify that the BSP is also compiled with -O3.
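
Collected in one place, the lwipopts.h settings mentioned in this post might look like the fragment below. It is only a sketch: the option names are the ones given above, and whether each one suits your build depends on your lwIP version and application. Note that UDP over IPv4 permits a zero checksum field, meaning none was computed, so switching generation off trades integrity checking for speed.

```c
/* lwipopts.h (fragment): a sketch collecting the options discussed above. */

/* Reported above as needed to build the example against lwIP 1.4.0rc1. */
#define NO_SYS_NO_TIMERS       1

/* Compute the IP header checksum inline while the header is built. */
#define LWIP_INLINE_IP_CHKSUM  1

/* Do not generate UDP checksums in software. */
#define CHECKSUM_GEN_UDP       0
```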

Best regards,

Igor

Altera_Forum · ‎09-22-2010

Alberto!

I have just run the benchmark from my post #35 (http://alteraforum.org/forum/showpost.php?p=100185&postcount=35) with a trivial simulation of "data processing". The results are as follows (100M network, DBM3C40 board, Nios @100MHz, worst case memory layout, full-mac, Quartus 10.0sp1):

Original code

Benchmark: 95.4 Mib/sec

Original + memset(_payload, ++_cnt, PAYLOAD_SIZE) just before udp_send()

Benchmark: 83.2 Mib/sec

Original + memmove(((char *)_payload)+1, _payload, PAYLOAD_SIZE-1) just before udp_send()

Benchmark: 45.4 Mib/sec

It gets closer to what you have, doesn't it? So it looks like you should optimize the complete program, not just the lwIP code. This may be hard work, but Bill gave a good roadmap. As for me, I'm not currently working on the lwIP part of my project because I'm still playing with UDPOFFLOAD. By the way, it streams very nicely on a 100M network: about 95Mib/sec with zero load on the Nios. At 1000M I observe about 750Mib/sec of traffic, but about 20% of packets are lost. I'm now trying to slow down the packet generator to see it in more realistic conditions...
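
To put the benchmark figures above in perspective, a throughput number can be compared with what the link can carry and converted into a CPU cycle budget per datagram. The sketch below does that arithmetic. It assumes the figures are read as 10^6 bits per second of UDP payload (the exact unit behind "Mib/sec" is an assumption here), a 1472-byte payload (also an assumption; the payload size of this benchmark is not stated in this post), and standard Ethernet/IPv4/UDP framing. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes each UDP/IPv4 datagram occupies on the wire beyond its payload:
 * preamble+SFD 8, Ethernet header 14, IPv4 header 20 (no options),
 * UDP header 8, FCS 4, inter-frame gap 12.
 * Minimum-frame padding for very small payloads is ignored. */
#define WIRE_OVERHEAD_BYTES 66u

/* Highest UDP payload rate (bit/s) a link can carry for a fixed payload size. */
double max_udp_payload_bps(double link_bps, uint32_t payload_bytes)
{
    assert(payload_bytes > 0);
    return link_bps * (double)payload_bytes
         / (double)(payload_bytes + WIRE_OVERHEAD_BYTES);
}

/* CPU cycles that elapse per datagram when sending at a given payload rate. */
double cycles_per_datagram(double cpu_hz, double payload_bps,
                           uint32_t payload_bytes)
{
    assert(payload_bps > 0.0);
    return cpu_hz * 8.0 * (double)payload_bytes / payload_bps;
}
```

Under these assumptions a 100M link tops out at about 95.7 for 1472-byte payloads, so the original code is essentially at wire speed, while the 45.4 result corresponds to roughly 25,900 cycles per datagram at 100 MHz against roughly 12,300 for the original.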

Best regards,

Igor

Altera_Forum · ‎09-22-2010

--- Quote Start ---

It gets closer to what you have, doesn't it? So it looks like you should optimize the complete program, not just the lwIP code. This may be hard work, but Bill gave a good roadmap. As for me, I'm not currently working on the lwIP part of my project because I'm still playing with UDPOFFLOAD. By the way, it streams very nicely on a 100M network: about 95Mib/sec with zero load on the Nios. At 1000M I observe about 750Mib/sec of traffic, but about 20% of packets are lost. I'm now trying to slow down the packet generator to see it in more realistic conditions...

--- Quote End ---

Igor, are these 95Mb and 750Mb figures with your offloading? I was in this ballpark with UDP_CHECKSUM off, in software only. Of course, the Cyclone III was doing nothing else.

Packet drop is caused by Windows. I had over 900Mb (virtually wire speed) in tests with timing done on the Cyclone end. Windows can't keep up without special drivers, but if you keep your bursts small (which you get if your UDP protocol has any sort of ACK/NAK checking), Windows can keep up. We transfer 250Mb+ on Windows without a special driver. Those who really need speed on Windows use "Packet Filter Drivers" (Google it). These prevent drops (although I don't know the practical speed limit) and significantly reduce CPU utilization. My understanding is that this removes one data copy, since the Packet Filter Driver gets the packet from the NIC driver before the Windows TCP/IP stack does. I believe Wireshark (WinPcap) is a Packet Filter Driver.
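
One way to keep bursts small is to cap the number of datagrams that may be outstanding before an acknowledgement arrives. The thread gives no details of the actual ACK/NAK protocol, so the following is only a generic sketch; the struct, the function names, and the use of a cumulative acknowledgement number are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sender-side state: at most `window` datagrams may be
 * sent but not yet acknowledged at any time. */
struct burst_limiter {
    uint32_t window;     /* maximum unacknowledged datagrams */
    uint32_t next_seq;   /* sequence number of the next datagram to send */
    uint32_t acked_seq;  /* every datagram below this number is acknowledged */
};

void burst_init(struct burst_limiter *b, uint32_t window)
{
    assert(b != NULL && window > 0);
    b->window = window;
    b->next_seq = 0;
    b->acked_seq = 0;
}

/* True while another datagram may be sent without exceeding the window.
 * Unsigned subtraction keeps this correct across sequence wrap-around. */
bool burst_can_send(const struct burst_limiter *b)
{
    return (b->next_seq - b->acked_seq) < b->window;
}

/* Record one transmission and return the sequence number it should carry. */
uint32_t burst_mark_sent(struct burst_limiter *b)
{
    assert(burst_can_send(b));
    return b->next_seq++;
}

/* Handle a cumulative ACK: the receiver has everything below ack_seq.
 * Stale or out-of-range acknowledgements are ignored. */
void burst_on_ack(struct burst_limiter *b, uint32_t ack_seq)
{
    uint32_t outstanding = b->next_seq - b->acked_seq;
    uint32_t advance = ack_seq - b->acked_seq;
    if (advance <= outstanding) {
        b->acked_seq = ack_seq;
    }
}
```

The sender would call burst_can_send() before each transmission and simply wait for an acknowledgement when it returns false, which bounds the burst the receiving PC has to absorb.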

By the way, the example I posted was done with as few changes to the Altera libraries (TSE/SGDMA/PHY) as possible and with no speed improvements. The product we use the Cyclone III in has custom versions of Altera's TSE, SGDMA, and PHY drivers, and its lwIP is further optimized beyond what is here and what I recommended. Even so, I believe the posted example outperforms Interniche and is smaller, which really was my goal in doing this. Besides, Altera decided to drop support for lwIP a couple of years ago, and I think lwIP is a viable choice for TCP/IP.

Note that lwIP 1.4 has an improved HTTP server with SSI/CGI support. I hope to upgrade this example to use that when 1.4's release is official.

Keep us posted - this has been very interesting.

Bill A

Altera_Forum · ‎09-22-2010

Hi BillA -

You mentioned that you were able to source packets at higher throughput with these modifications - has anyone benchmarked the improvement in sinking raw data into dram buffers using these mods?

I have an asymmetric application that requires high volume of input data with very small control frames sent out...

crayner

Altera_Forum · ‎09-22-2010

Hi Crayner,

Sorry, my application is high data throughput outbound only. I do use lwIP on another platform with high inbound data throughput and it does really well there. In fact we can use TCP (not UDP) and get all the bandwidth we require (500Mb+). Now, this is on a PowerPC so it's not your average embedded processor. :)

Bill A

Altera_Forum · ‎09-23-2010

--- Quote Start ---

Igor, are these 95Mb and 750Mb figures with your offloading? I was in this ballpark with UDP_CHECKSUM off, in software only. Of course, the Cyclone III was doing nothing else.

--- Quote End ---

Bill, yes, that was with the offloading example that Daixiwen advertised earlier in this thread. I like the concept of keeping high-speed traffic away from the Nios core and processing only control communications and slow data streams (e.g. temperatures) through lwIP. Compared to a purely software solution this should simplify Nios programming, but it requires more advanced HDL coding. I haven't chosen the way yet...

Thank you for the PC-side guidelines as well. As for lwIP, I think you are doing a great job keeping the lwIP port and example project for Nios up to date. InterNiche + uC/OS with its preemptive tasking is too cumbersome a combination for embedded applications, and I'm sure many people in the Nios community greatly appreciate your efforts.

Igor

Altera_Forum · ‎09-24-2010

Igor -

I spoke to RF at Altera on Weds AM - he not only wrote SSS but also the offload example. We touched on several things, on this subject I can share the following.

RF's take on this was that both examples could/should be used in tandem by developing a protocol in which data bursts passing through the bypass are preceded by a TCP/IP control packet specifying data type, destination queue, etc., allowing the Nios to set up a transfer. The packet checker would be modified to interface with a FIFO, backed by a streaming-input/memory-mapped-output SGDMA engine that would hammer the data into DRAM, with the FIFO acting as an elastic store to ensure that refresh cycles, etc. wouldn't drop data.

My take on the subject is slightly more cynical. Interniche has a product; by allowing people in the Altera community to use a performance-castrated version of it, they get small companies to design it into their embedded solutions. When those companies find that the performance is limited, they buy the subscription version that includes performance enhancements like the ones BillA made to the lwIP stack. Since the performance metrics are published for all to see, people know what they are getting into before they design it in. If you don't find the metrics in the 40,000 pages of documentation Altera published, it is not their fault.

I had him on the phone for an hour - I can tell you this: At the end of the conversation I had decided that it would be far quicker and easier to pitch the Interniche stack and implement LWIP in my design than to implement the HW speed up advocated by RF - if I find that in my embedded system that the performance isn't there then I will take the HW route.

my best to all

-crayner

Altera_Forum · ‎09-25-2010

Hi guys, I've definitively achieved 58Mbps sending images by UDP, which for me means a data transfer rate of 41fps.

lwip 1.4

Optimization -O3

DMA instead of memcpy calls

TSE-SGDMA

Small MAC 10/100Mb

Nios /f @ 100MHz

Cyclone III 3c40

Quartus II v8.0sp1
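
For reference, the two figures above imply an average amount of data per frame. The helper below is a sketch of that arithmetic only; it assumes 58Mbps means 58e6 bits per second and that the rate counts image data alone, neither of which is stated in the post. The function name is made up for illustration.

```c
#include <assert.h>

/* Average bytes per frame implied by a sustained bit rate and a frame rate.
 * bits_per_second is taken in decimal units (58 Mbps as 58e6 bit/s),
 * which is an assumption about how the figure was measured. */
double bytes_per_frame(double bits_per_second, double frames_per_second)
{
    assert(frames_per_second > 0.0);
    return bits_per_second / 8.0 / frames_per_second;
}
```

Calling bytes_per_frame(58e6, 41.0) gives a little under 177,000 bytes per frame under those assumptions.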

Altera_Forum · ‎10-04-2010

--- Quote Start ---

Hi guys, I've definitively achieved 58Mbps sending images by UDP, which for me means a data transfer rate of 41fps.

lwip 1.4

Optimization -O3

DMA instead of memcpy calls

TSE-SGDMA

Small MAC 10/100Mb

Nios /f @ 100MHz

Cyclone III 3c40

Quartus II v8.0sp1

--- Quote End ---

Alberto!

You can probably achieve better speed if you avoid memory copies completely. I found that ROM-type pbufs are very handy for this. If you preallocate a single ROM-type pbuf, then you only need to set its payload pointer, len and tot_len fields for each packet transfer. On my board this approach gives 210M (decimal M) throughput for transfers from external SDRAM.

Further, to avoid memory allocations inside lwIP, you can preallocate a RAM-type pbuf for headers and a ROM-type pbuf for payload and chain them together. This allows 250M throughput.

Finally, you can set up the DMA that delivers your data into memory to leave space for a pbuf structure and headers in front of the payload, and use this space for a single RAM-type pbuf. This allows 280M on my board.
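
The first two steps above can be sketched as follows. This is not the attached benchmark and it does not use the real lwIP types: struct sketch_pbuf is a stand-in that mirrors only the pbuf fields named in this post plus the chain pointer, and both function names are hypothetical. In real lwIP code the buffers would come from pbuf_alloc() with PBUF_ROM or PBUF_RAM and be linked with pbuf_cat() or pbuf_chain(); those calls are left out so the sketch compiles on its own. The third step (reserving headroom in the DMA buffer) is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the few lwIP `struct pbuf` fields used below.
 * This is NOT the real lwIP type. */
struct sketch_pbuf {
    struct sketch_pbuf *next;  /* next buffer in the chain, or NULL */
    void     *payload;         /* start of this buffer's data */
    uint16_t  tot_len;         /* length of this buffer plus all that follow */
    uint16_t  len;             /* length of this buffer only */
};

/* Step 1: point a preallocated payload pbuf (the "ROM-type" one) at data
 * that already sits in memory. Nothing is copied. The payload pbuf is
 * assumed to be the last element of its chain. */
void sketch_retarget(struct sketch_pbuf *pl, void *data, uint16_t len)
{
    assert(pl != NULL && data != NULL);
    pl->next    = NULL;
    pl->payload = data;
    pl->len     = len;
    pl->tot_len = len;
}

/* Step 2: link a preallocated header pbuf (the "RAM-type" one) in front of
 * the payload pbuf and refresh its total length. This must be repeated
 * whenever the payload length changes. */
void sketch_chain(struct sketch_pbuf *hdr, struct sketch_pbuf *pl)
{
    assert(hdr != NULL && pl != NULL);
    assert((uint32_t)hdr->len + pl->tot_len <= UINT16_MAX);
    hdr->next    = pl;
    hdr->tot_len = (uint16_t)(hdr->len + pl->tot_len);
}
```

Per packet, the sender would call sketch_retarget() with the address of the next block of image data, call sketch_chain(), and hand the header pbuf to the send routine, so no allocation or copy happens on the fast path.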

See the attached file for an example. The benchmark, as before, runs with Bill's original example upgraded to lwIP1.4.0_rc1. The UDP payload size is 1472 bytes, with no checksum.

In my project I use hardware UDP offloading for the main data stream. (In fact, thanks to the freely available IP cores from the offloading example (#36 (http://alteraforum.org/forum/showpost.php?p=100189&postcount=36)), the hardware solution is even easier to implement.) All outgoing packets are copied by a dedicated SGDMA channel into a large circular buffer in external SDRAM for retransmission upon the receiver's request. I haven't implemented the top-level protocol yet, and currently just retransmit all the packets from the buffer through lwIP to another UDP port for benchmarking.

Best regards,

Igor

Altera_Forum · ‎10-04-2010

Igor, are you also using udp_sendto_if? If not, try it - you'll find it makes things faster still.

Bill

Altera_Forum · ‎11-07-2010

Hi guys. I have finally found the problem, and why I could run neither "simple_server..." nor BillA's example. Both were configured to work with GMII/MII interfaces, while my evaluation board is configured and wired to work with an RGMII interface.

So if anyone runs into the same mistake, now you know the reason, and the solution is here:

http://www.altera.com/support/kdb/solutions/rd11122009_293.html

P.S. From my side the problem is solved.

Thank you, all.

Slava