Altera TSE driver and example program for lwIP (1.3.2)

Altera_Forum
Honored Contributor II
39,670 Views

After many, many requests and complaints about the lack of support and/or documentation for lwIP on the Altera TSE, I have developed a drop-in TSE driver and example program and made them available to the NIOS II community. This was done for NIOS II 8.1 SP0.01. I don't expect difficulty with version 9.x.

 

This is for the latest version of lwIP as of this post. It provides a minimal program and an HTTP server based on the HTTP server in the lwIP contrib folder. The lwIP TSE driver uses the altera_avalon_tse driver and SGDMA as-is. There is a complete (as in 41-step) set of instructions for creating the project and example program. More information and the link to the driver are available here:

 

http://lwip.wikia.com/wiki/available_device_drivers#lwip_1.3.2 

 

Please direct any questions, changes for NIOS II 9.1, or comments to this thread. 

 

12-16-2010 update: This example works with NIOS Version 10.0 with some tweaks to the procedure to create the project. Also, a lwIP 1.4 release candidate has been out for a while and it drops into this example (in place of 1.3) without changes. 

 

Bill

Msg06484

Hi, I downloaded the latest lwIP. It somewhat works, in that I have to start two cmd shells and send ping requests from both so that the second will "unbuffer" the first. I tried replacing err_t tse_mac_raw_send(struct netif *netif, struct pbuf *pkt) with yours above, but the build fails with:

ALT_LINK_ERROR("alt_remap_uncached() is not available because Nios II Gen2 cores with data caches don't support mixing cacheable and uncacheable data on the same line.");

from alt_remap_uncached.c.

I tried changing it back to use

ActualData = (void*)(((alt_u32)data));

which does compile, but the board does not actually reply to the ping.

Below is the tse_mac_raw_send() found in lwip_tse_mac.c of the latest lwIP.

I am not sure if this is causing the issue. It seems like the packet comes in and gets buffered until the next packet arrives, which is why having two pings running makes everything appear to work.

Does this make sense? I may be completely wrong; any help is appreciated.

err_t tse_mac_raw_send_orig (struct netif *netif, struct pbuf *pkt)
{
    int                tx_length;
    unsigned           len;
    struct pbuf        *p;
    alt_u32            *data;
    tse_mac_trans_info *mi;
    lwip_tse_info      *tse_ptr;
    struct ethernetif  *ethernetif;
    alt_u32            *ActualData;

    /* Intermediate buffer used for a temporary copy of frames that cannot be directly DMA'ed */
    char buf2[1560];

    ethernetif = netif->state;
    tse_ptr = ethernetif->tse_info;
    mi = &tse_ptr->mi;

    for(p = pkt; p != NULL; p = p->next)
    {
        data = p->payload;
        len = p->len;

        // just in case we have an unaligned buffer, this should never occur
        if(((unsigned long)data & 0x03) != 0)
        {
            /*
             * Copy data to temporary buffer <buf2>. This is done because of alignment
             * issues. The SGDMA cannot copy the data directly from (data + ETH_PAD_SIZE)
             * because it needs a 32-bit aligned address space.
             */
            memcpy(buf2,data,len);
            data = (alt_u32 *)buf2;
        }

        // originally remapped to uncached memory; here the cached pointer is used as-is
        ActualData = (void*)(((alt_u32)data));

        /* Write data to Tx FIFO using the DMA */
        alt_avalon_sgdma_construct_mem_to_stream_desc(
            (alt_sgdma_descriptor *) &tse_ptr->desc[ALTERA_TSE_FIRST_TX_SGDMA_DESC_OFST], // descriptor I want to work with
            (alt_sgdma_descriptor *) &tse_ptr->desc[ALTERA_TSE_SECOND_TX_SGDMA_DESC_OFST],// pointer to "next"
            (alt_u32*)ActualData,                    // starting read address
            (len),                                   // # bytes
            0,                                       // don't read from constant address
            p == pkt,                                // generate sop
            p->next == NULL,                         // generate endofpacket signal
            0);                                      // atlantic channel (don't know/don't care: set to 0)

        tx_length = tse_mac_sTxWrite(mi,&tse_ptr->desc[ALTERA_TSE_FIRST_TX_SGDMA_DESC_OFST]);

        if (tx_length != p->len)
            dprintf(("failed to send all bytes, sent %d out of %d\r\n", tx_length, p->len));

        ethernetif->bytes_sent += tx_length;
    }

    LINK_STATS_INC(link.xmit);

    return ERR_OK;
}
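
One possible explanation for the "stuck until the next packet" symptom, offered as an assumption rather than a diagnosis: with alt_remap_uncached() unavailable, the frame may still be sitting in the Nios II data cache when the SGDMA reads memory. A minimal sketch of flushing the payload before handing it to the DMA follows. alt_dcache_flush() is the HAL call from sys/alt_cache.h; tx_prepare_for_dma() and the host-side stand-in are hypothetical names used only for illustration, and the __NIOS2__ check assumes the usual nios2-elf-gcc predefined macro.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__NIOS2__)
#include <sys/alt_cache.h>          /* HAL: alt_dcache_flush() */
#else
/* Host-side stand-in (assumption) so the sketch builds off-target. */
static size_t flushed_bytes;
static void alt_dcache_flush(void *start, uint32_t len)
{
    (void)start;
    flushed_bytes += len;
}
#endif

/* Hypothetical helper: write back any cached payload bytes so the SGDMA
 * reads current data from memory, then return the pointer to use as the
 * descriptor's read address. */
static uint32_t *tx_prepare_for_dma(void *payload, uint32_t len)
{
    alt_dcache_flush(payload, len);
    return (uint32_t *)payload;
}
```

In the loop above this would take the place of the plain cast, e.g. ActualData = tx_prepare_for_dma(data, len);. Whether the descriptors themselves also need uncached placement is not covered by this sketch.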

Altera_Forum

Bill, 

 

--- Quote Start ---  

 

First, pbuf payload is guaranteed aligned. I use a custom version of this driver (more optimized, with improved SGDMA and PHY handling plus some bug fixes in the Altera code); I took out the alignment check, added an assertion, and it never asserts. If user code sets payload (a bad practice) then it's not guaranteed, but pbuf_alloc aligns payload. 

 

--- Quote End ---  

 

As we all discovered together, under some circumstances an innocent tcp_write(pcb, long_help_message, len, NOCOPY) can have the effect of “user code setting payload”. How could I expect this in advance? 

 

 

--- Quote Start ---  

An application that uses UDP for high data rates is probably using a custom client in which case the payload size could be enforced. I use a zero-copy UDP high speed reliable protocol (the hardware writes in the UDP payload and checksum) and in this case the payload is large and always aligned. 

 

--- Quote End ---  

Yes, for UDP, alignment is under the developer’s control. I also had to reinvent some kind of “zero-copy reliable UDP”. I use lwIP for retransmitting only – this makes life easier. Actually, if you cook UDP packets in hardware, you can retransmit a packet without lwIP intervention at all – just instruct one SGDMA to ship out exactly what had been delivered into memory by another SGDMA. 

 

--- Quote Start ---  

You only need to unwind if pkt->next isn't NULL. An unchained pbuf can never have a payload size less than 4 because of the IP header. I'd put back the pkt->next != NULL test before the for loop. 

 

--- Quote End ---  

Agree. 

 

 

--- Quote Start ---  

Minor:

    // Unwind pbuf chains
    if (pkt->tot_len > sizeof(buf2)) {
        // no space for unwinding; drop the packet
        return ERR_OK;
    }

Can't occur. lwIP will never chain more than the MTU. It can't because 802.11 cannot support it. 

 

--- Quote End ---  

Would be better to turn into ASSERT. (For those who like building pbufs by hand, like me :) ). 

 

 

--- Quote Start ---  

This bug in the SGDMA (it's a bug if you can't send 1-x bytes) may be why InterNiche is so slow and full of copies. It's too bad it cripples lwIP, which otherwise is much more efficient. 

 

I wonder if we can write the small payload bytes right to the MAC. I.e. don't use SGDMA but use a for loop to write 1 to 3 bytes to the MAC buffer (same place the SGDMA writes to)? Does either of you or anyone know if this could be done? I think I'll put in a service request for this as a SGDMA bug. 

 

--- Quote End ---  

If you only need to insert 1-3 bytes at the _end_ of a packet, you may try the standard “On-Chip FIFO Memory Core” (or a very simple custom component) and mux its output with the SGDMA (though I never implemented this in hardware). I don’t know of a readily available solution for the general case (when incomplete words are inside the packet). It should not be very difficult to code. The main challenge seems to be repacking the stream to become Avalon-ST compliant: e.g. <sop4><4><1><4><2eop> into <sop4><4><4><3eop>. 
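
To make the repacking step concrete, here is a small software model of it. It tracks only how many valid bytes each beat carries (not the data, and not the real Avalon-ST signals), and repack_beats() is a made-up name for illustration; a hardware version would be a small shifter/FIFO doing the same bookkeeping.

```c
#include <stddef.h>

/* Repack one packet, given as per-beat valid-byte counts (1..4 each),
 * so that every output beat carries 4 bytes except possibly the last.
 * Returns the number of beats written to out[]; the result is
 * truncated if out_cap is too small. */
static size_t repack_beats(const unsigned *in, size_t n_in,
                           unsigned *out, size_t out_cap)
{
    unsigned pending = 0;           /* bytes carried over, always < 4 */
    size_t n_out = 0;

    for (size_t i = 0; i < n_in; i++) {
        pending += in[i];           /* at most 3 + 4 = 7 here */
        if (pending >= 4) {
            if (n_out == out_cap)
                return n_out;       /* output full */
            out[n_out++] = 4;       /* emit one full word */
            pending -= 4;
        }
    }
    if (pending > 0 && n_out < out_cap)
        out[n_out++] = pending;     /* short final beat carries eop */
    return n_out;
}
```

Feeding it the beat widths from the example (4, 4, 1, 4, 2) yields 4, 4, 4, 3.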

 

 

--- Quote Start ---  

Great discussion - thanks! 

 

--- Quote End ---  

Thank you also!

Altera_Forum

Igor, 

 

 

--- Quote Start ---  

How could I expect this in advance? 

--- Quote End ---  

 

 

I didn't mean to be condescending. I guess we have to remember that zero-copy means "use the given pointer to the payload". Even so, the unaligned pointer was not your issue here. That was handled by the driver (and now I see why it's good not to take that out, even though it involves a copy). What got you is what I call the SGDMA bug. Aligned or not, the 1-byte transfer was not going to work. 

 

 

--- Quote Start ---  

Yes, for UDP, alignment is under the developer’s control. I also had to reinvent some kind of “zero-copy reliable UDP”. I use lwIP for retransmitting only – this makes life easier. Actually, if you cook UDP packets in hardware, you can retransmit a packet without lwIP intervention at all – just instruct one SGDMA to ship out exactly what had been delivered into memory by another SGDMA. 

--- Quote End ---  

 

 

If you NAK each packet, yes. And sending is faster. I have a multi-nak protocol which might have to go back in the buffer and resend one or more packets. I actually build and checksum the IP and UDP headers in a static position and only patch the checksum before sending the packet header and payload. This is because most of the header doesn't change. lwIP could incorporate this too because once connected to a PCB, much of the header and that checksum part is static. 
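
A rough sketch of the "static header, patch the checksum" idea, limited to the IPv4 header (the UDP side is not shown). It assumes only the total-length and identification fields change per packet; the struct and function names are invented for this illustration and are not from lwIP or my driver.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add the 16-bit big-endian words of p[0..len) to a running
 * one's-complement sum. len is assumed to be even. */
static uint32_t sum16(const uint8_t *p, size_t len, uint32_t acc)
{
    for (size_t i = 0; i + 1 < len; i += 2)
        acc += ((uint32_t)p[i] << 8) | p[i + 1];
    return acc;
}

/* Fold the carries back in and invert to get the checksum value. */
static uint16_t fold_and_invert(uint32_t acc)
{
    while (acc >> 16)
        acc = (acc & 0xFFFFu) + (acc >> 16);
    return (uint16_t)~acc;
}

struct ip_hdr_template {
    uint8_t  hdr[20];       /* IPv4 header without options */
    uint32_t static_sum;    /* sum over the fields that never change */
};

/* Done once: zero the per-packet fields and sum everything else. */
static void tmpl_init(struct ip_hdr_template *t, const uint8_t hdr[20])
{
    memcpy(t->hdr, hdr, 20);
    t->hdr[2]  = t->hdr[3]  = 0;    /* total length   */
    t->hdr[4]  = t->hdr[5]  = 0;    /* identification */
    t->hdr[10] = t->hdr[11] = 0;    /* header checksum */
    t->static_sum = sum16(t->hdr, 20, 0);
}

/* Done per packet: two additions and a fold instead of a full pass. */
static void tmpl_patch(struct ip_hdr_template *t, uint16_t total_len, uint16_t id)
{
    uint16_t csum = fold_and_invert(t->static_sum + total_len + id);

    t->hdr[2]  = (uint8_t)(total_len >> 8);
    t->hdr[3]  = (uint8_t)total_len;
    t->hdr[4]  = (uint8_t)(id >> 8);
    t->hdr[5]  = (uint8_t)id;
    t->hdr[10] = (uint8_t)(csum >> 8);
    t->hdr[11] = (uint8_t)csum;
}
```

The per-packet cost is independent of header length, which is the point of keeping the header in a fixed place.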

 

 

--- Quote Start ---  

Would be better to turn into ASSERT. (For those who like building pbufs by hand, like me :) ). 

--- Quote End ---  

 

 

Sure, good idea. 

 

 

--- Quote Start ---  

If you only need to insert 1-3 bytes at the _end_ of a packet, you may try the standard “On-Chip FIFO Memory Core” (or a very simple custom component) and mux its output with the SGDMA (though I never implemented this in hardware). I don’t know of a readily available solution for the general case (when incomplete words are inside the packet). It should not be very difficult to code. The main challenge seems to be repacking the stream to become Avalon-ST compliant: e.g. <sop4><4><1><4><2eop> into <sop4><4><4><3eop>. 

--- Quote End ---  

 

 

Too bad there's no MAC register to "queue a byte into the frame buffer". In reality if it were designed well you should be able to use the MAC in either SGDMA mode or direct CPU write mode. It's just receiving data. You would need START, DATA, and STOP registers to start, send data, and terminate a packet. 

 

Bill

Altera_Forum

There may be a much faster way to handle these 1-3 byte SGDMA transfers. 

 

I believe it to be true that there would only be one 1-byte pbuf in a chain, is this true? If it *is* true, you could ensure that all pbufs are allocated with an extra 4 bytes of payload. This would be done for PBUF_POOL (not sure about PBUF_RAM). Take the 1 to 3 bytes of the next pbuf and add it to the end of the payload of the preceding one. Increase that len by 1 to 3 and set the second one to 0. Leave the chain in place but skip pbufs with len==0. 

 

This would be very efficient for the common case of a trailing short pbuf chain. 

 

For PBUF_REF you may have to unwind to a new pbuf. Or ensure 0-copy doesn't use small payloads (probably not hard to manage). 
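
A sketch of the folding idea, using a stand-in struct rather than lwIP's real struct pbuf. The cap field (allocated payload size) does not exist in lwIP and is an assumption standing in for "every pbuf was allocated with a few spare bytes"; a real version would also have to keep tot_len consistent and must not write into PBUF_REF/PBUF_ROM payloads.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct pbuf: only the fields used here. */
struct mini_pbuf {
    struct mini_pbuf *next;
    unsigned char    *payload;
    unsigned          len;   /* bytes in use */
    unsigned          cap;   /* bytes allocated, assumed >= len */
};

/* Fold each 1..3 byte pbuf into the nearest preceding non-empty pbuf
 * when that one has spare room. Folded pbufs stay in the chain with
 * len == 0 so the send loop can skip them. Returns how many were folded. */
static unsigned fold_short_pbufs(struct mini_pbuf *head)
{
    unsigned folded = 0;
    struct mini_pbuf *prev = head;

    if (head == NULL)
        return 0;

    for (struct mini_pbuf *p = head->next; p != NULL; p = p->next) {
        if (p->len >= 1 && p->len <= 3 && prev->cap - prev->len >= p->len) {
            memcpy(prev->payload + prev->len, p->payload, p->len);
            prev->len += p->len;
            p->len = 0;
            folded++;
        } else if (p->len > 0) {
            prev = p;        /* byte order in the chain is preserved */
        }
    }
    return folded;
}
```

The send loop would then start with something like if (p->len == 0) continue; before building a descriptor.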

 

Bill

Altera_Forum

I'm new to this TSE code and example program, but it seems that in the timer implementation, when the variable lwip250mStimer reaches the limit of alt_u32, the timers will stop working. Am I right? If so, such a program will work for approximately 50 days. 

 

thanks

Altera_Forum

Good point! Yes, true, about 50 days. The point was to provide a TSE lwIP driver and the application is far from production ready (or production usable in a real application). However, this example isn't correct and it should be. Originally I was going to post only the TSE driver. I thought it would be nice to see it work so threw in main.c - hastily as you pointed out: 

 

The: 

    lwip250mStimer += 250;

Should be: 

    if( lwip250mStimer >= 0xFFFFFFFF - 249 )
        lwip250mStimer = 0;
    else
        lwip250mStimer += 250;

Thanks - good find. There is probably a better and/or more efficient way to call lwIP timers. lwIP also now includes timer handling internally so if used would avoid this bug as well. See timers.c/h in lwIP and LWIP_TIMERS in lwipopts.h. 

 

Bill

Altera_Forum

Thanks for your reply, 

 

I have another problem :( 

When I use tcp_write() with data that is 535 bytes long, I can get 10 Mbit/s on a 100 Mbit/s link. 

But when the data is longer than 535 bytes, throughput drops to around 120 kbit/s :(  

 

I know 10 Mbit/s isn't much, but I assumed that bigger data chunks would increase the speed, and that isn't the case :( 

 

Anyway, with UDP I can get 82 Mbit/s on a 100 Mbit/s link.... 

 

Any ideas? 

 

thanks

Altera_Forum

lwipopts.h is not optimized and is close to the shipped opt.h. One person's idea of fast settings might not be another's, or might take up more resources than are available. There are many posts on the lwIP forum (at savannah.nongnu.org) about "optimum" options. On a NIOS II system with gobs of RAM, I hike TCP_WND and allow thousands of packet buffers. I also allow more pending packets (MEMP_NUM_TCP_SEG) and use a full MSS. Unfortunately, one of the limits on speed is the Altera driver, which is partly why InterNiche is dead slow (and dies with any kind of packet flood). 
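
As a starting point only, a lwipopts.h fragment along those lines might look like the following. The option names are standard lwIP options; every value is an illustrative assumption for a system with plenty of RAM, not a tuned or recommended setting, so adjust to what your memory allows.

```c
/* lwipopts.h fragment: illustrative values only */
#define MEM_ALIGNMENT       4

#define TCP_MSS             1460                  /* full MSS: 1500 - 20 (IP) - 20 (TCP) */
#define TCP_WND             (16 * TCP_MSS)        /* keep <= 65535 (no window scaling) */
#define TCP_SND_BUF         (16 * TCP_MSS)
#define TCP_SND_QUEUELEN    (4 * TCP_SND_BUF / TCP_MSS)
#define MEMP_NUM_TCP_SEG    (2 * TCP_SND_QUEUELEN) /* must be >= TCP_SND_QUEUELEN */

#define PBUF_POOL_SIZE      1024                  /* "thousands" if RAM permits */
#define PBUF_POOL_BUFSIZE   1536                  /* room for a full frame plus padding */
```

The relationships between the values matter more than the absolute numbers: the window and send buffer are multiples of the MSS, and the segment pool covers the send queue.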

 

TCP writing is faster than reading. You should get close to 100 Mbps in both directions with "good" lwIP options. I use Gig and it's OK on sending but has trouble with high speed receiving - I'm still trying to get a hardware update to get flow control working. My high bandwidth stuff is using UDP and 500 Mbps outbound is easily possible. My TCP app with large amounts of inbound TCP isn't so critical but needs to be 40 Mbps or so, which was not a problem. 

 

I use RAW API in lwIP and that's a huge speed advantage over netconn or worse, sockets. It all depends on what you need for the API. 

 

Somewhere I posted the 10 or 12 best things you can do to improve performance. Note that in lwIP 1.4.x a couple of the speed improvements came from patches submitted by me (notably inline IP checksumming). 1.4.x is definitely better than 1.3.x, which this example is based on. I recommend using the latest lwIP release. 

 

Hope this helps, 

Bill

Altera_Forum

Timers should normally use 'modulo' arithmetic - where you always subtract the two values that contain 'times' and then look at the value of the difference. 

So if the current time is 'ms_ticks' use: 

    if ((int)(ms_ticks - lwip250mStimer) >= 0) {
        /* timer has expired */
        lwip250mStimer += 250;
    }

Rather than checking for ms_ticks >= lwip250mStimer.
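
A self-contained version of that comparison, to show it surviving a counter wrap. The helper names are made up for this sketch, and it assumes a free-running 32-bit millisecond tick and deadlines less than 2^31 ms away.

```c
#include <stdbool.h>
#include <stdint.h>

/* True once 'now' has reached or passed 'deadline', even if the 32-bit
 * tick counter wrapped in between. The unsigned subtraction wraps by
 * definition; the conversion to int32_t relies on the usual
 * two's-complement behaviour of mainstream compilers. */
static bool timer_expired(uint32_t now, uint32_t deadline)
{
    return (int32_t)(now - deadline) >= 0;
}

/* Poll helper: returns true and re-arms the deadline when it is due. */
static bool poll_periodic(uint32_t now, uint32_t *deadline, uint32_t period_ms)
{
    if (!timer_expired(now, *deadline))
        return false;
    *deadline += period_ms;     /* may wrap past 2^32; that is fine */
    return true;
}
```

A plain now >= deadline test would misfire right after the deadline wraps to a small number while the tick is still close to 2^32; the signed difference does not.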

Altera_Forum

Bill, 

 

 

--- Quote Start ---  

 

If you NAK each packet, yes. And sending is faster. I have a multi-nak protocol which might have to go back in the buffer and resend one or more packets. I actually build and checksum the IP and UDP headers in a static position and only patch the checksum before sending the packet header and payload. This is because most of the header doesn't change. lwIP could incorporate this too because once connected to a PCB, much of the header and that checksum part is static. 

 

--- Quote End ---  

 

When you receive a multi-nak response you may simply build a list of descriptors pointing to the appropriate stored packets and feed it to the SGDMA to retransmit the complete batch asynchronously. The overhead is a large number of SGDMA descriptors (I used one for every packet in the retransmit buffer) + space for headers + the need for dedicated SGDMA hardware. I also had to design my protocol so that retransmitted packets are exact copies of the originals. I tested this for up to a 100% retransmit rate @ 300 Mbit/sec. Frankly speaking, I have now switched back to a simple udp_sendto_if() because retransmit traffic is negligible on a real LAN when appropriate software is used on the host side. 

 

Igor

Altera_Forum

Bill, 

 

 

--- Quote Start ---  

There may be a much faster way to handle these 1-3 byte SGDMA transfers. 

 

I believe it to be true that there would only be one 1-byte pbuf in a chain, is this true? If it *is* true, you could ensure that all pbufs are allocated with an extra 4 bytes of payload. This would be done for PBUF_POOL (not sure about PBUF_RAM). Take the 1 to 3 bytes of the next pbuf and add it to the end of the payload of the preceding one. Increase that len by 1 to 3 and set the second one to 0. Leave the chain in place but skip pbufs with len==0. 

This would be very efficient for the common case of a trailing short pbuf chain. 

 

--- Quote End ---  

 

Maybe, but I'm not sure. I didn’t analyze the TCP code in depth, but it looks like lwIP 1.4.0 tends to combine all writes, even NOCOPY ones, into a single oversized buffer. That is why the driver usually sees a single “long” pbuf for a single TCP packet. The notable exception I was “lucky” enough to run into is a sequence of NOCOPY writes issued at a moment when the internal buffer is empty. In this state TCP allocates a separate pbuf for the headers and chains it with the PBUF_REFs pointing to user-supplied data. 

 

--- Quote Start ---  

 

For PBUF_REF you may have to unwind to a new pbuf. Or ensure 0-copy doesn't use small payloads (probably not hard to manage). 

 

--- Quote End ---  

 

Surely this could be easily managed when one builds an application from the ground up. But it could be rather confusing when such things happen deep inside third-party code, e.g. an HTTP server. 

 

Igor

Altera_Forum

PETRAK, 

 

--- Quote Start ---  

I'm new to this TSE code and example program, but it seems that in the timer implementation, when the variable lwip250mStimer reaches the limit of alt_u32, the timers will stop working. Am I right? If so, such a program will work for approximately 50 days. 

 

--- Quote End ---  

 

I would add to Bill’s comment that the timer code in the example is error-prone. If any of the XXX_TMR_INTERVAL constants happens not to be divisible by 250, you will get an incorrect period of LCM(XXX_TMR_INTERVAL, 250) for that timer. For example, if you simply enable AUTOIP (LWIP_AUTOIP) in your lwipopts.h, you will have a 500 ms period for autoip_tmr() instead of the intended 100 ms. 
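
The LCM effect can be checked with a few lines of C. This assumes the example dispatches each timer with a test of the form (counter % interval == 0) while the counter advances in 250 ms steps; effective_period() is a throwaway name for this illustration.

```c
#include <stdint.h>

/* Period (in ms) at which a timer actually fires when a counter that
 * advances in step_ms increments is tested with
 * (counter % interval_ms == 0). Both arguments must be non-zero and
 * small enough that step_ms * interval_ms fits in 32 bits. */
static uint32_t effective_period(uint32_t step_ms, uint32_t interval_ms)
{
    uint32_t counter = 0;

    do {
        counter += step_ms;
    } while (counter % interval_ms != 0);

    return counter;     /* first hit after 0 = LCM(step_ms, interval_ms) */
}
```

With a 250 ms step, an interval of 100 ms gives 500 ms, while 250 ms and 500 ms intervals come out as expected.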

 

Dsl, the problem is not the wrapping itself, but that 2^32 is not divisible by 250. 

 

Igor

Altera_Forum

If you use modulo arithmetic, it doesn't matter whether the timeout divides into 2^32. 

If you try not to use modulo arithmetic you have real trouble expiring certain timers - eg 257 ticks.

Altera_Forum

 

--- Quote Start ---  

Timers should normally use 'modulo' arithmetic - where you always subtract the two values that contain 'times' and then look at the value of the difference. 

So if the current time is 'ms_ticks' use: 

    if ((int)(ms_ticks - lwip250mStimer) >= 0) {
        /* timer has expired */
        lwip250mStimer += 250;
    }

Rather than checking for ms_ticks >= lwip250mStimer. 

--- Quote End ---  

 

 

No, that doesn't work. It doesn't keep lwip250mStimer a multiple of 250 which the tests in the timer code require. 

 

Bill

Altera_Forum

 

--- Quote Start ---  

lwipopts.h is not optimized and is close to the shipped opt.h. One person's idea of fast settings might not be another's, or might take up more resources than are available. There are many posts on the lwIP forum (at savannah.nongnu.org) about "optimum" options. On a NIOS II system with gobs of RAM, I hike TCP_WND and allow thousands of packet buffers. I also allow more pending packets (MEMP_NUM_TCP_SEG) and use a full MSS. Unfortunately, one of the limits on speed is the Altera driver, which is partly why InterNiche is dead slow (and dies with any kind of packet flood).  

--- Quote End ---  

 

Is InterNiche such a bad idea? My plan was to buy it if lwIP turned out not to be an option (due to performance). 

 

 

--- Quote Start ---  

 

TCP writing is faster than reading. You should get close to 100 Mbps in both directions with "good" lwIP options. I use Gig and it's OK on sending but has trouble with high speed receiving - I'm still trying to get a hardware update to get flow control working. My high bandwidth stuff is using UDP and 500 Mbps outbound is easily possible. My TCP app with large amounts of inbound TCP isn't so critical but needs to be 40 Mbps or so, which was not a problem. 

 

--- Quote End ---  

I wonder how you can achieve this :) Maybe I am doing something wrong; probably it is the timers problem (I'm using the implementation from your example for lwIP 1.3). I should have a look at the timers implementation introduced in lwIP 1.4. 

 

 

--- Quote Start ---  

 

I use RAW API in lwIP and that's a huge speed advantage over netconn or worse, sockets. It all depends on what you need for the API. 

 

--- Quote End ---  

Me too, no OS so RAW API used 

 

 

--- Quote Start ---  

 

Somewhere I posted the 10 or 12 best things you can do to improve performance. Note that in lwIP 1.4.x a couple of the speed improvements came from patches submitted by me (notably inline IP checksumming). 1.4.x is definitely better than 1.3.x, which this example is based on. I recommend using the latest lwIP release. 

Hope this helps, 

Bill 

 

--- Quote End ---  

I'm already using lwIP 1.4; could you please point me to the differences in the example between lwIP versions? 

 

Thank You

Altera_Forum

 

--- Quote Start ---  

Is InterNiche such a bad idea? My plan was to buy it if lwIP turned out not to be an option (due to performance). 

--- Quote End ---  

 

 

The opposite is true - you will switch back to lwIP if you try InterNiche because it is slower. It is also less robust - I can easily lock it up where I can't lock up lwIP doing the same test. 

 

 

--- Quote Start ---  

 

I wonder how you can achieve this :) Maybe I am doing something wrong; probably it is the timers problem 

--- Quote End ---  

I couldn't do this at first, although my first year with lwIP was on a 533MHz PowerPC embedded system, and it could do the full link speed with TCP. I didn't hit performance problems until the NIOS II. I needed 240 Mbps with TCP and was unable to achieve that, but with UDP I could do 800 Mbps. Along the way we added TCP/UDP checksumming in hardware. That would have helped TCP, but I knew it still wouldn't be enough. I spent 2 months optimizing lwIP - the drivers (SGDMA, TSE and PHY are mostly rewritten, and memcpy is my own in assembly language), plus a lot of experimenting. That's where the aforementioned list came from. 

 

 

--- Quote Start ---  

(I'm using the implementation from your example for lwIP 1.3). I should have a look at the timers implementation introduced in lwIP 1.4. 

--- Quote End ---  

I don't use them - I use something similar to this example but used with an OS (but I use NO_SYS=1). 

 

 

--- Quote Start ---  

Me too, no OS so RAW API used 

--- Quote End ---  

I use a cooperative OS (written myself) - you almost have to use an OS to keep TCP/IP happy in the background while having your application run in the foreground. My system is event and ISR driven so TCP/IP gets all the CPU time outside of events and interrupts. 

 

 

--- Quote Start ---  

I'm already using lwIP 1.4; could you please point me to the differences in the example between lwIP versions? 

--- Quote End ---  

That's good. You have to optimize until you see speeds you can live with. 

 

Bill

Altera_Forum

 

--- Quote Start ---  

Definitely see: http://www.alteraforum.com/forum/showpost.php?p=100193&postcount=37 

--- Quote End ---  

 

+1 These recommendations are VERY helpful.

Altera_Forum

Hey guys, 

 

 

I got an error using this example. 

 

I built my example and ran it on hardware (the iniche sss example runs well). I get the following output. 

 

Running... 

Waiting for link...OK 

Waiting for DHCP IP address...IP address: 192.168.221.130 

 

Why don't I get any information about the PHY like the example shows? 

 

I use a different PHY chip than the Marvell 88E1119; I think it is a Marvell 88E1111. Does it have something to do with that? 

 

I can also ping my board now and can connect to the HTTP server in the browser, but I'm wondering why I can't get this information. 

 

thanks for help. 

 

Rene

Altera_Forum

 

--- Quote Start ---  

Hey guys, 

Why don't I get any information about the PHY like the example shows? 

Rene 

--- Quote End ---  

 

Rene, the PHY is managed by altera_avalon_tse.c, not by lwIP. To enable debug output, compile your BSP with -DALT_DEBUG.

Altera_Forum

hey ims, 

 

thx for this 

 

 

I now have another problem. I tried to make my own example with TCP, excluding the HTTP files and adding my own TCP files. I can connect and I can receive data over TCP port 7, but when I call pbuf_free() after receiving (after tcp_recved()), I get the error 

 

Assertion "pbuf_free: p->ref > 0" failed 

 

On the web I read that it has to do with multiple accesses to the pbuf. The file lwip_tse_mac.c accesses it too; is that the problem, and why doesn't this error occur when I use the HTTP example from here? 

 

thanks Rene
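
For anyone hitting the same message: the assertion text comes from lwIP's pbuf_free(), which checks the reference count before decrementing it, so it usually points at the same pbuf being freed more often than it was referenced (for example, once in the receive callback and again somewhere else). Below is a tiny stand-alone model of that counting. It is not lwIP code; the struct and function names are invented for illustration.

```c
#include <stddef.h>

/* Stand-in for a reference-counted buffer (not lwIP's struct pbuf). */
struct ref_pbuf {
    unsigned ref;
};

static void ref_pbuf_init(struct ref_pbuf *p) { p->ref = 1; }  /* like a fresh allocation */
static void ref_pbuf_ref(struct ref_pbuf *p)  { p->ref++; }    /* an extra holder */

/* Returns 1 if this call dropped the last reference, 0 if other
 * holders remain, and -1 if the count was already 0, which is the
 * condition the "p->ref > 0" assertion guards against. */
static int ref_pbuf_free(struct ref_pbuf *p)
{
    if (p->ref == 0)
        return -1;
    return (--p->ref == 0) ? 1 : 0;
}
```

The rule of thumb this models: each holder frees exactly once. Whether the second free in the case above comes from the application or the driver cannot be told from the message alone.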