After many many requests and complaints about lack of support and/or documentation for support of lwIP for the Altera TSE, I have developed a drop-in TSE driver and example program and made this available to the NIOS II community. This was done for NIOS II 8.1 SP0.01. I don't expect difficulty with version 9.x.
It targets the latest version of lwIP as of this post and includes a minimal program and an HTTP server based on the HTTP server in the lwIP contrib folder. The lwIP TSE driver uses the altera_avalon_tse driver and SGDMA as-is. There is a complete (as in 41-step) set of instructions on creating the project and example program. More information and the link to the driver are available here: http://lwip.wikia.com/wiki/available_device_drivers#lwip_1.3.2 Please direct any questions, changes for NIOS II 9.1, or comments to this thread.
12-16-2010 update: This example works with NIOS Version 10.0 with some tweaks to the procedure to create the project. Also, a lwIP 1.4 release candidate has been out for a while and it drops into this example (in place of 1.3) without changes.
Bill
--- Quote Start --- DipSwitch, I think there is a problem in your (and Bill’s) driver. low_level_input() in altera_tse_ethernetif.c allocates pbufs for incoming packets with pbuf_alloc(PBUF_RAW, PBUF_POOL_BUFSIZE, PBUF_POOL). By default PBUF_POOL_BUFSIZE evaluates to 1080 and, consequently, longer packets corrupt memory. In my code I simply redefine PBUF_POOL_BUFSIZE to be 2000, but for universal driver, like yours, some more elegant way should be found. --- Quote End --- Hmm, lets see:
#ifndef PBUF_POOL_BUFSIZE
#define PBUF_POOL_BUFSIZE LWIP_MEM_ALIGN_SIZE(TCP_MSS+40+PBUF_LINK_HLEN+ETH_PAD_SIZE)
#endif
TCP_MSS = 1460 PBUF_LINK_HLEN = 14 ETH_PAD_SIZE = 2 To sum up: 1460+40+14+2 = 1516 Seems like no problems here?
--- Quote Start --- Hmm, lets see: TCP_MSS = 1460 PBUF_LINK_HLEN = 14 ETH_PAD_SIZE = 2 To sum up: 1460+40+14+2 = 1516 Seems like no problems here? --- Quote End --- Lets see again :) In the code DipSwitch posted on January 19th TCP_MSS is defined through the chain:
add_sw_setting decimal_number system_h_define memory.mem_size CONF_LWIP_MEM_SIZE 32768 "Size of the memory poll"
#define MEM_SIZE CONF_LWIP_MEM_SIZE
#define TCP_WND (MEM_SIZE / 8)
#define TCP_MSS (TCP_WND / 4)
With all defaults TCP_MSS = 1024. In any case, I think you will agree that “Ethernet frame length” is completely unrelated to “the size of the heap memory”, “TCP Maximum segment size”, or “The size of a TCP window”. Mixing all of these together is misleading for the user. For example, I spent the last day debugging strange hangs in my code, which occurred irregularly after anywhere from 40 seconds to 1 hour of successful execution. That code doesn’t need much lwIP memory and doesn’t use TCP at all!
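The arithmetic in the last few posts can be checked on a host machine. This is only a sketch: it assumes MEM_ALIGNMENT is 4 and hard-codes the header sizes quoted above (40, 14 and 2), and the helper names are made up for illustration rather than taken from lwIP or the driver.

```c
#include <stddef.h>

/* Assumption: lwIP's MEM_ALIGNMENT is 4 on this target. */
#define MEM_ALIGNMENT_ASSUMED 4u

/* Same rounding that LWIP_MEM_ALIGN_SIZE performs: round up to the alignment. */
static size_t mem_align_size(size_t size)
{
    return (size + MEM_ALIGNMENT_ASSUMED - 1u) & ~(size_t)(MEM_ALIGNMENT_ASSUMED - 1u);
}

/* Default PBUF_POOL_BUFSIZE as a function of TCP_MSS, using the values
 * quoted in the thread: 40 (IP+TCP headers), PBUF_LINK_HLEN = 14,
 * ETH_PAD_SIZE = 2. */
static size_t pbuf_pool_bufsize(size_t tcp_mss)
{
    const size_t ip_tcp_headers = 40u;
    const size_t pbuf_link_hlen = 14u;
    const size_t eth_pad_size = 2u;
    return mem_align_size(tcp_mss + ip_tcp_headers + pbuf_link_hlen + eth_pad_size);
}

/* TCP_MSS as derived in the BSP script chain: MEM_SIZE / 8 / 4. */
static size_t tcp_mss_from_mem_size(size_t mem_size)
{
    size_t tcp_wnd = mem_size / 8u;
    return tcp_wnd / 4u;
}
```

With mem_size = 32768 this gives TCP_MSS = 1024 and a pool buffer of 1080 bytes, which is the value reported in the original complaint. With TCP_MSS = 1460 it gives 1516.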
--- Quote Start --- With all defaults TCP_MSS = 1024. In any case, I think you will agree that “Ethernet frame length” is completely unrelated to “the size of the heap memory”, “TCP Maximum segment size”, or “The size of a TCP window”. Mixing all of these together is misleading for the user. For example, I spent the last day debugging strange hangs in my code, which occurred irregularly after anywhere from 40 seconds to 1 hour of successful execution. That code doesn’t need much lwIP memory and doesn’t use TCP at all! --- Quote End --- Sorry about that! In our project we use 64Kb as mem_size, so it didn't occur to me. But I do agree with you that they are unrelated. Maybe it would be nice to add those variables to the TCL script and find some way to generate errors while generating the BSP. Any hints on how to do this are welcome. And again, sorry for wasting your time!
--- Quote Start --- Basically yes. What about memcpy optimizations? Maybe worth adding a dma? --- Quote End --- DMA doesn't work reliably with the TSE. I would rarely get a bad byte in the copied data. Even so in my timing tests the timing improvement wasn't significant compared to memcpy - or I should say my memcpy which is highly optimized assembly code. Bill
--- Quote Start --- Hmm, lets see:
#ifndef PBUF_POOL_BUFSIZE
#define PBUF_POOL_BUFSIZE LWIP_MEM_ALIGN_SIZE(TCP_MSS+40+PBUF_LINK_HLEN+ETH_PAD_SIZE)
#endif
TCP_MSS = 1460 PBUF_LINK_HLEN = 14 ETH_PAD_SIZE = 2 To sum up: 1460+40+14+2 = 1516 Seems like no problems here? --- Quote End --- The defaults are correct. If TCP_MSS is changed I could see it being a problem. I wouldn't think NIOS II systems are so memory constrained to not allow a large MSS. Although we were mulling here over making an lwIP-based program entirely in onchip memory. So I could see that being a case. Bill
--- Quote Start --- The defaults are correct. If TCP_MSS is changed I could see it being a problem. I wouldn't think NIOS II systems are so memory constrained to not allow a large MSS. Although we were mulling here over making an lwIP-based program entirely in onchip memory. So I could see that being a case. Bill --- Quote End --- Bill, I agree that your defaults are consistent, but it is not obvious to the user that changing TCP_MSS also affects the maximum allowable Ethernet frame size. A more dangerous and hard-to-debug flaw is the rx buffer overflow which may occur with the small MAC core if PBUF_POOL_BUFSIZE < 1520. I see that setting TSEMAC_FRM_LENGTH to (PBUF_POOL_BUFSIZE+ETH_PAD_SIZE) in tse_mac_init() is supposed to do the job, but this doesn’t work because “This maximum frame length is fixed to 1518 in 10/100 and 1000 Small MAC core variations.” So, it would be good for the driver to either ensure PBUF_POOL_BUFSIZE > 1520 at compile time, or to check that PBUF_POOL_BUFSIZE > TSEMAC_FRM_LENGTH at runtime and fail hard if this condition is not met. Igor
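A rough sketch of the check suggested above, written as a plain function so it can be tried off-target. The function name and both constants are illustrative assumptions, not part of the driver. The threshold used here is "at least 1518 + 2 bytes"; the post above asks for strictly more than 1520, so treat the exact bound as something to confirm against the MAC documentation.

```c
#include <stdbool.h>
#include <stddef.h>

/* From the quoted documentation: the Small MAC variations fix the
 * maximum frame length at 1518 bytes. */
#define SMALL_MAC_MAX_FRAME 1518u
/* Assumption: ETH_PAD_SIZE is 2, as in the defaults discussed above. */
#define ETH_PAD_SIZE_ASSUMED 2u

/* Returns true when a single pool pbuf can hold the longest frame the
 * MAC may deliver plus the padding the driver places in front of it. */
static bool rx_pbuf_size_is_safe(size_t pbuf_pool_bufsize)
{
    return pbuf_pool_bufsize >= SMALL_MAC_MAX_FRAME + ETH_PAD_SIZE_ASSUMED;
}
```

In the driver itself the same condition could be a preprocessor test on PBUF_POOL_BUFSIZE ending in #error, so a bad lwipopts.h fails at build time. Note that under this check the 1080-byte pool buffer that caused the original hangs is rejected, and so is the 1516-byte default.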
--- Quote Start --- A git repository with the latest sources has been created: https://github.com/engineeringspirit/freelwip-nios-ii This version is stable but far from optimized. --- Quote End --- I think it would be really helpful and would promote much more use of your additions if this was put on the lwIP Wiki: http://lwip.wikia.com/wiki/available_device_drivers I will check what you did for sure - I hope FreeRTOS use is optional in your installation. There are several ways to use lwIP without an RTOS and it's well documented that a significant performance hit occurs using an RTOS. Integrating a significant number of lwipopts.h settings into the Eclipse install would also help new users. I wonder if a lwIP template application would be possible in the "New" menu? By the way, I use lwIP as a library - I don't know if you allowed for that. In fact I have the TSE driver as a LIB also but unfortunately it needs to be tied to a BSP. This makes it hard to use the TSE/lwIP combo in multiple projects that have (and normally would have) different BSPs without duplicating the 2 LIBs. It's great to see your additions here. You really did more than I did and I'm thankful someone took it as a catalyst and ran with it like I wish I could have. lwIP 1.4.1 is due out shortly and 1.5 isn't far off with stable IPV6. Bill
Igor,
--- Quote Start --- I agree that your defaults are consistent, but it is not obvious to the user that changing TCP_MSS also affects the maximum allowable Ethernet frame size. --- Quote End --- There is unfortunately a good bit of lwIP and lwipopts that is not obvious. --- Quote Start --- A more dangerous and hard-to-debug flaw is the rx buffer overflow which may occur with the small MAC core if PBUF_POOL_BUFSIZE < 1520. I see that setting TSEMAC_FRM_LENGTH to (PBUF_POOL_BUFSIZE+ETH_PAD_SIZE) in tse_mac_init() is supposed to do the job, but this doesn’t work because “This maximum frame length is fixed to 1518 in 10/100 and 1000 Small MAC core variations.” So, it would be good for the driver to either ensure PBUF_POOL_BUFSIZE > 1520 at compile time, or to check that PBUF_POOL_BUFSIZE > TSEMAC_FRM_LENGTH at runtime and fail hard if this condition is not met. --- Quote End --- Then I agree it's too easy to break things with MSS changes and it shouldn't be left as-is. alteraTseEthernetif.c should include a #define with the right (calculated) size to be used in the pbuf_alloc calls. I typically code that way when I know there are constraints. Bill
It looks like the driver incorrectly handles chains of short pbufs. The following code produces invalid TCP packets:
usleep(1000000);
netconn_write(com->conn, "a", 1, NETCONN_NOCOPY);
netconn_write(com->conn, "b", 1, NETCONN_NOCOPY);
netconn_write(com->conn, "c", 1, NETCONN_NOCOPY);
netconn_write(com->conn, "d", 1, NETCONN_NOCOPY);
netconn_write(com->conn, "e", 1, NETCONN_NOCOPY);
usleep(1000000);
09:41:10.544021 IP 169.254.177.104.23 > IM-W7.1166: P 154:155(1) ack 30 win 5811
0x0000: 4500 0029 001d 0000 4006 95a2 a9fe b168 E..)....@......h
0x0010: a9fe dfaa 0017 048e 0000 1a08 33eb a83c ............3..<
0x0020: 5018 16b3 5833 0000 6100 0000 0000 P...X3..a.....
09:41:10.745999 IP IM-W7.1166 > 169.254.177.104.23: . ack 155 win 64086
0x0000: 4500 0028 014d 4000 8006 0000 a9fe dfaa E..(.M@.........
0x0010: a9fe b168 048e 0017 33eb a83c 0000 1a09 ...h....3..<....
0x0020: 5010 fa56 e52a 0000 P..V.*..
09:41:10.746866 IP 169.254.177.104.23 > IM-W7.1166: P 155:159(4) ack 30 win 5811
0x0000: 4500 002c 001f 0000 4006 959d a9fe b168 E..,....@......h
0x0010: a9fe dfaa 0017 048e 0000 1a09 33eb a83c ............3..<
0x0020: 5018 16b3 f266 0000 6200 0000 6300 0000 P....f..b...c...
0x0030: 6400 0000 65 d...e
(this last packet is never ACKed by the host and connection freezes)
With printf("<%d>", len); in tse_mac_raw_send() I see that the driver forwards the following pbuf sequences to the sgdma: <56><1> and <56><1><1><1><1>. According to the Altera docs the sgdma doesn’t like short transfers, and I think all these extra pad bytes are injected by the sgdma. If this is the case, then the driver should provide some workaround, maybe by combining small chunks (if any) into a single contiguous sgdma request. Also, has anyone tried to use “On-Chip FIFO Memory Core” in place of sgdma for ethernet? I have never observed this behavior with copying tcp_write. But does lwIP guarantee not to emit short pbufs for a copying tcp flow? Igor
What happens if your writes are:
netconn_write(com->conn, "abcd", 1, NETCONN_NOCOPY);
netconn_write(com->conn, "efgh"+1, 1, NETCONN_NOCOPY);
netconn_write(com->conn, "ijkl"+2, 1, NETCONN_NOCOPY);
netconn_write(com->conn, "mnop"+3, 1, NETCONN_NOCOPY);
netconn_write(com->conn, "qrst", 1, NETCONN_NOCOPY);
The problem might be with misaligned transfers.
--- Quote Start --- What happens if your writes are:
netconn_write(com->conn, "abcd", 1, NETCONN_NOCOPY);
netconn_write(com->conn, "efgh"+1, 1, NETCONN_NOCOPY);
netconn_write(com->conn, "ijkl"+2, 1, NETCONN_NOCOPY);
netconn_write(com->conn, "mnop"+3, 1, NETCONN_NOCOPY);
netconn_write(com->conn, "qrst", 1, NETCONN_NOCOPY);
The problem might be with misaligned transfers. --- Quote End --- Dsl, happens essentially the same – the same garbage on the boundaries of short writes. Bill’s driver handles misaligned writes by copying into temporary buffer on stack. I’ve attached complete session dump (for these tests I use shell.c from lwip contrib and just modify com_help()).
Igor & Dsl,
I'm starting to think that something else is going on here. I don't see problems here and now have lwIP in 4 Cyclone/NIOS-II products. I even implemented telnet as a debug console and have no problem with the 1-byte payloads that occur when you type in Telnet. However, none of my chains are 1 byte because lwIP has made them bigger by adding TCP/IP headers. Maybe you have an lwip option set which causes this? Also, I use PBUF_POOL_SIZE of 1518 so that I get as few chains as possible. I have 256MB of memory and have no concern over memory usage. Igor, what do you mean by: --- Quote Start --- Also, has anyone tried to use “On-Chip FIFO Memory Core” in place of sgdma for ethernet? --- Quote End --- Bill
Ok - so that won't show if the inserted 'pad' bytes come from specific data bytes.
What I'd noticed is that the last frame ends with the correct byte (0x65) - so the problem isn't an obvious one where all 4 bytes are always written. The IP header frame length (0x00 0x2c) is correct for the expected frame data - but not for that actually added. An alternate hypothesis is that the MAC unit only looks at the byte enables in the fifo word at the end of frame, whereas your data will have multiple unasserted byte enables embedded within the frame. If so, the software would have to be willing to build an entire frame.
I'm happy to run something if you can post some code I can add to the lwIP example and produce the problem. It would determine if it's lwIP options or hardware. Or if it's a real problem with the driver and I'm willing to spend time to try to make it right if I can.
Note my programs are *all* RAW API based. I wrote my own streaming "socket" on top of it. I've been meaning to publish my "socket" because it is very efficient - it uses pbuf chains to hold incoming data and allows the application to read one or more bytes from the front of this pbuf list. Actually this isn't that hard to implement just going off me stating how I did it. Bill
Bill,
I’ve attached the code. This is basically tcpecho_raw from lwip contrib; all the tests are inside echo_accept(). I run this under the FreeRTOS port by DipSwitch with lwip 1.4.0 (I don’t have a runnable non-OS project at the moment); my lwipopts.h is also attached. It looks like if the first tcp_write is NOCOPY, then lwip doesn’t use an oversized pbuf and forwards all the chunks straight to the driver. I fully support Dsl’s hypothesis about byte enables. According to the Avalon-ST spec, tse_mac must ignore empty[] in the middle of a packet (see Chapter 5.3, empty signal description: “If endofpacket is not asserted, this signal is not interpreted”). Also, there is an anomaly in sgdma which may affect short transfers. Here is a quotation from the Nios II EDS 11.1 errata:
--- Quote Start ---
Unaligned Transfers of Small Payloads Fail on SG-DMA
Description: The Scatter Gather DMA SOPC Builder peripheral does not correctly handle unaligned transfers with small payloads. A payload length smaller than the data width causes erroneous data transfers.
Workaround: Avoid using DMA devices to transfer small payloads. If absolutely necessary, for a 32-bit SG-DMA, a minimum length of 4 bytes guarantees that data is transferred correctly.
--- Quote End ---
As for the “On-Chip FIFO Memory Core” – this core in MM->ST configuration may perform similarly to sgdma for transmitting (we use sgdma synchronously, don’t we?), but it does not suffer from the “short writes” anomaly. That is why I thought it could be used as a workaround. Unfortunately this doesn’t help with short writes in the middle of the packet, so I withdraw this “proposal”. Igor
Igor,
Building your program (changing it to NO_SYS=1 and no RTOS) doesn't download - some debugger complaint about running 2 programs? Anyway, adding echo.c to my version of the example does run and I can confirm your finding. This is pretty bad - I don't know why I don't find this problem in real applications - even those using telnet. Anyway, you cannot send extra bytes in the middle of a frame just to meet the 4-byte minimum. The only option is to copy the pbuf chain into a single pbuf. I think this sucks! But so it is. This works:
/* @Function Description - TSE transmit API to send data to the MAC
 *
 * @API TYPE - Public
 * @param netif - lwIP network interface associated with the TSE MAC instance
 * @param pkt - pbuf (possibly a chain) holding the frame to be sent to the MAC
 * @return ERR_OK on success, ERR_MEM if a chain cannot be flattened
 */
err_t tse_mac_raw_send(struct netif *netif, struct pbuf *pkt)
{
    int tx_length;
    unsigned len;
    struct pbuf *p, *q = NULL;
    alt_u32 *data;
    tse_mac_trans_info *mi;
    lwip_tse_info *tse_ptr;
    struct ethernetif *ethernetif;
    unsigned int *ActualData;
    /* Intermediate buffer used for a temporary copy of frames that cannot be directly DMA'ed.
     * The array size was lost from the posted code; 1560 is an assumed value large enough for a full frame. */
    char buf2[1560];

    ethernetif = netif->state;
    tse_ptr = ethernetif->tse_info;
    mi = &tse_ptr->mi;

    if(pkt->next != NULL) // Unwind pbuf chains
    {
        q = pbuf_alloc(PBUF_RAW, pkt->tot_len, pkt->type);
        if(q == NULL) // out of pbufs: the frame cannot be flattened, so drop it
            return ERR_MEM;
        for(len = 0, p = pkt; p != NULL; p = p->next)
        {
            memcpy((char *)q->payload + len, p->payload, p->len);
            len += p->len;
        }
        pkt = q;
    }

    for(p = pkt; p != NULL; p = p->next)
    {
        data = p->payload;
        len = p->len;
        if(((unsigned long)data & 0x03) != 0)
        {
            /*
             * Copy data to temporary buffer <buf2>. This is done because of alignment
             * issues. The SGDMA cannot copy the data directly from (data + ETH_PAD_SIZE)
             * because it needs a 32-bit aligned address space.
             */
            memcpy(buf2, data, len);
            data = (alt_u32 *)buf2;
        }
        ActualData = (void *)alt_remap_uncached(data, len < 4 ? 4 : len);
        printf("<%d @ 0x%08X/0x%08X>", len, (unsigned int)p->payload, (unsigned int)ActualData);
        if(len < 4)
            len = 4;
        /* Write data to Tx FIFO using the DMA */
        alt_avalon_sgdma_construct_mem_to_stream_desc(
            (alt_sgdma_descriptor *) &tse_ptr->desc, // descriptor I want to work with
            (alt_sgdma_descriptor *) &tse_ptr->desc, // pointer to "next"
            (alt_u32*)ActualData,                    // starting read address
            (len),                                   // # bytes
            0,                                       // don't read from constant address
            p == pkt,                                // generate sop
            p->next == NULL,                         // generate endofpacket signal
            0);                                      // atlantic channel (don't know/don't care: set to 0)
        tx_length = tse_mac_sTxWrite(mi, &tse_ptr->desc);
        ethernetif->bytes_sent += tx_length;
    }

    if(q != NULL)
        pbuf_free(q);
    LINK_STATS_INC(link.xmit);
    return ERR_OK;
}
You could optimize it by first checking whether any pbuf in the chain is shorter than 4 bytes and doing this extra copy only when one or more is, but I kept it simple - just do it if there are chains. Maybe it's risky but I intend to not change my lwIP-based products. They don't show the problem, and I don't know why. Maybe I could add this code and it would never be called??? [Update: Probably should check for q == NULL and exit if so. Of course this will result in errors as well!] Bill
Bill,
Thank you for the confirmation and for the patch. With the modified tse_mac_raw_send() both raw_api- and netconn-based tests work fine in my configuration. I think that your simple solution should be OK for TCP, but it may severely impact the performance of udp applications if someone tries to achieve zero-copy by chaining a headers-only pbuf with a ROM- or REF-type payload-only pbuf. Both pbufs could be properly aligned, and the unnecessary and expensive unwinding may become a bottleneck. I don’t use this approach in my code currently, but I advertised it earlier in this thread. Below is my implementation of chain scanning as you suggested. I also changed it to do the unwinding in buf2[]. Is that OK, or did you have good reasons to allocate a separate pbuf for this purpose?
err_t tse_mac_raw_send(struct netif *netif, struct pbuf *pkt)
{
    int tx_length;
    unsigned len;
    struct pbuf *p;
    alt_u32 *data;
    tse_mac_trans_info *mi;
    lwip_tse_info *tse_ptr;
    struct ethernetif *ethernetif;
    unsigned int *ActualData;
    int unwind;
    /* Intermediate buffers used for a temporary copy of frames that cannot be directly DMA'ed.
     * The array size was lost from the posted code; 1560 is an assumed value large enough for a full frame. */
    struct pbuf unwind_pbuf;
    char buf2[1560];

    ethernetif = netif->state;
    tse_ptr = ethernetif->tse_info;
    mi = &tse_ptr->mi;

    /* Scan the chain: unwind if any pbuf is misaligned, shorter than 4 bytes,
     * or (when not last) has a length that is not a multiple of 4. */
    unwind = 0;
    for(p = pkt; p != NULL; p = p->next) {
        if (((unsigned long)p->payload & 0x03) != 0 || p->len < 4 || ((p->len & 3) != 0 && p->next)) {
            unwind = 1;
            break;
        }
    }

    if (unwind) {
        // Unwind pbuf chains
        if (pkt->tot_len > sizeof(buf2)) {
            // no space for unwinding; drop the packet
            return ERR_OK;
        }
        for(len = 0, p = pkt; p != NULL; p = p->next)
        {
            /*
             * Copy data to temporary buffer <buf2>. This is done because of alignment
             * issues. The SGDMA cannot copy the data directly from (data + ETH_PAD_SIZE)
             * because it needs a 32-bit aligned address space.
             */
            memcpy(buf2 + len, p->payload, p->len);
            len += p->len;
        }
        unwind_pbuf.len = unwind_pbuf.tot_len = len;
        unwind_pbuf.payload = buf2;
        unwind_pbuf.next = NULL;
        pkt = &unwind_pbuf;
    }

    for(p = pkt; p != NULL; p = p->next)
    {
        data = p->payload;
        len = p->len;
        // No need for re-alignment and length-checking here
        ActualData = (void *)alt_remap_uncached(data, len);
        //printf("<%d @ 0x%08X/0x%08X>", len, (unsigned int)p->payload, (unsigned int)ActualData);
        /* Write data to Tx FIFO using the DMA */
        alt_avalon_sgdma_construct_mem_to_stream_desc(
            (alt_sgdma_descriptor *) &tse_ptr->desc, // descriptor I want to work with
            (alt_sgdma_descriptor *) &tse_ptr->desc, // pointer to "next"
            (alt_u32*)ActualData,                    // starting read address
            (len),                                   // # bytes
            0,                                       // don't read from constant address
            p == pkt,                                // generate sop
            p->next == NULL,                         // generate endofpacket signal
            0);                                      // atlantic channel (don't know/don't care: set to 0)
        tx_length = tse_mac_sTxWrite(mi, &tse_ptr->desc);
        ethernetif->bytes_sent += tx_length;
    }

    LINK_STATS_INC(link.xmit);
    return ERR_OK;
}
Maybe you could accumulate the data sent by the user in an aligned buffer until the tcp stack actually sends the data?
For TCP it might be ok (if slightly unexpected) to resend the previous 1-3 bytes in order to always do aligned sends. If you are willing to do that, then the application send() could write directly into a pre-allocated buffer used for retransmissions and indexed by the tcp byte sequence number. Might be worth doing that anyway - but with a single buffer for 'realignment' transmits.
Dsl, I think that the easiest application level workaround is simply to not use zero-copy versions of tcp_write and netconn_write. This is appropriate for me because I need TCP for primitive debug console only. Other workarounds are also possible of course.
Guys,
Comments: First, pbuf payload is guaranteed aligned. I use a custom version of this driver (more optimized, and it improves on SGDMA and PHY handling plus some bug fixes in the Altera code) and took out the alignment check and added an assertion, and it never asserts. If user code sets payload (a bad practice) then it's not guaranteed, but pbuf_alloc aligns payload. An application that uses UDP for high data rates is probably using a custom client, in which case the payload size could be enforced. I use a zero-copy UDP high speed reliable protocol (the hardware writes in the UDP payload and checksum) and in this case the payload is large and always aligned. You only need to unwind if pkt->next isn't NULL. An unchained pbuf can never have a payload size less than 4 because of the IP header. I'd put back the pkt->next != NULL test before the for loop. Minor:
// Unwind pbuf chains
if (pkt->tot_len > sizeof(buf2)) { // no space for unwinding; drop the packet
return ERR_OK; }
Can't occur. lwIP will never chain more than the MTU. It can't because 802.11 cannot support it. This bug in the SGDMA (it's a bug if you can't send 1-x bytes) may be why InterNiche is so slow and full of copies. It's too bad it cripples lwIP which otherwise is much more efficient. I wonder if we can write the small payload bytes right to the MAC. I.e. don't use SGDMA but use a for loop to write 1 to 3 bytes to the MAC buffer (same place the SGDMA writes to)? Does either of you or anyone know if this could be done? I think I'll put in a service request for this as a SGDMA bug. Great discussion - thanks! Bill
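For anyone who wants to try the chain scan off-target, here is a small host-side sketch that combines the scan from the earlier listing with the pkt->next != NULL shortcut suggested in this post. struct mini_pbuf and chain_needs_unwind() are made-up stand-ins for illustration, not lwIP or driver names. The shortcut leans on the claim above that a single pbuf from pbuf_alloc is always aligned and never shorter than 4 bytes; if user code sets payload by hand, that assumption does not hold.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the three struct pbuf fields the scan reads. */
struct mini_pbuf {
    struct mini_pbuf *next;
    void *payload;
    uint16_t len;
};

/* Returns true when the chain must be copied into one contiguous,
 * aligned buffer before it is handed to the SGDMA. */
static bool chain_needs_unwind(const struct mini_pbuf *pkt)
{
    const struct mini_pbuf *p;

    /* Shortcut: a single pbuf is assumed to be aligned and >= 4 bytes. */
    if (pkt->next == NULL)
        return false;

    for (p = pkt; p != NULL; p = p->next) {
        if (((uintptr_t)p->payload & 0x03u) != 0)
            return true;   /* payload not 32-bit aligned */
        if (p->len < 4)
            return true;   /* shorter than the SGDMA data width */
        if ((p->len & 3u) != 0 && p->next != NULL)
            return true;   /* partial word in the middle of the frame */
    }
    return false;
}
```

With the <56><1> sequence reported earlier this returns true, while a chain of aligned 56-byte and 8-byte pbufs passes through without a copy, which is the zero-copy UDP case raised above.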
