We have several 82546EB (dual-port gigabit PCI NIC) cards that we use in Linux "router" boxes. One of our links is a 10 Mbit/sec metropolitan Ethernet connection to a datacenter. The carrier presents us with an RJ-45 Ethernet jack manually set to 10/full, so we set the link to 10/full on the Linux side to match. This has worked without problems for some time.
During a recent unplanned reboot of the router, the link appeared to go dead. Since it was during the day, there was a relatively high volume of traffic flowing over the link (or trying to). After a lot of headache we determined that the link would stop responding if (and only if) there was "a lot" of traffic during the first minute or so after the link was brought up. If the link saw only minimal traffic during the first five minutes (a conservative figure) after coming up, then it would stay up indefinitely.
Other than traffic not flowing, not much happens to indicate that anything is wrong. The switch side does not see any events. The OS side sometimes logs one or several link down/up event pairs, but the final state is always "up" whether or not the link is actually working.
To recover, the link just needs to be brought down and up without a high packet volume. ifconfig down/up is sufficient; no physical link changes or rebooting are required. Occasionally the link will recover on its own after 3-4 minutes of inactivity (and numerous down/up link state messages), but this is not typical.
The issue is present with the Linux e1000 driver under FC6, CentOS 5.3 and CentOS 5.4. It is ALSO present on FreeBSD 8.0-RC2 (using the em driver). We have reproduced the issue with two different managed switches and two completely different computers (moving one of the cards between them).
The issue can be reproduced fairly easily doing something like this:
1. Set the Ethernet port at the far end (managed switch in our case) to 10/full.
2. Set the e1000/em interface to 10/full and assign IPs/routes as appropriate.
3. Watch the system log if desired (in its own terminal): tail -F /var/log/messages
4. Start a flood ping (preferably in its own terminal), e.g. ping -nf somehost
5. Observe minimal (FreeBSD) or no (Linux) packet loss; this is normal behavior.
6. With the flood ping still running, take down the interface (e.g. ifdown eth2 or ifconfig em0 down).
7. Observe packet loss, no route to host, etc. on the flood ping.
8. Bring the interface back up (e.g. ifup eth2 or ifconfig em0 up).
9. Watch the ping again. Packets start flowing shortly after the link comes up again but then usually stop after several seconds.
10. Continue to watch if desired. Most of the time nothing else happens (it stays broken). Sometimes the adapter will go through several cycles of what appear to be adapter resets (link up/down messages, with packets getting through for a second or two after each reset). Sometimes when this happens the adapter will recover (packets get through after a reset and it doesn't die again).
11. To recover, kill the flood ping, take the interface down, bring it back up, observe that (light) traffic flows, wait a couple of minutes, then resume the flood ping.
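For anyone who wants to script the Linux side of the steps above, here is a rough sketch rather than a tested tool. The interface name (eth2), the target host (somehost), the sleep intervals and the DRY_RUN switch are all placeholders or assumptions of mine. It assumes ethtool is used to force 10/full and uses ifconfig down/up for the link bounce. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the reproduction sequence (Linux side only).
# Assumptions: IFACE, TARGET and the sleep intervals are placeholders.
# DRY_RUN=1 (the default) prints each command instead of running it.
IFACE="${IFACE:-eth2}"
TARGET="${TARGET:-somehost}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Force 10 Mb/s full duplex with autonegotiation disabled.
force_10_full() {
    run ethtool -s "$IFACE" speed 10 duplex full autoneg off
}

# Take the interface down and bring it back up.
bounce_link() {
    run ifconfig "$IFACE" down
    run sleep 5
    run ifconfig "$IFACE" up
}

reproduce() {
    force_10_full
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ ping -nf $TARGET &"
    else
        ping -nf "$TARGET" &    # flood ping requires root
        ping_pid=$!
    fi
    run sleep 10                # confirm traffic flows normally first
    bounce_link                 # link comes back up while the flood is running
    run sleep 60                # watch the ping output and system log here
    if [ "$DRY_RUN" != "1" ]; then
        kill "$ping_pid"
    fi
}

reproduce
```

Run it as root with DRY_RUN=0 to actually execute the commands, and keep tail -F /var/log/messages open in another terminal to see any link up/down messages.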
Since this behavior doesn't seem to be specific to any one NIC, driver, OS or computer/chipset, my suspicion is that it's something in the NIC hardware or firmware. We have only seen this behavior with manually-set links at 10/full. Autonegotiated links or manual 100/full links work fine. The number of people using gigabit cards at 10 Mb/s could understandably be small. We haven't (yet) tested any Intel gigabit cards other than the 82546EB units we have.
We will be using other cards (probably PRO/100) as a workaround, but I'm wondering if Intel is interested in analyzing and potentially fixing this issue. Aside from being broken and annoying, it's a potential vector for a DoS attack. If anyone else has seen this or has any suggestions (or wild speculation) that would be welcome as well.
If I can/should provide any additional information just let me know what. Thanks!
Sorry to hear you're having troubles, but it sounds like you've been working on a solution. I'll ask around to see what I can find out. One thing that would be interesting to try would be to do the test back-to-back without the link partner in the middle. I know of a family of 10/100 parts that would react badly to auto-negotiation link pulses when set to a forced speed and duplex mode. These parts were very common in a series of managed switches (names to be left out of it to protect the guilty*), and trying it out in a back-to-back environment will help double-check that the switches you have don't use the parts I'm thinking of. Send me a private message with the switch name and I'd be happy to help figure out if that is part of the problem.
As good as your description was, I'm still a little unclear. Is it Carrier -> Switch -> 82546 or is it Switch=Carrier -> 82546? If it's the first case, you can just have the 82546 do auto to the switch since it will buffer between the two. But in the second case you're right to force when the other side is forced.
Are you okay with making some changes to the code? I might have some suggestions. Here is the spec update for the 82546EB (http://developer.intel.com/design/network/specupdt/82546eb.htm) if you want to double-check against your specific conditions. All the errata in that doc have the appropriate workarounds in the drivers, especially if you update to the ones on SourceForge (http://e1000.sf.net).
*especially when they might not be guilty
The link partner is the carrier's switch (which is forced to 10/full). For testing purposes I have also reproduced the issue using our own (completely different) managed switch as a link partner. As a workaround we could insert our own switch between the carrier's switch and our router but, well, we aren't (opting to use different NICs instead for now).
I'm not opposed to code tweaking for experimentation but we will not be using local driver patches on production equipment since it adds avoidable complexity to the maintenance and build processes. However if something gets pushed upstream that could be a different story.
We plan to use CentOS 5.4 for the box I'm testing with and new builds going forward. We have kernel version 2.6.18-164.el5. The (stock) e1000 driver version is 7.3.20-k2-NAPI according to ethtool -i. How does this compare to the 8.0.16 driver here?
Is the same version numbering scheme used for both? How do both of those compare to the drivers on Sourceforge? (I do see 8.0.16 available as a "stable" e1000 option but I'm not sure what I'd be getting with a bleeding-edge version.) I'll play with this if I have time.
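For reference, this is roughly how I'm pulling the driver version when comparing boxes. It is just a sketch: the helper name driver_version is made up, and eth2 is the interface on my test box. It filters the "version:" line out of ethtool -i output.

```shell
#!/bin/sh
# Print only the driver version from `ethtool -i` output read on stdin.
# Matches the field name exactly so "firmware-version" is not picked up.
# (driver_version is a hypothetical helper name.)
driver_version() {
    awk -F': ' '$1 == "version" { print $2 }'
}

# Usage (eth2 is a placeholder interface name):
#   ethtool -i eth2 | driver_version
# The version of the e1000 module installed on disk can be checked with:
#   modinfo -F version e1000
```

Comparing the two helps confirm whether the running driver is the stock one or a newer build that has been installed but not yet loaded.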
I didn't see any exact matches in the errata document, though it's amusing that there is an issue (errata # 10) with 10/half and the suggested workaround is to use 10/full instead.