We just purchased an x520-2 (newest driver) to put in our Dell NX3100 NAS (Win 2008 Storage Server x64, fully patched) to send data across a backend switch (Dell PowerConnect 6224 w/10Gb SFP+ module) to a CentOS 6 x64 bioinformatics server (also w/the same x520-2 card). However, when we put the card in and began to test with iperf (and netperf), we consistently saw pretty low performance (i.e., never above 2Gb/sec with iperf run at its defaults: "iperf -s" on one side and "iperf -c 10.0.0.x" on the other).
We noticed that on the NX3100 iperf always defaults to a window size of 64k, while the window size on our Linux box depends on our settings in sysctl (currently it defaults to 4M/16M). If we run something like "iperf.exe -s -w 4M" / "iperf -c 10.0.0.x -w 4M" we get improved numbers, but I'm not sure what that really means (i.e., whether it is a trustworthy measure of what we need from real transfers).
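For anyone wondering why the -w value moves the numbers so much: a single TCP stream can have at most one window of data in flight per round trip, so the window size caps throughput. Here is a rough sketch of that arithmetic. The 0.25 ms round-trip time is purely an assumption for illustration (it was not measured on this setup), and the function names are made up for the example.

```python
def window_limited_gbps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on single-stream TCP throughput: one window per round trip."""
    return window_bytes * 8 / rtt_seconds / 1e9


def window_needed_bytes(link_gbps: float, rtt_seconds: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    return link_gbps * 1e9 * rtt_seconds / 8


# ASSUMPTION: 0.25 ms round trip under load. Measure your own path (e.g. with ping)
# and substitute the real value; the results scale directly with it.
ASSUMED_RTT = 0.00025

if __name__ == "__main__":
    for label, window in (("64 KB", 64 * 1024), ("4 MB", 4 * 1024 * 1024)):
        cap = window_limited_gbps(window, ASSUMED_RTT)
        print(f"{label} window: at most {cap:.1f} Gb/s per stream")
    needed_kb = window_needed_bytes(10, ASSUMED_RTT) / 1024
    print(f"window needed to fill 10 Gb/s: about {needed_kb:.0f} KB")
```

If the real round-trip time under load is anywhere near that assumed value, a 64k window alone would hold a single stream to a couple of Gb/sec regardless of the NIC, which would make the -w 4M result a statement about window sizing rather than about the card.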
My question is about tuning the Windows box, specifically things like TcpWindowSize. There is a significant amount of conflicting info out there on how this is done on Win2008/7 now that autotuning is enabled (i.e., some say it ignores any TcpWindowSize regedits one makes, some say it doesn't). I can say that we have had very little success in making changes that actually make any difference on the Windows side.
Can someone definitively tell me how that tuning is done on the Windows side? (not just "turn autotuning to disabled") We are getting a bit frustrated.
Any help is much appreciated,
I have actually done more than a little bit of investigation into this area recently myself. All my reading and testing show that you cannot modify any of those settings in Windows Server 2008 or Windows 7. I would assume the same is true with Windows 8.
I too was digging into why performance was so poor between Windows and Linux, though in my case the Linux side is running on a BMC and not a full processor. I was able to adjust some parameters on the BMC and get about a 700% performance increase; again, that is for a BMC, not a 'real' CPU. Specifically, I modified net.ipv4.tcp_rmem and net.ipv4.tcp_wmem to align more with the Windows TCP window sizing.
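For reference, tcp_rmem and tcp_wmem each hold three numbers (minimum, default, and maximum buffer size in bytes), and on a mainline Linux kernel they are exposed as files under /proc/sys/net/ipv4. A small sketch for checking what a box is currently using follows; the helper names are made up for the example, and it simply returns None on a system where the files are not present.

```python
from pathlib import Path


def parse_tcp_mem(text: str) -> dict:
    """Parse the 'min default max' triple used by tcp_rmem and tcp_wmem."""
    minimum, default, maximum = (int(value) for value in text.split())
    return {"min": minimum, "default": default, "max": maximum}


def read_tcp_mem(name: str, root: str = "/proc/sys/net/ipv4"):
    """Return the parsed triple for one setting, or None if the file is missing."""
    path = Path(root) / name
    if not path.exists():
        return None
    return parse_tcp_mem(path.read_text())


if __name__ == "__main__":
    for name in ("tcp_rmem", "tcp_wmem"):
        print(name, read_tcp_mem(name))
```

Comparing the "default" and "max" values here against the window the Windows side actually advertises is one way to see how far apart the two ends are before changing anything.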
I noticed that Linux to Linux and Windows to Windows (in synthetic tests) were in line with expectations (basically line rate), but when you mix them things get ugly. My assumption is that the two OSes use different algorithms to determine an optimal TCP window size.
I'm very much looking forward to seeing if others have similar experiences.
In my limited experience with this card, using more Rx/Tx buffers boosts performance the most. If that does not do the trick, monitor per-CPU load with Process Explorer. I would set RSS Queues to 4, and set the starting RSS CPU on the second port to something other than 0 (it's 16 on the PC that I use, with 16 cores, 32 HTs).
Please report your findings. I don't believe this is optimal, but it does the trick for me. The limitation here is MTU=1500. If I could use jumbo packets, I'd reduce the CPU interrupt and DPC load dramatically, and attain the 20 Gb/sec half-duplex speed.
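To put a rough number on the jumbo packet point: at line rate the NIC has to handle one frame per MTU's worth of data, so a larger MTU means far fewer frames per second for the host to process. The sketch below assumes a 9000-byte jumbo MTU (a common value, not something from this setup) and standard Ethernet framing overhead; interrupt moderation means interrupts do not track frames one for one, so treat this as an upper bound on the benefit.

```python
# Per-frame bytes on the wire beyond the MTU payload:
# 14 (Ethernet header) + 4 (FCS) + 8 (preamble) + 12 (inter-frame gap)
ETHERNET_OVERHEAD = 38


def frames_per_second(link_gbps: float, mtu: int) -> float:
    """Frames per second needed to keep one direction of the link full."""
    return link_gbps * 1e9 / ((mtu + ETHERNET_OVERHEAD) * 8)


if __name__ == "__main__":
    standard = frames_per_second(10, 1500)
    jumbo = frames_per_second(10, 9000)  # ASSUMPTION: 9000-byte jumbo MTU
    print(f"MTU 1500: {standard:,.0f} frames/sec per 10 Gb/s direction")
    print(f"MTU 9000: {jumbo:,.0f} frames/sec per 10 Gb/s direction")
    print(f"reduction: about {standard / jumbo:.1f}x fewer frames")
```

Every device on the path, including the switch, has to be set to the larger MTU for this to apply.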
Don't forget to test with a RAM drive instead of an actual disk array to isolate NIC performance metrics.