NetEffect 10Gb Server 2012 SMBDirect BSOD under load (0x0D1)
I've built a proof-of-concept environment for Server 2012, Scale-Out File Servers, SMBDirect (using iWARP) and Hyper-V 2012 nodes.
Essentially, I've got 4 Scale-Out File Servers that host the Fibre Channel CSV volumes (with CSV caching enabled, 20% of RAM) and then share out the storage via Continuously Available shares using SMB3 and RDMA/SMBDirect.
Each file server (4) and each Hyper-V server (6) has a single 10Gb RDMA adapter (the Hyper-V servers also use dedicated X520-DA2 NICs for VM networking). The file server and Hyper-V server RDMA adapters are on the same L2 VLAN on a common Cisco Nexus 5K switch.
Everything was working pretty well until I reached about 250 concurrent VMs. Periodically, a file server node would BSOD (0xD1, DRIVER_IRQL_NOT_LESS_OR_EQUAL, smbdirect.sys), but the file cluster handled these failures gracefully.
As I increased load further, Hyper-V servers started failing with the same error.
Past a certain load level, a Hyper-V failure would cause its VMs to fail over to other nodes, which put heavy load on those nodes and caused them to BSOD as well (a cascade that even reached the file servers).
I was able to stabilize the environment by disabling NetworkDirect in the Adapter properties (essentially turning off RDMA), and have taken the workload to over 535 running VMs.
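For anyone who wants to try the same workaround without clicking through the adapter properties on every node, a rough PowerShell sketch follows. It should be roughly equivalent to disabling NetworkDirect in the GUI, but treat it as an untested outline: the adapter name "RDMA1" is a placeholder and needs to be replaced with whatever Get-NetAdapterRdma reports on each server.

```powershell
# List adapters and whether RDMA (NetworkDirect) is currently enabled on each
Get-NetAdapterRdma

# Disable RDMA on one adapter; "RDMA1" is a placeholder adapter name
Disable-NetAdapterRdma -Name "RDMA1"

# Check whether SMB still reports the interface as RDMA capable
Get-SmbClientNetworkInterface
Get-SmbServerNetworkInterface

# To retest with RDMA later, re-enable it on the same adapter:
# Enable-NetAdapterRdma -Name "RDMA1"
```

With RDMA off, SMB should fall back to plain TCP on the same 10Gb adapters, so the shares stay reachable.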
While I understand that the crash dump isn't directly pointing to the N2E63x64.sys driver, these errors are typically driver related. I am using the "latest" drivers (v22.214.171.124, 10/19/2012) and the issue only appears under load. I am fully patch compliant and have installed all recommended 2012 & Hyper-V Cluster Hotfixes outlined in KB2784261.
File servers are HP380G6 servers (2x L5630, 48GB RAM), and Hyper-V servers are HP585G7 servers (4x AMD 6172, 256GB RAM), all with the latest BIOS and drivers from HP.
Has anyone else seen similar behavior? And most importantly... how do we fix it?