IBM dx360 m4 hosts randomly (and frequently) NMI with Xeon Phi

Chris_Samuel · ‎05-24-2013

Hi there,

In our new IBM iDataplex cluster, which I'm building where I work, we have 10 nodes with dual Xeon Phis (B1PRQ-5110P/5120D), running RHEL 6.3 with the latest MPSS and associated firmware and bootloader. We see constant issues where many of these nodes randomly reset due to NMIs; only 3 of the 10 nodes have not succumbed to this yet. The console logs capture these thus:

Uhhuh. NMI received for unknown reason 2d on CPU 0.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
micscif_handle_lostnode 1380 node 1
Warning: Core image elf header not found
Kdump: vmcore not initialized
micscif_handle_lostnode 1392 node 1 crash dump failed status -22
mic0: Transition from state online to lost
micscif_handle_lostnode 1407 stopping node 1 to recover lost node!
micvnet_initiate_link_down timeout waiting for Tx dma buffers to drain
micvnet_execute_stop: timeout waiting for link down message response
br0: port 2(mic0) entering disabled state

and then the system resets itself. This can happen whilst the host node is idle or doing burn-in testing with HPL (but without the Phi being involved). Sometimes we don't even see the "Dazed and confused" message; we just get the following (from the same host as the above):

[-- MARK -- Sat May 25 11:00:00 2013]
micscif_handle_lostnode 1380 node 1
Warning: Core image elf header not found
Kdump: vmcore not initialized
micscif_handle_lostnode 1392 node 1 crash dump failed status -22
mic0: Transition from state online to lost
micscif_handle_lostnode 1407 stopping node 1 to recover lost node!
micvnet_initiate_link_down timeout waiting for Tx dma buffers to drain
micvnet_execute_stop: timeout waiting for link down message response
br0: port 2(mic0) entering disabled state

The host BMC (IMM in IBM speak) reports (matching the above):

05/25/2013 11:35:34 Critical Interrupt, Software NMI (NMI State)
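For anyone wanting to tally how often each signature shows up across captured console logs, here is a minimal sketch in shell. It assumes plain-text log files with one console message per line. The function name `count_nmi_events` is made up for illustration, and the two search strings are simply taken from the console output quoted in this post.

```shell
#!/bin/sh
# Count NMI and lost-coprocessor signatures in captured console logs.
# Hypothetical helper: the name and output format are illustrative only.
# Usage: count_nmi_events LOGFILE...
count_nmi_events() {
    for log in "$@"; do
        # grep -c prints the number of matching lines (0 when there are none);
        # "|| true" keeps a zero-match result from being treated as a failure.
        nmi=$(grep -c 'NMI received for unknown reason' "$log" || true)
        lost=$(grep -c 'Transition from state online to lost' "$log" || true)
        printf '%s nmi=%s lost=%s\n' "$log" "$nmi" "$lost"
    done
}
```

Called with one or more log files, it prints one summary line per file, which makes it easier to line the counts up against the IMM event log timestamps.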

These are the firmware versions on our cards:

Flash Version : 2.1.02.0386
SMC Firmware Version : 1.14.4616
SMC Boot Loader Version : 1.8.4326

Any ideas?

All the best,
Chris

Chris_Samuel · ‎05-26-2013

Could this be a result of errata CD47 and/or Intel Tracking ID: 4117447?

Both of those relate to NMIs and may be the same issue, just reported in different ways.

Frances_R_Intel · ‎05-28-2013

Yes, it is possible that the problem you are seeing is related to an NMI that results when the system is brought back from a power saving mode. I have passed your issue on to the developers, but for now you could try disabling power management as a workaround:

For each coprocessor, in the mic<n>.conf file (where n is the coprocessor number), change the PowerManagement entry to:

PowerManagement "cpufreq_on;corec6_off;pc3_off;pc6_off"

Then

[bash]

service mpss stop

micctrl --resetconfig

service mpss start

[/bash]
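On a node with several cards, the edit to each mic<n>.conf could be scripted. The following is only a sketch under stated assumptions: the function name `set_mic_power_management` is hypothetical, the directory holding the mic<n>.conf files is passed in by the caller because its location depends on the MPSS install, and the entry is assumed to sit on a single line starting with `PowerManagement`. Each file is backed up to `<file>.bak` before it is rewritten.

```shell
#!/bin/sh
# Set the PowerManagement entry in every mic<n>.conf under a directory.
# Hypothetical helper; the config directory is an assumption supplied by the caller.
# Usage: set_mic_power_management CONF_DIR VALUE
set_mic_power_management() {
    conf_dir=$1
    value=$2
    for conf in "$conf_dir"/mic[0-9]*.conf; do
        # Skip the unexpanded glob pattern when no files match.
        [ -f "$conf" ] || continue
        # Keep a backup copy alongside the original.
        cp -p "$conf" "$conf.bak"
        if grep -q '^PowerManagement' "$conf"; then
            # Replace the existing entry in place.
            sed "s|^PowerManagement.*|PowerManagement \"$value\"|" "$conf.bak" > "$conf"
        else
            # No entry yet: append one.
            printf 'PowerManagement "%s"\n' "$value" >> "$conf"
        fi
    done
}
```

For example, `set_mic_power_management "$CONF_DIR" 'cpufreq_on;corec6_off;pc3_off;pc6_off'`, with `CONF_DIR` set to wherever your mic<n>.conf files live. The stop, `micctrl --resetconfig`, and start sequence above still needs to be run around the change.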

Chris_Samuel · ‎05-28-2013

Hi Frances,

Thanks for that. I'd already disabled pc6 but not pc3; I've just applied that to the cluster now and we'll see what happens.

Is it usual for these NMIs to reset systems in this way? Or is that more likely to be related to IBM's firmware?

All the best!
Chris

Frances_R_Intel · ‎06-03-2013

How was your system stability over the weekend with all the power settings off?

Frances

Chris_Samuel · ‎06-03-2013

Hi Frances,

We've not had a single NMI reset since disabling PC6 in addition to our already disabled PC3. Hooray! :-)

Thanks so much for the advice.

Is the plan to try and fix this in future MPSS releases to allow PC3 and PC6 to be safely enabled again?

All the best!
Chris