Hello. I've got two servers. On one of them the BMC has stopped responding to IPMI queries/commands, which caused an issue the other day when the OS crashed and the cluster was unable to fence that server.
I can get to the BMC web interface, and ipmiping gives a response. It just refuses to work with ipmitool. The other node works just fine.
Is there any way of working out/fixing this problem without opening up the server? It's a live production server and I'd rather not take it offline unless it's vital.
It seems to be at the latest version:
Firmware Revision : 1.18
Also, I can't see a way of doing that without powering the server down. I was hoping to be able to solve the problem without doing that as I'd have to arrange a maintenance window to switch the live website over to the other node.
I arranged a maintenance window and re-flashed the BMC firmware. I also removed the power cable for half a minute to ensure that the BMC rebooted.
I'm still getting the same problems; I can log into the BMC web interface, I can get IPMI sensor readings but I can't get/run any chassis commands over IPMI.
This node is part of our production cluster, so it needs to run IPMI chassis commands for STONITH. Does anyone have any other ideas of what might be wrong?
IPMI sensors working implies the BMC is in full communication with the sensors.
Are you logged into the BMC at an administrator level?
I believe you can see this in the configuration tab, under users in the web interface.
If you are logged in as a user, you can read sensors and logs but not affect machine operation.
Yes, I'm logged in at admin level. On the web interface all the chassis functions work - I can see the power state and power off, restart etc.
It's via IPMI (using ipmitool) that I can't run chassis commands, and that's also connecting as the admin user. This is for one node in a SR1670HV server. The other node's BMC works as expected. The chassis commands used to work on the problem node until a few weeks ago. The firmware re-flash didn't help, nor did removing power to the server for a while.
I'm pretty stumped about this.
The "ipmitool sensor" command should report all the system sensors.
Any differences in the sensors reported between the 2 nodes?
Did anything change around the time the node stopped allowing you to control it?
What is the specific command you are using, and which account name (see spec update errata 1)?
(I am running out of ideas. Maybe someone else has some)
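For the sensor comparison suggested above, here is a rough Python sketch that diffs the sensor names reported by the two nodes. It assumes you have saved the text output of "ipmitool sensor" from each node and that the output is in the usual pipe-separated layout with the sensor name in the first column. The function names and the sample readings are made up for illustration.

```python
def sensor_names(ipmitool_sensor_output):
    """Collect sensor names from saved `ipmitool sensor` text output.

    Assumes the pipe-separated layout, with the name in the first column.
    """
    names = set()
    for line in ipmitool_sensor_output.splitlines():
        if "|" not in line:
            continue  # skip blank lines and anything that is not a sensor row
        name = line.split("|", 1)[0].strip()
        if name:
            names.add(name)
    return names


def diff_sensors(output_a, output_b):
    """Return (names only in A, names only in B), each sorted."""
    a = sensor_names(output_a)
    b = sensor_names(output_b)
    return sorted(a - b), sorted(b - a)


# Made-up sample data, only to show the expected input shape.
good_node = "CPU Temp | 45.000 | degrees C | ok\nFan 1 | 5000.000 | RPM | ok\n"
bad_node = "CPU Temp | 45.000 | degrees C | ok\n"
only_good, only_bad = diff_sensors(good_node, bad_node)
```

If both lists come back empty, the two BMCs are exposing the same set of sensors and the difference is more likely in how chassis commands are handled.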
On further investigation, it turns out that the problem is confined to the ipmitool command. Other tools, such as ipmi-chassis and ipmi-sensors, work. Unfortunately the cluster uses ipmitool to fence the node. I don't know why it's just ipmitool (and I've tried it from other servers and distributions) that doesn't work. Perhaps this is a protocol issue?
The workaround for now would be to configure the cluster to use a different module for node fencing.
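One way to test the protocol theory would be to run the same chassis query through both of ipmitool's LAN interfaces, "lan" (IPMI 1.5) and "lanplus" (IPMI 2.0), and compare the results. The Python sketch below is only a guess at a useful diagnostic, not a confirmed fix. The host and user names in the usage comment are placeholders, and the password is expected in the IPMI_PASSWORD environment variable, which ipmitool reads when given -E.

```python
import subprocess


def chassis_status_cmd(host, user, interface):
    """Build an ipmitool command line for `chassis status`.

    -I selects the interface ("lan" or "lanplus"), -E makes ipmitool take
    the password from the IPMI_PASSWORD environment variable, and -v adds
    verbose output.
    """
    return ["ipmitool", "-I", interface, "-H", host, "-U", user,
            "-E", "-v", "chassis", "status"]


def try_interfaces(host, user, interfaces=("lan", "lanplus")):
    """Run `chassis status` over each interface and record what happened."""
    results = {}
    for interface in interfaces:
        cmd = chassis_status_cmd(host, user, interface)
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True,
                                  timeout=30)
            results[interface] = (proc.returncode, proc.stderr.strip())
        except subprocess.TimeoutExpired:
            results[interface] = (None, "timed out")
    return results


# Example usage (placeholder host and user; needs ipmitool installed):
#   for name, (code, err) in try_interfaces("bmc-node1.example", "admin").items():
#       print(name, code, err)
```

If one interface works and the other does not, the fencing agent could be pointed at the working one, assuming it exposes an option for that.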