
1 year old server now randomly shuts down

Htech
Beginner

I have a production Intel server that was installed in Dec 2020. A build sheet screenshot is below. It worked fine for the first year, but for about the last 3 months it has randomly powered down every few weeks. It will not power back on unless power is disconnected from the power supplies for a few seconds and then reconnected.

I swapped both power supplies but no change. 

In the BMC log I get:

SensorType:Power Unit SensorName:Pwr Unit Status Description:Power Unit Failure detected - Power supply failure detected - Asserted

That coincides with the shutdown.

This happened today at event 1346. I'll attach the SEL text log.

There are no warnings or log entries immediately prior. The previous event before 1346 is from almost 2 weeks earlier.
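
For anyone who wants to check an exported log the same way, here is a minimal Python sketch that scans log lines for Power Unit failure assertions and returns the few lines preceding each one. It assumes each event sits on one line and carries the `SensorType:`, `SensorName:` and `Description:` fields exactly as in the entry quoted above. Anything else on the line (event number, timestamp) is left unparsed because that layout is not shown here. The function names and the `context` parameter are hypothetical, chosen only for illustration.

```python
import re

# Field layout assumed from the single BMC log entry quoted in this post.
FIELD_RE = re.compile(
    r"SensorType:(?P<sensor_type>.*?)\s+"
    r"SensorName:(?P<sensor_name>.*?)\s+"
    r"Description:(?P<description>.*)$"
)


def parse_sel_line(line):
    """Return the three fields as a dict, or None if the line does not match."""
    match = FIELD_RE.search(line)
    if match is None:
        return None
    return {key: value.strip() for key, value in match.groupdict().items()}


def power_unit_failures(lines, context=3):
    """Find asserted Power Unit failures.

    Returns a list of (line_index, parsed_event, preceding_lines) tuples,
    where preceding_lines holds up to `context` raw lines before the event.
    """
    hits = []
    for index, line in enumerate(lines):
        event = parse_sel_line(line)
        if event is None:
            continue
        description = event["description"]
        if (
            event["sensor_type"] == "Power Unit"
            and "Failure" in description
            and description.endswith("- Asserted")
        ):
            hits.append((index, event, lines[max(0, index - context):index]))
    return hits
```

Calling `power_unit_failures(open(path).read().splitlines())` on the exported text file (with `path` pointing at wherever the export was saved) gives one tuple per matching event, which makes it easy to confirm whether anything was logged just before each shutdown.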

Need direction as to how to troubleshoot further.

OS is VMware 6.5.

fred

 

BMC Summary:

Htech_0-1647103400370.png

Htech_2-1647103597834.png

 

Build sheet (screenshot)

 

5 Replies
Victor_G_Intel
Employee

Hello Htech,


Thank you for posting on the Intel® communities.


To continue with your request, can you please answer the following questions:


1- Did you check that all power cables and adapters are connected properly (the AC cables as well as the cables between the PSU and system components)?


2- We understand that you swapped both power supplies that were installed in the system, but have you tried a known-good power supply from a different system, if possible?


3- Did you verify that the system's fans and HDDs are all properly connected?


4- When the issue happened, do you remember whether any of the power supplies showed an amber light when the server came back up? Can you share a picture of the current status of both power supplies' LEDs?


5- Can you please share with us the current status/staging of the server (pre-live, maintenance mode, or live)?


6- Please share the current environment of the server as well (Production, QA, Official Test, Lab).


Regards,


Victor G.

Intel Technical Support Technician


Htech
Beginner

1) Only external connections have been checked.

 

2) No I don’t have another system to swap/test power supplies.

 

3) No. But I’d expect an event to be logged if that was the case.

 

4) When server comes up power LEDs always show 1 solid green, 1 flashing green. I do believe they did show amber when server suddenly powered off.  And as mentioned before, power cords need to be removed from PS and reconnected or server will not power back on.

I’ve attached a brief mp4 file that shows 1 LED is solid green. 1 is flashing.

 

5) Live

 

6) Production

Victor_G_Intel
Employee

Hello Htech,


Thank you so much for your response.


Based on the entries found in the log and the information you have provided, everything seems to point to a power issue. However, we don't believe the power supplies themselves are to blame, since the issue is intermittent and the current LED activity on both PSUs is correct. To continue with the case, please answer the following questions.


1- Is the server connected directly to an outlet? If yes, have you tried a different outlet during these three months of experiencing issues?


2- Is the server connected to the outlet through a UPS or a UPI? If yes, did you make sure the UPS being used is sine-wave compliant? If a UPI (patch panel) is being used, have you tested a different port on it?


Best regards,


Victor G.

Intel Technical Support Technician  


Htech
Beginner

1 power supply is connected to an APC SMT2200 UPS that outputs a pure sine wave. Power demand at the UPS shows only about 200 watts. It can handle nearly 2000.

1 power supply is connected to an APC Surge Suppressor which is connected to a standard outlet, no UPS used on this one.  Previously, this one was connected to a different surge suppressor and that was removed after the second server outage. No change.

 

Both PSUs are on a circuit dedicated to the server.

 

Of significance is that the APC has no power failure events recorded.  

 

I am having the circuits checked to see if a nearby outlet is NOT on that circuit for testing purposes. 

 

But the fact that the UPS shows no fault, and that there are 2 power supplies for redundancy, has me at a loss. And while I know any logged entry can be incorrect, the power supply event is the only evidence.

What about the Power Distribution Board FUPPDBHC2? Could a failure of this component cause these issues while logging a power supply failure? Seems plausible.

 

   

Victor_G_Intel
Employee

Hello Htech,


Thank you for your response.


Your supposition is plausible: the power distribution board could be the cause, since this doesn't really look like a power supply issue. To move forward, we are going to create a private ticket in our system for you so that we can request some information that can't be shared through the forum for security purposes. Once you respond to that ticket, we will proceed with the replacement order there.


Please wait for our email, and don't worry: this thread will remain open so that we can verify whether the behavior recurs after the power distribution board has been replaced and tested.


Regards,


Victor G.

Intel Technical Support Technician  

