S2600TPR System Unstable

timmjm · ‎06-15-2021

Hello,

We have 36 or so machines with the S2600TPR server board.

One of them reboots frequently, typically about once a week.

The BMC logs a lot of messages, but I suspect this one is the reason for the reboots:

PECI over DMI interface error. This is a notification that PECI over DMI interface failure was detected and it is not functional any more. - DMI timeout of PECI request - Asserted

I have updated to the latest available BIOS and disabled C-states in the firmware (a quick search pointed to a Dell issue that was fixed this way), but neither change has made a difference.

The system is running fairly heavy virtual machines on ESXi 6.7.

Is there anything we can do to diagnose this further?

Debug logs are attached.

This appears to have been an ongoing issue for years on this particular server but it is now increasing in frequency.

Thank you

JoseH_Intel · ‎06-15-2021

Hello timmjm,

Thank you for joining the Intel community.

The suggested fix for these errors is to update the BIOS to the latest available version, which you have already done. PECI (Platform Environment Control Interface) is used for thermal management, so the error might point to an overheating issue on one of the CPUs. I think this would be a good starting point. Unfortunately, I cannot check the debug logs because they are password protected, so it would be much easier for me if you could extract and attach the SEL logs.

I will look forward to your updates

Let me know if this helps

Regards

Jose A.

Intel Customer Support Technician

For firmware updates and troubleshooting tips, visit:

https://intel.com/support/serverbios

timmjm · ‎06-15-2021

Hi Jose,

Thank you for the response.

We will investigate the overheating scenario and check the heatsinks on each CPU as you suggest.

I have attached the SEL files in case they help with further diagnosis.

JoseH_Intel · ‎06-15-2021

Hello timmjm,

I found no PECI errors in the SEL log. However, I did find some temperature-related errors, such as these two:

EventID:0136 Time Stamp:06/07/2021 08:06:00 SensorName:BMC FW Health Sensor Type:Management Subsystem Health Description:'P1 Therm Ctrl %' sensor has failed and may not be providing a valid reading -Asserted

EventID:0160 Time Stamp:06/13/2021 10:47:57 SensorName:P1 Therm Ctrl % Sensor Type:Temperature Description:reports the sensor is high, critical, and going higher state -Asserted

Besides that, I found some IERR errors, which are usually related to memory:

EventID:0148 Time Stamp:06/13/2021 10:44:41 SensorName:IERR Sensor Type:Processor Description:reports it has been asserted -Asserted

As suggested earlier, I think checking the temperatures would be a good start. You can also rule out the memory by swapping in known-good modules.
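The SEL entries quoted above follow a regular `Key:Value` layout, so the thermal and IERR events can be pulled out of a larger export with a short script. The following is only a minimal sketch, assuming every line of the exported SEL text has the same five fields as the quoted entries and that timestamps are in month/day/year order. The names `parse_sel_line`, `is_thermal`, and `ierr_near_thermal`, and the 10-minute window, are hypothetical choices for illustration and are not part of any Intel tool.

```python
import re
from datetime import datetime, timedelta

# Pattern inferred from the three SEL entries quoted in this thread (an assumption,
# not an official format specification).
SEL_LINE = re.compile(
    r"EventID:(?P<event_id>\S+)\s+"
    r"Time Stamp:(?P<timestamp>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+"
    r"SensorName:(?P<sensor>.+?)\s+"
    r"Sensor Type:(?P<sensor_type>.+?)\s+"
    r"Description:(?P<description>.+)$"
)


def parse_sel_line(line):
    """Return a dict of fields for one SEL line, or None if it does not match."""
    match = SEL_LINE.search(line.strip())
    if match is None:
        return None
    event = match.groupdict()
    # Assumes MM/DD/YYYY, which is consistent with "06/13/2021" in the thread.
    event["timestamp"] = datetime.strptime(event["timestamp"], "%m/%d/%Y %H:%M:%S")
    return event


def is_thermal(event):
    """Heuristic: temperature sensors, or any entry that mentions a 'Therm' sensor."""
    return (
        event["sensor_type"] == "Temperature"
        or "Therm" in event["sensor"]
        or "Therm" in event["description"]
    )


def ierr_near_thermal(events, window=timedelta(minutes=10)):
    """Pair each IERR event with thermal events logged within `window` of it."""
    thermal = [e for e in events if is_thermal(e)]
    ierr = [e for e in events if e["sensor"] == "IERR"]
    return [
        (i, t)
        for i in ierr
        for t in thermal
        if abs(i["timestamp"] - t["timestamp"]) <= window
    ]


# The three entries quoted in this thread.
SAMPLE = [
    "EventID:0136 Time Stamp:06/07/2021 08:06:00 SensorName:BMC FW Health "
    "Sensor Type:Management Subsystem Health Description:'P1 Therm Ctrl %' sensor "
    "has failed and may not be providing a valid reading -Asserted",
    "EventID:0160 Time Stamp:06/13/2021 10:47:57 SensorName:P1 Therm Ctrl % "
    "Sensor Type:Temperature Description:reports the sensor is high, critical, "
    "and going higher state -Asserted",
    "EventID:0148 Time Stamp:06/13/2021 10:44:41 SensorName:IERR "
    "Sensor Type:Processor Description:reports it has been asserted -Asserted",
]

events = [e for e in map(parse_sel_line, SAMPLE) if e is not None]
for ierr, thermal in ierr_near_thermal(events):
    gap = abs(ierr["timestamp"] - thermal["timestamp"])
    print(f"IERR {ierr['event_id']} is within {gap} of "
          f"{thermal['event_id']} ({thermal['sensor']})")
```

On the three quoted entries, this pairs IERR event 0148 with the P1 Therm Ctrl % event 0160, logged roughly three minutes apart on 06/13/2021. A close pairing like this does not prove a thermal cause, but it may help decide whether to check cooling or memory first.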

I will look forward to your updates.

Jose A.

Intel Customer Support Technician


JoseH_Intel · ‎06-17-2021

Hello timmjm,

I am just following up to check whether you found the provided information useful. If you have further questions, please don't hesitate to ask. If you consider the issue resolved, please let us know so we can mark this ticket as resolved. I will try to reach you next Tuesday, the 22nd. After that, the thread will be archived automatically.

Regards

Jose A.

Intel Customer Support Technician

JoseH_Intel · ‎06-22-2021

Hello timmjm,

We will now mark this thread as resolved. If you have further issues or questions, please submit a new topic.

Regards

Jose A.

Intel Customer Support Technician