
S2600TPR System Unstable

timmjm
Beginner

Hello,

We have 36 or so machines with the S2600TPR server board.

One of them keeps rebooting frequently, typically once a week.

There are a lot of messages in the BMC log, but I suspect the following one is the reason for the reboots:

PECI over DMI interface error. This is a notification that PECI over DMI interface failure was detected and it is not functional any more. - DMI timeout of PECI request - Asserted

I have updated to the latest available BIOS and disabled C-states in the firmware (a quick search pointed to a Dell issue that was fixed by doing this), but neither change has made a difference.

The system is running fairly heavy virtual machines on ESXi 6.7.

Is there anything we can do to diagnose this further? 

Debug logs are attached.

This appears to have been an ongoing issue for years on this particular server but it is now increasing in frequency.

Thank you

5 Replies
JoseH_Intel
Moderator

Hello timmjm,


Thank you for joining the Intel community


The recommendation for this error is to update the BIOS to the latest available version, which you have already done. PECI (Platform Environment Control Interface) is the interface used to read processor temperatures for thermal management, so this error might point to an overheating issue on one of the CPUs. I think this would be a good starting point. Unfortunately, I cannot check the debug logs because they are password protected, so it would be much easier for me if you could extract and attach the SEL logs.


I will look forward to your updates 


Let me know if this helps


Regards


Jose A.

Intel Customer Support Technician

For firmware updates and troubleshooting tips, visit:

https://intel.com/support/serverbios


timmjm
Beginner

Hi Jose,

Thank you for the response.

We will investigate the overheating scenario and check the heatsinks on each CPU as you suggest.

I have attached the SEL files in case they help with further diagnosis.

 

JoseH_Intel
Moderator

Hello timmjm,


I found no PECI errors in the SEL log. However, I did find some temperature-related errors, such as these two:


EventID:0136 Time Stamp:06/07/2021 08:06:00 SensorName:BMC FW Health     Sensor Type:Management Subsystem Health        Description:'P1 Therm Ctrl %' sensor has failed and may not be providing a valid reading -Asserted


EventID:0160 Time Stamp:06/13/2021 10:47:57 SensorName:P1 Therm Ctrl %    Sensor Type:Temperature                Description:reports the sensor is high, critical, and going higher state -Asserted


Besides that, I found some IERR errors, which are usually related to memory:


EventID:0148 Time Stamp:06/13/2021 10:44:41 SensorName:IERR         Sensor Type:Processor                 Description:reports it has been asserted -Asserted


As suggested earlier, I think checking temperatures would be a good start. You can also rule out the memory by swapping in known-good modules.
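For anyone who wants to go through a larger SEL export the same way, here is a minimal Python sketch that parses entries in the one-line text format quoted above, sorts them by time, and counts events per sensor. The field layout is assumed from the three quoted entries only, the timestamps are assumed to be MM/DD/YYYY, and the names `SEL_LINE`, `parse_sel`, and `SAMPLE` are made up for illustration; a real export may need the pattern adjusted.

```python
import re
from collections import Counter
from datetime import datetime

# Pattern inferred from the three SEL entries quoted in this thread
# (assumption: every entry uses the same field order on a single line).
SEL_LINE = re.compile(
    r"EventID:(?P<event_id>\S+)\s+"
    r"Time Stamp:(?P<timestamp>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\s+"
    r"SensorName:(?P<sensor>.+?)\s+"
    r"Sensor Type:(?P<sensor_type>.+?)\s+"
    r"Description:(?P<description>.*)"
)


def parse_sel(text):
    """Return the SEL entries found in `text`, sorted chronologically."""
    events = []
    for line in text.splitlines():
        match = SEL_LINE.search(line.strip())
        if match is None:
            continue  # skip blank lines and anything in another format
        event = match.groupdict()
        # Assumption: timestamps are month/day/year, as 06/13/2021 suggests.
        event["timestamp"] = datetime.strptime(
            event["timestamp"], "%m/%d/%Y %H:%M:%S"
        )
        event["description"] = event["description"].strip()
        events.append(event)
    return sorted(events, key=lambda e: e["timestamp"])


# The three entries quoted above, deliberately out of chronological order.
SAMPLE = """
EventID:0136 Time Stamp:06/07/2021 08:06:00 SensorName:BMC FW Health     Sensor Type:Management Subsystem Health        Description:'P1 Therm Ctrl %' sensor has failed and may not be providing a valid reading -Asserted
EventID:0160 Time Stamp:06/13/2021 10:47:57 SensorName:P1 Therm Ctrl %    Sensor Type:Temperature                Description:reports the sensor is high, critical, and going higher state -Asserted
EventID:0148 Time Stamp:06/13/2021 10:44:41 SensorName:IERR         Sensor Type:Processor                 Description:reports it has been asserted -Asserted
"""

events = parse_sel(SAMPLE)
per_sensor = Counter(event["sensor"] for event in events)

for event in events:
    print(event["timestamp"], event["sensor"], "|", event["description"])
print(per_sensor.most_common())
```

Sorting by timestamp makes it easier to see which event came first around each reboot. In the entries quoted here, for example, the IERR at 10:44:41 is logged shortly before the 'P1 Therm Ctrl %' critical event at 10:47:57 on the same day.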


I will look forward to your updates.


Jose A.

Intel Customer Support Technician



JoseH_Intel
Moderator

Hello timmjm,


I am just following up to check whether you found the information provided useful. If you have further questions, please don't hesitate to ask. If you consider the issue resolved, please let us know so we can mark this ticket as resolved. I will try to reach you next Tuesday, the 22nd. After that, the thread will be archived automatically.


Regards


Jose A.

Intel Customer Support Technician


JoseH_Intel
Moderator

Hello timmjm,


We will proceed to mark this thread as resolved. If you have further issues or questions, please go ahead and submit a new topic.


Regards


Jose A.

Intel Customer Support Technician

