Re: S2600WTTR - DIMM Thrm Mrgn unknown --> BMC Health

BBosb · ‎12-23-2019

We have two identical Intel servers with S2600WTTR serverboards with same hardware configuration.

Both servers have recently started showing an unknown status in Server Health.

Error:

Name - Status - Health - Reading

DIMM Thrm Mrgn 2 - Normal - Unknown - Not Available

Strange thing as well is that one server has 74 sensors and the other has 82.

One server also gives this error on DIMM Thrm Mrgn 4, whereas the one with fewer sensors does not report that sensor at all.
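To see exactly which sensors differ between the two boards, the sensor names from each BMC can be compared as sets. This is a minimal Python sketch, assuming the names have been copied out of each server's health page into plain lists. The `diff_sensors` helper and the shortened sample lists are hypothetical and are not taken from the actual servers.

```python
def diff_sensors(server_a, server_b):
    """Return (only_on_a, only_on_b) as sorted lists of sensor names."""
    a, b = set(server_a), set(server_b)
    return sorted(a - b), sorted(b - a)


# Hypothetical, shortened sensor lists for illustration only.
server_a = ["P1 DTS Therm Mgn", "DIMM Thrm Mrgn 1", "DIMM Thrm Mrgn 2"]
server_b = ["P1 DTS Therm Mgn", "DIMM Thrm Mrgn 1", "DIMM Thrm Mrgn 2",
            "DIMM Thrm Mrgn 4"]

only_a, only_b = diff_sensors(server_a, server_b)
print("Only on server A:", only_a)
print("Only on server B:", only_b)
```

Any name that shows up in only one list is a sensor that one BMC exposes and the other does not.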

I can’t find much about the errors; found a topic here where support took over so here I am posting :).

Hope someone can help!

BMC FW Build Time : 2018-06-07 11:48:53

BIOS ID : SE5C610.86B.01.01.0027.071020182329

BMC FW Rev : 1.53.11210

Boot FW Rev : 1.07

SDR Package Version : SDR Package 1.17

Mgmt Engine (ME) FW Rev : 03.01.03.050

Thanks in advance!

JoseH_Intel · ‎12-24-2019

Hello BBosb,

Thank you for joining the community

You say this issue has appeared recently. Were both servers working fine before? When did this issue start? Have you performed any hardware or firmware changes recently?

Could you please attach the OEM and part number for the memory DIMMs so we can check in our records?

Besides that, please attach a sysinfo log.

Also, there is an updated version of the BIOS, v0028.

We look forward to your updates.

Jose A.

Intel Customer Support Technician

A Contingent Worker at Intel

BBosb · ‎12-24-2019

Hi Jose,

Thanks for the quick reply!

According to the manual server checks we perform, it has happened before, so it's been over two months with this 'error'. No hardware changes have been made in the meantime, and there are no noticeable problems running the server. When I look through the documents of the checks I see it also happened in the past but was gone by the next check. (Maybe a simple reboot ‘fixes’ this problem?)

The last software/BIOS/firmware round was when we installed BIOS 0027. That must be some time ago; I need to check exactly when this happened if that information is needed. During that round we also updated the HBA and NIC firmware.

These are storage servers which don’t have a CLI or shell, so I can’t run sysinfo at this time.

We need to plan some downtime for this maintenance to do this in EFI.

We have six other ESX hosts which are running the same board with a slightly different hardware configuration; these are running BIOS 0028, but as far as we know they have never had this error on the BMC.

The two storage servers are running:

2x CPU E5-2609 v3 – the ESX hosts have: CPU E5-2697A v4

4x Samsung M393A4K40BB1-CRC (32GB each, slots A1, B1, E1, F1) – the ESX hosts have these same modules, only 12 of them instead.

HBA: LSI SAS3008 - SAS9300-8E - 12Gbit – (not in ESX)

NIC: Mellanox ConnectX4 – (same)

I will try to do some maintenance and arrange a log file ASAP.

But maybe you can do something already with above information.

Thanks in advance!

Best regards,

-Bram

JoseH_Intel · ‎12-29-2019

Hello BBosb,

Thanks for the updates provided. We will check if any known issue could be found related to this error message. The sysinfo log will definitely help since it will gather a lot more info about the whole status of the server.

We could keep this thread open until your next maintenance window if you think it will happen in the coming weeks. If you have an ETA, that will help.

We will get in touch with you soon.

Regards

Jose A.

Intel Customer Support Technician

A Contingent Worker at Intel

JoseH_Intel · ‎01-02-2020

Hello BBosb,

Looking at the error message description in the board TPS, it should specify which processor it refers to with either P1 or P2 at the beginning of the error message, like:

Processor 1 DIMM Aggregate Thermal Margin 2 (P1 DIMM Thrm Mrgn2)

By any chance have you seen this info?

Jose A.

Intel Customer Support Technician

A Contingent Worker at Intel

BBosb · ‎01-02-2020

Hi Jose,

Unfortunately, the only thing being displayed in the Health status https/bmc is:

‘DIMM Thrm Mrgn 1’

Maybe with other tools I can see different information but BMC/CLP (SSH) does not seem to be able to do much with this sensor, but I could be wrong?

What I can say which might be of some use is that a server reboot does not resolve this error.

What I will do this week or the next is boot remotely into UEFI and start the diagnostics. The only thing that might be an issue is that, as far as I have tried, UEFI does not detect a redirected ISO or USB with the remote console. The trick I used is to make a ‘floppy image’, which does get detected but won’t be writeable, so I will see how that goes.

Is it wise to also update the BIOS? Should I do it after the diagnostics or before (or both)?

Thanks in advance.

JoseH_Intel · ‎01-03-2020

Hello BBosb,

Please try to take a sysinfo log before and after performing the BIOS update, which by the way is highly recommended.

We will wait for your outcome after performing those steps.

Jose A.

Intel Customer Support Technician

A Contingent Worker at Intel

BBosb · ‎01-03-2020

I have sent you a PM with topic '0D50P00004XxcDnSAJ' containing download info.

The BIOS is upgraded and both logs are attached. We still see the same message in the BMC, and the updated server still has the same number of sensors (84), where the other server only has 74.

I will update the BIOS on the other server as well this weekend because we would like to keep the configuration the same.

JoseH_Intel · ‎01-06-2020

Hello BBosb,

Thanks for the info. We will analyze both logs and will let you know the findings.

Jose A.

Intel Customer Support Technician

A Contingent Worker at Intel

BBosb · ‎01-06-2020

Update on the BIOS upgrade of the other server, the one which had fewer sensors.

The update did fix the number of sensors (they both have 82 now), but it still has both ‘unknown’ sensors. I also sent a PM containing log files, taken only after the update.

JoseH_Intel · ‎01-07-2020

Hello BBosb,

Thanks for the updates. At least we know that a firmware update fixes the number of sensors.

I was trying to look at your logs but they are password protected. Could you please reattach them?

Regards

Jose A.

Intel Customer Support Technician

A Contingent Worker at Intel

BBosb · ‎01-07-2020

Password is provided in the PM

JoseH_Intel · ‎01-08-2020

Hello BBosb,

I found the password. My mistake.

After looking at the sysinfo log I found no errors. What I did see were some DIMM temperature related READING_UNAVAILABLE messages. There are also some other DIMMs with negative temperature values, which does not make much sense either.

Could you please confirm the RAM modules are Samsung M393A4K40BB1-CRC 32 GB?

Jose A.

Intel Customer Support Technician

A Contingent Worker at Intel

BBosb · ‎01-08-2020

Yes, they are all M393A4K40BB1-CRC. As mentioned before, we have 6 other ESX hosts with the same board and 12 of these modules.

They do not display an unknown status on DIMM Thrm Mrgn 1/2/3/4 but do indeed have negative temperature values as well. The 'P1 DTS Therm Mgn' and (P2) values are also negative, but the rest is normal. Since it's 'green', I always figured it was just a visual bug that it displays a minus (-) before the value.

JoseH_Intel · ‎01-09-2020

Hello BBosb,

I will research further about these negative temperature values on DIMM sensors.

We will let you know soon

Jose A.

Intel Customer Support Technician

A Contingent Worker at Intel

BBosb · ‎01-16-2020

Any update on this? Do you also want/need the logs of the servers with the same board/memory which do have values on the sensors? Maybe that helps diagnose the problem of the missing values?

As said before, they also have negative temperatures:

DIMM Thrm Mrgn 1 ** Normal ** OK ** -59 degrees C

DIMM Thrm Mrgn 2 ** Normal ** OK ** -58 degrees C

DIMM Thrm Mrgn 3 ** Normal ** OK ** -55 degrees C

DIMM Thrm Mrgn 4 ** Normal ** OK ** -59 degrees C

JoseH_Intel · ‎01-28-2020

Hello BBosb,

I apologize for the delay. The following is the information provided by our engineering team:

"What you are seeing is normal.

DIMM thermal margin readings are not actual temperature readings of each DIMM installed in the system.

Temperature readings from each DIMM (TSOD) are aggregated into IPMI temp margin sensors for a group of DIMMs. What we see in the log is a thermal margin reading for a group of DIMMs (not a single DIMM).

So in this case, DIMM A1 and DIMM B1 could be in the same group and E1 and F1 in another group.

Margin sensor readings will be a negative value since it's an offset to the critical temperature. It's not an actual temperature."
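The arithmetic behind this can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 85 degrees C critical temperature, the per-DIMM readings, and the choice to represent a group by its hottest DIMM are all hypothetical and are not confirmed by Intel's documentation.

```python
CRITICAL_TEMP_C = 85.0  # hypothetical critical threshold, for illustration only


def group_thermal_margin(dimm_temps_c, critical_c=CRITICAL_TEMP_C):
    """Margin of a DIMM group as an offset from the critical temperature.

    Assumption: the group is represented by its hottest DIMM.
    A negative result means the group is below the critical temperature.
    """
    return max(dimm_temps_c) - critical_c


# Hypothetical TSOD readings for two DIMMs in one group (e.g. A1 and B1).
margin = group_thermal_margin([26.0, 24.5])
print(f"Group margin: {margin:+.0f} degrees C")
```

With these made-up inputs the margin comes out negative, the same kind of value the BMC shows. The further below zero it is, the more headroom remains before the critical temperature.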

Hope it helps

Jose A.

Intel Customer Support Technician

A Contingent Worker at Intel

BBosb · ‎01-28-2020

Hi Jose,

No problem.

If the ‘Unknown’ status for these sensors is also normal, which is the main question, you can set this topic as answered.

Best regards,

-Bram

JoseH_Intel · ‎01-29-2020

Hello BBosb,

Actually, it is. The temp margin sensor represents an IPMI group of DIMMs, so when it says unknown it refers to the whole aggregate, not a single reading. This is pretty much the way this sensor was designed to work.
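One way to picture a group-level sensor reporting "Unknown" is the rough Python sketch below. It assumes, hypothetically, that a group reading is unavailable when no DIMM in the group returns a temperature; the real BMC logic may differ. The 85 degrees C threshold, the helper names, and the sample readings are illustrative only.

```python
from typing import Optional, Sequence

CRITICAL_TEMP_C = 85.0  # hypothetical critical threshold


def aggregate_margin(temps: Sequence[Optional[float]],
                     critical_c: float = CRITICAL_TEMP_C) -> Optional[float]:
    """Return the group margin, or None when no DIMM reading is available."""
    available = [t for t in temps if t is not None]
    if not available:
        return None  # assumption: nothing to aggregate, so no reading
    return max(available) - critical_c


def describe(name: str, temps: Sequence[Optional[float]]) -> str:
    """Format one row roughly like the BMC health page does."""
    margin = aggregate_margin(temps)
    if margin is None:
        return f"{name} - Unknown - Not Available"
    return f"{name} - OK - {margin:+.0f} degrees C"


# Hypothetical groups: one with readings, one with none at all.
print(describe("DIMM Thrm Mrgn 1", [26.0, 24.5]))
print(describe("DIMM Thrm Mrgn 2", [None, None]))
```

In this model the "Unknown" row describes the group as a whole and says nothing about any individual DIMM being faulty.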

Regards

Jose A.

Intel Customer Support Technician

A Contingent Worker at Intel