
S2600WFT, Sensor Readings - DIMM Thrm Mrgn - not available, no reading

mradon
Beginner
1,082 Views

We have two servers with identical S2600WFT boards but a different RAM setup. On one of the servers there is no reading for DIMM Thrm Mrgn 1 to 4.

In "Server Health -> Sensor Readings"

Unknown DIMM Thrm Mrgn 1 Not Available No Reading
Unknown DIMM Thrm Mrgn 2 Not Available No Reading
Unknown DIMM Thrm Mrgn 3 Not Available No Reading
Unknown DIMM Thrm Mrgn 4 Not Available No Reading

(see attached screenshot). 

Screen Shot 2022-02-09 at 20.29.28.png

In the second server, everything is fine with DIMM Thrm Mrgn sensors. Also all other sensors are readable and the total number of sensors is identical in the two servers, as it should be for the same mainboard.
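A check like this can also be scripted when the sensor table is available as text. The sketch below is only an illustration: it assumes the rows have been copied out in the layout quoted above (status, sensor name, then the reading text), and the function names are hypothetical.

```python
import re

# Assumed row layout, taken from the rows quoted above:
#   "<status> DIMM Thrm Mrgn <n> <reading text>"
ROW = re.compile(r"^(?P<status>\S+)\s+(?P<name>DIMM Thrm Mrgn \d+)\s+(?P<reading>.+)$")

def parse_rows(text):
    """Return {sensor name: (status, reading text)} for DIMM Thrm Mrgn rows."""
    sensors = {}
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            sensors[m["name"]] = (m["status"], m["reading"].strip())
    return sensors

def unreadable(sensors):
    """Names of sensors whose reading text contains 'No Reading'."""
    return sorted(n for n, (_, r) in sensors.items() if "No Reading" in r)

server1 = """\
Unknown DIMM Thrm Mrgn 1 Not Available No Reading
Unknown DIMM Thrm Mrgn 2 Not Available No Reading
Unknown DIMM Thrm Mrgn 3 Not Available No Reading
Unknown DIMM Thrm Mrgn 4 Not Available No Reading
"""

print(unreadable(parse_rows(server1)))
```

Running the same functions on the table from the second server would give an empty list if all four sensors report a value there.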

The BIOS and BMC firmware on the server where these "DIMM Thrm Mrgn" sensors are unavailable are updated to the most recent versions (BIOS 02.01.0014, BMC fw 2.86.2da97d3f, SDR 2.02, ME 04.01.04.505). The problem persists after restarting the server. The installed RAM is Samsung M393A4K40CB2-CTD (all 24 DIMM slots populated).

 

Is the lack of reading from all these "DIMM Thrm Mrgn" sensors something to worry about?

How could the problem be solved?

 

Thanks in advance!

Mariusz

 

18 Replies
SergioS_Intel
Moderator
1,067 Views

Hello mradon,


Thank you for contacting Intel Customer Support.

 

We understand that the DIMM Thrm Mrgn sensors show "Not Available, No Reading" on your Intel® Server Board S2600WFT.


I will be more than glad to help you today.


Could you please provide us with the Debug Logs? These logs can be extracted by going to BMC Console > System Information > System Debug Log > Generate Log. 


Also, please provide us with the SEL logs. These logs can be extracted by going to BMC Console > Server Health > Event Log > Save Event Log. 


Looking forward to your updates.


Best regards,

Sergio S.

Intel Customer Support Technician

For firmware updates and troubleshooting tips, visit: https://intel.com/support/serverbios


mradon
Beginner
1,051 Views

Dear Sergio,

Thank you for your reply. I attached the Debug Log and SEL logs to my support request (ticket number 05367174).

 

I am looking forward to hearing from you.

 

Best regards,

Mariusz Radon

SergioS_Intel
Moderator
1,042 Views

Hello mradon,


We appreciate the additional information. After checking the logs we did not find any errors.


Have you tried testing different memory on your system or swapping the memory from the system that is not giving you errors?

 

This will tell us if the issue follows the memory DIMM or the memory slot.


Looking forward to your updates.


Best regards,

Sergio S.

Intel Customer Support Technician




SergioS_Intel
Moderator
1,033 Views

Hello mradon,


We are following your case and would like to know if you need further assistance.


Looking forward to your updates.


Best regards,

Sergio S.

Intel Technical Support Technician



mradon
Beginner
1,015 Views

Dear Sergio:

Thank you, and apologies for the late reply. Yes, of course I need your further assistance to solve this case!

You wrote "after checking the logs we did not find any errors". Does this mean that the lack of reading from all "DIMM Thrm Mrgn" (1 to 4) sensors is acceptable on this mainboard? Otherwise, the system is functioning normally.

Regarding your questions: "Have you tried testing different memory on your system or swapping the memory from the system that is not giving you errors?" and "This will tell us if the issue follows the memory DIMM or the memory slot", I am not sure if I understood you properly... How can the lack of reading from the DIMM Thrm Mrgn sensors be related to problems with specific DIMMs and/or memory slots? This system has 24 DIMM slots (all of them populated with Samsung M393A4K40CB2-CTD), but there are only four DIMM Thrm Mrgn sensors.

If you were speaking about the ECC errors present in the system event log that I provided: these ECC errors appeared only after intense computation or memtest and were not fatal; they did indeed follow specific DIMMs (not memory slots). All the DIMMs that generated these ECC errors were replaced, and there is still no reading from any of the four DIMM Thrm Mrgn sensors. I am not sure whether these ECC errors are related in any way to the lack of reading from the DIMM Thrm Mrgn sensors (it might be just a coincidence: because of the ECC errors I put this system under close scrutiny, and that is how I found the lack of reading from the DIMM Thrm Mrgn sensors).

Looking forward to your reply, and thank you in advance!

 

Best regards,

Mariusz Radon

SergioS_Intel
Moderator
1,010 Views

Hello mradon,


Thank you for the information. Please allow us to check on your issue and we will get back to you.

 

 

 Best regards,

 Sergio S.

 Intel Customer Support Technician

 


 


SergioS_Intel
Moderator
991 Views

Hello mradon,


Thank you for waiting for our updates.

 

DIMM thermal margin readings are not actual temperature readings of each DIMM installed in the system.


Temperature readings from each DIMM (TSOD) are aggregated into IPMI temp margin sensors for a group of DIMMs.


What we see in the log is a thermal margin reading for a group of DIMMs.


Please check the following link for System Event Log Troubleshooting Guides for Intel® Server Boards:


https://www.intel.com/content/www/us/en/support/articles/000006888/server-products.html


See section 5.2.2, Thermal Margin Sensors, for more details.
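As a rough illustration of the aggregation described above (not the BMC's actual algorithm, which is not given in this thread), the sketch below assumes that each DIMM's margin is its TSOD temperature minus a throttling threshold, and that the group sensor reports the worst case, i.e. the margin closest to the threshold. The threshold value, the grouping, and the function names are all hypothetical. The sketch returns None when no DIMM in the group supplies a temperature, which is how it models a "No Reading" state; whether that matches the cause in this case is not established here.

```python
from typing import Optional, Sequence

def dimm_margin(tsod_temp_c: float, threshold_c: float) -> float:
    """Margin of one DIMM: negative while the DIMM is below the threshold."""
    return tsod_temp_c - threshold_c

def group_margin(tsod_temps_c: Sequence[Optional[float]],
                 threshold_c: float) -> Optional[float]:
    """Worst-case (largest) margin over the DIMMs that report a temperature.

    Returns None when no DIMM in the group reports one.
    """
    margins = [dimm_margin(t, threshold_c) for t in tsod_temps_c if t is not None]
    return max(margins) if margins else None

THRESHOLD_C = 85.0  # hypothetical threshold, for illustration only

print(group_margin([41.0, 44.5, 39.0], THRESHOLD_C))  # -40.5
print(group_margin([None, None, None], THRESHOLD_C))  # None
```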


Now, after checking the logs, we noticed one error event (CPU: 1, DIMM: D1) back on 2/8, and there are no other related entries.


Can you please clear the event log and monitor the system to see if you see any other ECC errors on that slot?


Finally, please provide us the following information:


1. Could you please provide us with the details of the current environment of the server (Production, QA, Official Test, Lab)? 

2. May we know what is the status/staging of the server (pre-live, maintenance mode or live)?  


Looking forward to your updates.


Best regards,

Sergio S.

Intel Customer Support Technician



mradon
Beginner
959 Views

Dear Sergio:

The error you spotted was an ECC error from a single DIMM (CPU1_DIMM_D1). This module was replaced on 2022-02-16, but nothing has changed with respect to the lack of reading from the "DIMM Thrm Mrgn" sensors. In case you need them, new Debug Logs and an Event Log were generated today (many days after that DIMM was replaced); these files are attached to my request # 05367174.

 

Regarding your question "Can you please clear the event log and monitor the system to see if they see any other ECC error on that slot?" Of course, I will monitor the system for ECC errors. However, I prefer not to clear the event log (to keep as much history as possible); I am aware that SEL is becoming a bit long, but newer events can be easily distinguished from older ones by date and time.
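Separating newer events from older ones by timestamp, as described above, can also be done with a small script. This is a minimal sketch: the pipe-separated record layout and the sample entries are assumptions made for illustration, not the actual format or contents of the saved event log.

```python
from datetime import datetime

def parse_entry(line):
    """Split one assumed 'id | MM/DD/YYYY | HH:MM:SS | description' record."""
    entry_id, date, time, description = [f.strip() for f in line.split("|", 3)]
    stamp = datetime.strptime(f"{date} {time}", "%m/%d/%Y %H:%M:%S")
    return entry_id, stamp, description

def events_since(lines, cutoff):
    """Keep only the entries logged at or after the cutoff."""
    parsed = (parse_entry(l) for l in lines if l.strip())
    return [e for e in parsed if e[1] >= cutoff]

# Made-up sample records; the real log will differ.
sample = [
    "1 | 02/08/2022 | 10:15:00 | Memory ECC error (example)",
    "2 | 03/01/2022 | 08:30:00 | Example event after the replacement",
]

# 2022-02-16 is the DIMM replacement date mentioned in this post.
for entry in events_since(sample, datetime(2022, 2, 16)):
    print(entry[0], entry[1].isoformat(), entry[2])
```

With a filter like this the full history stays in the log, and only the view is narrowed to the period of interest.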

 

Regarding your comment: "Please check the following link for System Event Log Troubleshooting Guides for Intel® Server Boards":

I have seen this document, in particular section 5.2.2, but I cannot see how it can be helpful for my issue. The problem is that I do not have any reading from the "DIMM Thrm Mrgn" sensors, not that I have some logged SEL events from these sensors.

 

Finally, regarding your question "1. Could you please provide us with the details of the current environment of the server (Production, QA, Official Test, Lab)?  2. May we know what is the status/staging of the server (pre-live, maintenance mode or live)? ":

If I understood correctly: 1- Production.  2 - live.

 

Looking forward to hearing from you.

 

Best regards,

Mariusz

SergioS_Intel
Moderator
890 Views

Hello mradon,


Thank you for the information. Please allow us to check it and we will get back to you.


Best regards,

Sergio S.

Intel Customer Support Technician



SergioS_Intel
Moderator
862 Views

Hello mradon,


Thank you for waiting for our updates.


In order to continue assisting you with your problem, can you please provide more details about what you meant by a "different RAM setup" between the two servers?


Are both servers using different DIMM models? You mentioned that the affected server is using Samsung M393A4K40CB2-CTD.


What's the DIMM model of the other server?


Also, are the 2 servers running with the same BIOS/firmware?

 

Looking forward to your updates.


Best regards,

Sergio S.

Intel Customer Support Technician



mradon
Beginner
849 Views

Dear Sergio:

Server 1 (problem: no reading from DIMM Thrm Mrgn sensors): 24x32 GB M393A4K40CB2-CTD, BIOS R02.01.0014 (latest).

Server 2 (no such problem): 14x64GB HMAA8GR7A2R4N-VN, BIOS R02.01.0013

 

Looking forward,

Mariusz

SergioS_Intel
Moderator
813 Views

Hello mradon,


Thank you for the information. Please allow us to check it and we will get back to you.


Best regards,

Sergio S.

Intel Customer Support Technician



Victor_G_Intel
Moderator
800 Views

Hello mradon,


Thank you so much for your patience.


To move forward, could you please try the steps below and let us know the results?


Steps


1. Remove all DIMMs from Server 1 and install only one HMAA8GR7A2R4N-VN DIMM on Server 1 (problem: no reading from DIMM Thrm Mrgn sensors). Now, is there any reading from the DIMM Thrm Mrgn sensor?


2. If no reading is received on step 1, re-update SUP R02.01.0014 and make sure to update both FRU and SDR with the SUP. Now, is there any reading from the DIMM Thrm Mrgn sensor after re-updating?


3. Remove all DIMMs from Server 2 and install only one M393A4K40CB2-CTD DIMM on Server 2 (no such problem). Is there any reading from the DIMM Thrm Mrgn sensor?
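One way to read the outcomes of these three checks is sketched below. The mapping from results to conclusions is an assumption made for illustration; it is not stated in the steps above, and the function and parameter names are hypothetical.

```python
from typing import Optional

def interpret(step1_reading: bool,
              step2_reading: Optional[bool],
              step3_reading: bool) -> str:
    """Map the three test outcomes to a tentative conclusion.

    step1_reading: Server 1 with one HMAA8GR7A2R4N-VN DIMM shows a reading.
    step2_reading: Server 1 shows a reading after re-applying the update
                   (None if step 2 was not needed).
    step3_reading: Server 2 with one M393A4K40CB2-CTD DIMM shows a reading.
    """
    if step1_reading and not step3_reading:
        return "the issue appears to follow the DIMM model"
    if not step1_reading and step2_reading:
        return "the issue appears to be cleared by re-applying the update"
    if not step1_reading and step2_reading is False and step3_reading:
        return "the issue appears to follow Server 1 itself"
    return "inconclusive, more testing needed"

print(interpret(True, None, False))
print(interpret(False, False, True))
```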


We will be waiting for your response.


Best regards,


Victor G.

Intel Technical Support Technician


mradon
Beginner
735 Views

Dear Victor:

Thank you. I'll try to perform these tests during the next maintenance break and report the results to the forum. But this won't be soon: these servers are production machines and I cannot turn them off just to run such tests.

Please, don't close the thread.

 

Best wishes,

Mariusz

Victor_G_Intel
Moderator
712 Views

Hello mradon,


Thank you so much for your response.


Please take as much time as you need. We will monitor the thread for now and we will wait for the outcome of the steps provided.


Best regards,


Victor G.

Intel Technical Support Technician  


SergioS_Intel
Moderator
628 Views

Hello mradon,


We are following your case and would like to know if you were able to perform the steps provided before.


Looking forward to your updates.


Best regards,

Sergio S.

Intel Customer Support Technician



mradon
Beginner
620 Views

Dear Sergio,

As I said before, this testing will have to wait until the next maintenance break. Realistically, that could be several months from now.

Note that to perform the tests you requested, I will need to power down (at the same time) both of our servers, which are production machines. I will let you know about the results, but this won't be soon.

 

Best regards,

Mariusz

SergioS_Intel
Moderator
616 Views

Hello mradon,


We appreciate the additional information. We will be looking forward to your updates.


Best regards,

Sergio S.

Intel Customer Support Technician


