S2600ST sensor "VRD hot" asserted - but nothing is actually running hot

UlrichP
Beginner

Hi all,

 

Maybe someone has an idea regarding "VRD hot" being asserted. My system is an S2600STBR with a Xeon Silver 4210 and 4x Kingston KSM26RD8/16HDI. The system runs well and I do not see any issues, except that the "System Status LED" is blinking amber (at a 1-second interval). It is about the VRD hot sensor: the SEL says "CPU2, DIMM Channel 1/2", but CPU2 and its DIMM slots are not even populated!

One can touch all heatsinks on the board; they are not even warm (and definitely not hot).

Anyhow, I installed two additional chassis fans in the server case, but that did not change anything.

I already updated the system from the initial firmware to the latest BIOS/BMC package (02.01.0014), but no change.

Well, I thought populating CPU2 and DIMM Channel 1/2 might change the game, but it doesn't. CPU2 and DIMM Channel 1/2 are recognized and run without issues.

I tried resetting CMOS (by jumper on the S2600STBR) and also the BMC "Reset to factory" - no luck.

The result is the same when resetting the BMC (via ipmitool mc reset cold).

Any idea how to remedy that? If you need more information, just ask.

Just for completeness: the power supply unit can deliver 750 W. CPU1, CPU2 and MB power are connected; the additional 4-pin "12V aux power" is not. I hope this is not the source of my trouble.

 

Thanks in advance, Ulrich 

32 Replies
Paul_R_Intel
Moderator

Hello UlrichP, 

 

Thank you for joining the community.


I understand that you have "VRD hot" asserted on your Intel Server Board S2600STBR. We need to collect more details about your system so that we can investigate further.


Therefore, please use our tool to generate a System Event Log and send it to us for review.


System Event Log (SEL) Viewer Utility


Please check this article on how to extract the SEL for your Intel Server Board:


We would also like to know whether you are using an Intel chassis, and we recommend reseating all fans, including the processor heatsink fans.


We will be waiting for your response.

 

Regards 

 


Paul R. 

Intel Customer Support Technician 

For firmware updates and troubleshooting tips, visit: 

https://intel.com/support/serverbios 


UlrichP
Beginner

Hi Paul,

 

Thanks for the quick response. Well, I do not use an Intel chassis, just a big tower case of suitable size.

Please find attached the current SEL of my system. A couple of additional notes:

I initially updated from the factory firmware to the most recent BIOS R02.01.0014 package.

As I found in the forum, it is better to update sequentially, version by version, so I downgraded to BIOS 0015 (including clear CMOS, setting BIOS to defaults and resetting the BMC to factory). Afterwards I applied each update after 0015 (0016, 02.01.0008 .. 02.01.0014).

When the update to 02.01.0012 had finished, the BMC status LED went green! In that case the BMC update actually ran twice: the first time right after startup.nsh started, and a second time right after the FRUSDR update. This seems to be exceptional behaviour of the R02.01.0012 startup.nsh...

But once I proceeded with updates 02.01.0013 and 02.01.0014, the BMC status LED went back to blinking amber :-(.

By the way, the downgrade to 0015 and 0016 did not go well: these releases do not like my memory (even worse, they report "no memory").

So I used the fantastic BIOS recovery feature to get into the UEFI Shell for the next update sequence...

Regarding fans: I use original Intel accessories: one BXSTS300C and one AXXSTPHMKIT2U, both with carrier clip. CPU1 has the (older) BXSTS300C installed, CPU2 has the AXXSTPHMKIT2U. Maybe one ugly thing is that these two heatsinks have the fan installed on opposite sides, so the CPU2 fan blows air towards CPU1. Normally, two heatsinks of the same make and model would not "fire" against each other (right?). In addition, I have two chassis fans connected to chassis fan connectors 1 and 7 (simple fans without PWM).

Anyhow, VRD hot was asserted before I installed CPU2, the DIMMs in CPU2 DIMM slots A1, B1, D1 and E1, and the simple chassis fans. So the root cause must be something different.

Hope this helps. 

 Regards, Uli

Paul_R_Intel
Moderator

Hello UlrichP, 

 

Thank you for the update. We will investigate further and I will get back to you as soon as possible.

 

Regards 


Paul R. 

Intel Customer Support Technician 

Paul_R_Intel
Moderator

Hello UlrichP, 

 

Thank you for your patience and time. After analyzing the issue, we suspect that the BMC sensors could be returning false readings because a non-validated chassis and non-validated fans are in use.


The only further test is to set up the board in a minimum configuration (1 CPU, basic DIMM population) and see the results. Clear the BMC logs first, then send a new set of logs for further analysis.
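A minimal sketch of the clear-then-recollect sequence, assuming `ipmitool` (already used earlier in this thread) is installed and can reach the BMC. The helper names `sel_command` and `run_step` and the `dry_run` flag are hypothetical, and the SEL Viewer Utility remains the tool requested above. By default the sketch only prints the commands instead of running them.

```python
import subprocess

# Subcommands of ipmitool's "sel" command used in this sketch.
SEL_ACTIONS = {
    "clear": ["ipmitool", "sel", "clear"],   # erase all existing SEL entries
    "list": ["ipmitool", "sel", "elist"],    # print entries in extended form
}

def sel_command(action):
    """Return the ipmitool argument list for a named step."""
    return list(SEL_ACTIONS[action])

def run_step(action, dry_run=True):
    """Show the command (dry run) or execute it and return its output."""
    cmd = sel_command(action)
    if dry_run:
        return "DRY RUN: " + " ".join(cmd)
    done = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return done.stdout

# Step 1: clear the log, then boot the minimum configuration.
print(run_step("clear"))
# Step 2: once the amber LED is back, capture the fresh entries.
print(run_step("list"))
```

Clearing before the test keeps entries from earlier firmware updates and resets out of the new log, so whatever appears afterwards belongs to the minimum configuration.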


Regarding the power supply, Intel suggests 750 W AC power and 80 Plus Platinum efficiency (80 Plus Platinum: 90% efficiency @ 20% load; 92% efficiency and a power factor of 0.95 @ 50% load; 89% efficiency @ 100% load). Which certifications does your power supply have, and can you provide the model number?
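To put the quoted efficiency figures into numbers, here is a small worked calculation for a 750 W supply. It uses only the three Platinum points given above; the function name `input_power` is hypothetical and the result is plain arithmetic, not a measurement of any particular PSU.

```python
RATED_OUTPUT_W = 750

# Load fraction -> efficiency, as quoted for 80 Plus Platinum above.
PLATINUM_EFFICIENCY = {0.20: 0.90, 0.50: 0.92, 1.00: 0.89}

def input_power(load_fraction, efficiency, rated_w=RATED_OUTPUT_W):
    """Return (output W, input W drawn from the wall, W lost as heat)."""
    output_w = rated_w * load_fraction
    input_w = output_w / efficiency
    return output_w, input_w, input_w - output_w

for load, eff in PLATINUM_EFFICIENCY.items():
    out_w, in_w, loss_w = input_power(load, eff)
    print(f"{load:.0%} load: {out_w:.0f} W out, {in_w:.1f} W in, {loss_w:.1f} W lost")
```

At 50% load, for example, 375 W of output at 92% efficiency corresponds to roughly 408 W drawn from the wall.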


Regards 


Paul R. 

Intel Customer Support Technician 

UlrichP
Beginner

Hello Paul,

 

First: I did further tests as you advised, but no luck. Please find the SEL attached (this time taken from the BMC web console).

I think it is time to talk about alternative approaches.

First and most important question: can I safely operate this board despite the "false" reading of the VRD hot sensor? Any thoughts on this?

Second, I have three S2600STBR boards, none of them mounted in a chassis. The other two work very well (green).

What I observed is a difference between the boards: mine reads J17012-600, while the other, working ones are older models: J17012-551 and J17012-552. Can this make all the difference?

Third, my PSU is a Be Quiet! STRAIGHT POWER 11, 750W, model E11-750W. It has an 80+ Gold certification. The very same model is used in all my other systems, so maybe the PSU in my new system is faulty, but in general this PSU model works fine (even though it is not certified).

Fourth question: operating an S2600STBR without a chassis and the FRU belonging to it should be possible without issues. Is the chassis a "must" in order to ensure proper operation and, most importantly, to get correct readings from the BMC sensors? As I have two other systems with the same setup that work fine, I am confused ;-).

Fifth question: did you have a look at the memory modules I use? In the current system I have KSM26RD8/16HDI, as the recommended ones are no longer available. In the other two systems I have certified KSM24RS8/8MEI. The main difference is single rank with a 1Gx72-bit configuration vs. dual rank with a 2Gx72-bit configuration. Kingston recommends KSM26RD8/16HDI for the S2600ST, but there is no certification from Intel. Are there known cases of false BMC sensor readings caused by memory modules that are not certified but technically/electrically suitable? The BIOS is fine with the memory, saying "No errors found".

Last question: is there a way to reset all BMC settings, including the SDR/FRU area, SSL certificate etc., to factory settings? I observe that a couple of settings (the ones mentioned above) persist, even after pulling the CMOS battery, setting the BIOS to defaults (F9) and setting the BMC to factory defaults. This "clear all" might be a hidden feature, but it would be helpful in this case...

Dear Paul, I know it is a lot of input, but this issue really drives me crazy, as I do not have any clue how to fix it. I never had such issues before... I hope you can help.

 

Kind Regards, Uli

 

P.S.: Regarding the SEL: I started with clear CMOS, BIOS defaults, BMC defaults and CPU1 + 1 DIMM only, and later added CPU2 and all 8 DIMMs. Maybe this clarifies questions when analysing the SEL.

Paul_R_Intel
Moderator

Hello UlrichP, 

 

Thank you for your patience and time. After analyzing the logs, we see that the issue persists. We would like to know the following:


  • Have you tried to replicate the environment of the systems you have up and running? This means swapping the PSUs and memory sticks, and taking this board out of the chassis just for testing.



Regarding how to reset the BMC, here are the instructions to follow:


https://www.intel.com/content/www/us/en/support/articles/000029866/server-products/server-boards.html


It is definitely not recommended to run the server outside a validated chassis or without a chassis at all. This is the chassis that has been validated:


Intel® Server Chassis P4304XXMUXX



I am investigating the other questions that you have and I will get back to you as soon as possible!


Regards 


Paul R. 

Intel Customer Support Technician 

Paul_R_Intel
Moderator

Hello UlrichP, 


I hope you are doing great. I would like to know whether you were able to review the previous post.


Regards 


Paul R. 

Intel Customer Support Technician 

UlrichP
Beginner

Hi Paul,

 

Thanks for the inquiry. Just a few points today:

 

Replicate environment (using existing parts like PSU & RAM from other systems):

Short answer: not yet. The other two systems are in productive use out there, so I cannot just grab parts from them. I have to wait until the next planned maintenance with downtime.

 

BMC reset:

Actually, I asked for some "total" reset. The one described I already know; it does only a partial BMC reset and keeps policy, some security settings and all sensor readings. The point is to get rid of any memory of false sensor readings, which this "soft reset" does not do.

 

Chassis P4304XXMUXX:

As of now my board says "No FRU found". Does this chassis come with the desired FRU parts? As "Field Replaceable Unit" is quite a broad term, I am uncertain whether updating the FRU/SDR areas while using that certified chassis would solve the "FRU is missing" issue. Does it? And I would still not know whether this would heal the false reading of VRD hot... That is quite a number of bucks to spend just to answer a, without any doubt, really interesting question.

 

My most important question is still: may I operate that board without harm (simply ignoring the false reading)?

As the source of that reading is not identified (or the reading is simply false), I would love to discuss that question.

Would this, as another alternative, be a reason to go for an RMA? Maybe this is simply a faulty board (manufactured on a Monday ;-)?

 

Regards, Uli

Paul_R_Intel
Moderator

Hello UlrichP, 

 

Thank you for the update. We will investigate your inquiries further and I will get back to you as soon as possible.

 

Regards 


Paul R. 

Intel Customer Support Technician 

Paul_R_Intel
Moderator

Hello UlrichP, 

 

Thank you for your patience and time. Based on our investigation, the sensor readings are live data that is updated frequently and does not need to be cleared. If you would like to clear the logs, you can do so using the following steps:

 

[Image: bmc.jpg]

 

The missing FRU information should be resolved when using the Intel® Server Chassis P4304XXMUXX, but the validated Intel fans for this chassis are also required (fan part name: FUPMLHSFAN).

 

Additional note: You will need to run an SDR firmware update using the firmware package for the S2600STBR in order to update this field and receive the FRUSDR information (Latest version here).

 

The board could work as usual, but you would not know with certainty when one of these alarms is genuine. A real failure could then be ignored and, in the long term, cause a bigger problem whose source cannot be pinpointed, because we would not know which alerts are true and which are false.

 

So we recommend that you set up the environment using only validated parts, so that we can confirm that the board is indeed faulty before you claim the warranty.

 

Please let us know if there is anything else that we can do for you.

 

Paul R. 

Intel Customer Support Technician 

UlrichP
Beginner

Hello Paul,

 

Thanks for the further investigation. I ordered a P4304XXMUXX and an FXX750PCRPS #915604, which will be delivered this week (CW4).

I think you are absolutely right: claiming warranty requires using validated parts only.

I will keep you posted about the outcome once the chassis is in use...

Have a great week ahead,

 

Regards, Uli

Wanner_G_Intel
Moderator

Hello UlrichP,


Thank you for your update. 


We look forward to hearing from you after you receive the validated parts.


Wanner G.

Intel Customer Support Technician


UlrichP
Beginner

Hi Wanner G.,

 

OK, the validated parts were delivered quite fast and are in place already.

But, and I am frustrated, there is no change at all: the status LED keeps blinking amber.

What I did after assembly: BIOS defaults (F9) and BMC reset to factory defaults.

CMOS reset by jumper (5 s while powered off and detached from power).

Then Boot Manager (F6): EFI Shell, running the FRU+SDR update via UpdS2600STBFRUSDR.nsh.

As I am trying to get as much information as possible, I also enabled alerts by mail.

Here is what I got after the reset:

Mail 01/25/2022 19:53

Event that generated this alert:

RID:0025 TS:01/25/2022 19:52:44 SN:OOB FM update ST:System Event ED:OEM System Boot Event ET:Asserted EC:OK

RID:0025 RT:02 TS:61F046FC GID:0001 ER:04 ST:12 S#:83 ET:6F ED:01 FF FF EX:00 00 00 00 00 00 00 00

Mail 01/25/2022 19:59

Event that generated this alert:

RID:0036 TS:01/26/1976 19:58:11 SN:OOB FM update ST:System Event ED:OEM System Boot Event ET:Asserted EC:OK

RID:0036 RT:02 TS:0B6A86C3 GID:0001 ER:04 ST:12 S#:83 ET:6F ED:01 FF FF EX:00 00 00 00 00 00 00 00

Mail 01/25/2022 19:59

Event that generated this alert:

RID:0038 TS:01/26/1976 19:58:11 SN:POST Err Sensor ST:System Firmware Progress ED:System Firmware Error ET:Asserted EC:OK

RID:0038 RT:02 TS:0B6A86C3 GID:0001 ER:04 ST:0F S#:06 ET:6F ED:A0 12 00 EX:F8 C2 DF B6 70 8E 20 AB

Mail 01/25/2022 19:59

Event that generated this alert:

RID:003A TS:01/26/1976 19:58:11 SN:POST Err Sensor ST:System Firmware Progress ED:System Firmware Error ET:Asserted EC:OK

RID:003A RT:02 TS:0B6A86C3 GID:0001 ER:04 ST:0F S#:06 ET:6F ED:A0 20 52 EX:F8 C2 DF B6 70 8E 20 AB
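The raw lines in these alerts (the ones carrying `RT:`) can be split into fields mechanically. The sketch below does that and decodes `TS`, assuming it is a hexadecimal count of seconds since 1970-01-01 UTC; that assumption fits the dates printed in the mails. The function names are hypothetical, and the parser only handles the raw lines, not the decoded ones with free-text fields.

```python
from datetime import datetime, timezone

RAW = ("RID:0025 RT:02 TS:61F046FC GID:0001 ER:04 ST:12 S#:83 "
       "ET:6F ED:01 FF FF EX:00 00 00 00 00 00 00 00")

def parse_raw_record(line):
    """Split KEY:value tokens; bare tokens extend the previous field (ED, EX)."""
    fields, key = {}, None
    for token in line.split():
        if ":" in token:
            key, value = token.split(":", 1)
            fields[key] = [value]
        elif key is not None:
            fields[key].append(token)
    return {name: " ".join(parts) for name, parts in fields.items()}

def timestamp_utc(fields):
    """Interpret TS as hexadecimal seconds since the Unix epoch (assumption)."""
    return datetime.fromtimestamp(int(fields["TS"], 16), tz=timezone.utc)

record = parse_raw_record(RAW)
print(record["S#"], "|", record["ED"], "|", timestamp_utc(record).isoformat())
```

Under that assumption the first record decodes to 2022-01-25 18:52:44 UTC, one hour behind the 19:52:44 shown in the mail, which would be consistent with a local-time display at UTC+1. The later records (TS:0B6A86C3) decode to 1976-01-26 18:58:11 UTC, which suggests the clock stamping those events had lost the real date by then.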

 

Is there something helpful in it? 

And VRD hot still reads 0x8002, which points to CPU2: DIMM Channel 1+2 (which is definitely wrong, as this sensor showed this value even before a second CPU and any memory for CPU2 were added to the system).
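For reference, the value 0x8002 can be broken down into its individual bits as sketched below. How those bits map to "CPU2, DIMM Channel 1/2" is defined by the board's sensor documentation and is not stated in this thread. Masking bit 15 is an assumption, based on the IPMI convention that the top bit of a discrete sensor's state word is reserved and returned as 1. The function name is hypothetical.

```python
def asserted_bits(reading):
    """Return the positions of all bits that are set in the reading."""
    return [bit for bit in range(reading.bit_length()) if (reading >> bit) & 1]

reading = 0x8002
print(asserted_bits(reading))            # every bit that is set
print(asserted_bits(reading & 0x7FFF))   # same, with bit 15 masked out (assumption)
```

If the reserved-bit assumption holds, only a single state bit (bit 1) is actually asserted in this reading.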

 

Please, please help!

The only non-validated part now is the memory, but the BIOS is happy: initialization goes fine and memory is OK, so I doubt that this is caused by memory. Actually, I could not change that anyway, as certified memory is no longer available anywhere. Sometimes a reseller has one or maybe two modules of validated memory, but I would need 8 of the same make and model, which is mission impossible as of now.

 

Regards, Uli

Wanner_G_Intel
Moderator

Hello UlrichP,


Thank you for your response.


Please allow us to review the details you have shared with us. We will share an update soon.


Wanner G.

Intel Customer Support Technician


Paul_R_Intel
Moderator

Hello UlrichP,

 

Thank you for your patience and time. We are still analyzing the inquiry. To investigate further, please provide a new set of Sysinfo logs and share a photograph of the opened chassis so that we can verify the configuration.

 

We would like to remind you that the compatible heatsinks are the following:

 

[Image: 1.jpg]

 

And we would like to know if you acquired the compatible Fans for Intel P4304XXMUXX Chassis, which are the following:

 

[Image: 2.jpg]

I will be waiting for your response.

 

Paul R. 

Intel Customer Support Technician 

UlrichP
Beginner

Hi Paul ( hello again ;-),

 

Well, I just bought a fully equipped P4304XXMUXX, including the following:

(1) standard control panel (FXXFPANEL);
(5) Redundant and hot-swap 80mm system fans ( FUPMLHSFAN);
(4) fixed 3.5 inch drive sleds (FUP4X35NHDK);
(4) fixed power connectors;
(1) 4-port fan-out SATA cable;
(1) power supply cage (FUPCRPSCAGE);
(1) power distribution board (FUPPDBHC2);
(1) processor/memory airduct (A4UCWDUCT);
(1) hot-swap bezel with lock (FUPBEZELHSD2);
(1) MiniSAS HD to 4 ports SATA 7 pins cable (AXXCBL450HD7S)

Thus: the fans do meet the requirement.

 

The heatsinks used are (as mentioned before) 1x BXSTS300C and 1x AXXSTPHMKIT2U, both with CPU carrier clip AXXSTCPUCAR.

The older BXSTS300C is the predecessor of the AXXSTPHMKIT2U and is working fine in two other S2600STBR systems, but we can run a test with just one CPU and just the AXXSTPHMKIT2U heatsink.

 

Additionally, I bought a PSU FXX750PCRPS (the newer and more expensive one, #915604).

 

And just to provide the complete add-on list: I also have RMM4 (AXXRMM4LITE2) and TPM 2.0 (AXXTPMENC8) installed.

 

Please find attached a photograph of the opened chassis (with the air duct removed in order to show the board, CPUs, cabling etc.).

 

One thing we have not touched on yet: there is a small green LED on the board, located next to the BMC chip, that keeps blinking green at half-second intervals (the one at the outer end of PCI slot #1, labeled DS1B1).

 

I also attached a sysinfo_log.txt and PCI_log.txt, as requested.

 

Now I will wait for your response...

 

Regards, Uli

Paul_R_Intel
Moderator

Hello UlrichP,


Thank you for all the information provided.


Please allow us to review the details you have shared with us. We will share an update soon.


Regards, 

 

Paul R.  

Intel Customer Support Technician  

Paul_R_Intel
Moderator

Hello UlrichP, 


I hope you are doing great. Unfortunately, the Sysinfo logs did not contain any human-readable logs for analyzing the system's current behavior. Please provide the SEL log and a debug log so that we can verify the 91 entries on the system and understand this amber light.

 

 Debug Logs:

  • These logs can be extracted by going to BMC Console > System Information > System Debug Log > Generate Log.


SEL logs:

  • These logs can be extracted by going to BMC Console > Server Health > Event Log > Save Event Log.


I will be waiting for your update.


Paul R. 

Intel Customer Support Technician 

UlrichP
Beginner

Hi Paul,

 

Please find attached the debug logs and the SEL log.

By the way, I did not find BMC Console > System Information > System Debug Log > Generate Log.

But I found it under Server Diagnostics.

 

Regards,

Uli

 

Paul_R_Intel
Moderator

Hello UlrichP,


Thank you for all the information provided.


Please allow us to analyze the logs. We will share an update soon.


Regards, 

 

Paul R.  

Intel Customer Support Technician  
