
320 / 600 GB in Proliant DL380/G7 - shows as overheating

idata
Esteemed Contributor III

Just installed 5 of these in a RAID configuration on a test server. Working perfectly, except:

The server is reporting that the drives are overheating (they're not). It appears that possibly I can turn OFF DIPM on these drives and the SMART info will be reported correctly.

How do I do that?

FYI, running Windows Server 2008 R2.

Thanks,

Rob

46 REPLIES

idata
Esteemed Contributor III

I've talked to HP support in Russia and the answer was the same.

Maybe there is a workaround to make the server work without forcing the fans to go full blast?

When I change the Thermal Protection settings in the BIOS, the server does not shut down, but the fans kick in at full speed.

idata
Esteemed Contributor III

I have 4 Intel 320 series 120GB drives with the same problem in a HP DL385G7. I also have several DL380G6's using 12 of the 160GB G2's.

The 320 series is supposed to be a direct replacement of the G2's, and indeed share the same firmware updates.

On the machines running the 160's, no issues. The second I pop the 320 series in, I get the fans spinning up, and the iLO begins to shut down the server because it thinks the drives are about to go nuclear.

Here's the problem:

The server in question has 8 bays. The first 4 hold WD 600GB VelociRaptors. This is how they report temperature according to SMART.

Physical Port 1I

Physical Box 1 (0x01)
Physical Bay 4 (0x04)
RPM 10000 RPM (0x00002710)
Device Type Sata (0x01)
SATA Version 0x00
Big Total Block Count 0x0000000045dd2fb0
RIS Starting LBA 0x0000000045dc37b0
RIS Size 160 KB (0x00000140)
......

Vendor Unique Inquiry Bytes All Zeroes (20 x [0x00])

Current Temperature 0x1c  Temperature Threshold 0x37  Maximum Temperature 0x2d

This is the same log file for the 320 series:

Physical Port 2I

Physical Box 1 (0x01)
Physical Bay 5 (0x05)
RPM 0 RPM (0x00000001)
Device Type Sata Ssd (0x01)
SATA Version 0x00
Big Total Block Count 0x000000000df94bb0
RIS Starting LBA 0x000000000df853b0
RIS Size 160 KB (0x00000140)
......

Vendor Unique Inquiry Bytes All Zeroes (20 x [0x00])

Current Temperature 0xff  Temperature Threshold 0x37  Maximum Temperature 0x00 

So, Intel's 320 drive is reporting a current temp of 0xff (max) and a max temp of 0x00 (min). If the previous poster was correct, and the drive simply reported 0x00 as it should for unsupported features, then there would be no problem. This is NOT what it is doing, however. I've been waiting with an open ticket for a MONTH for a resolution on this. I don't care if it reports a fake number, I don't care if it reports 0x00. I just want to use the friggin $1000 worth of SSDs for my VMs like I'd planned on and budgeted for.
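For reference, the hex bytes in the two logs can be converted and compared with a few lines of Python. This is only a sketch: it assumes each byte is an unsigned value in degrees Celsius, which is a guess about how the controller reads it, and the dictionary and function names are made up for illustration.

```python
# Temperature bytes copied from the two controller logs above.
# Assumption: each byte is an unsigned value in degrees Celsius.
velociraptor = {"current": 0x1C, "threshold": 0x37, "maximum": 0x2D}
intel_320 = {"current": 0xFF, "threshold": 0x37, "maximum": 0x00}


def over_threshold(drive):
    """Return True when the reported current value exceeds the threshold."""
    return drive["current"] > drive["threshold"]


for name, drive in (("WD VelociRaptor", velociraptor), ("Intel 320", intel_320)):
    print(name, drive, "over threshold:", over_threshold(drive))
```

Read that way, the VelociRaptor reports 28 against a threshold of 55, while the 320 appears to report 255 against the same threshold, which would explain the overheating alarm.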

idata
Esteemed Contributor III

Aaron, your conclusion is incorrect in numerous regards, and it's because you don't understand how SMART attributes are stored and aren't familiar with the protocol. It's understandable (most people aren't), and your frustration is justified.

First and foremost, as I've stated twice already, Intel SSDs do not have any support for SMART attribute 194. There is no thermistor. The SMART attribute isn't even listed/provided by the drive when it responds to ATA feature command 0xd0 (obtain SMART attributes and their data). So, simply put, there is no 0xff value for the iLO (or whatever; BIOS, firmware, etc. -- doesn't matter to me what it is) to obtain. Secondly, each SMART attribute consists of 6 bytes of data, not 1 byte.

Here is the SMART data structure that's returned from ATA feature command 0xd0. This is taken from my own SMART code for FreeBSD's atacontrol:

/*
 * Obtain SMART attributes and their values
 *
 * Feature 0xd0 result (see ATA8-ACS, section 7.53.6.2, table 49):
 *
 * The 512-byte result of SMART READ DATA is documented per ATA8-ACS
 * specification, section 7.53.6.2, table 49. However, you'll find
 * bytes 0-361 marked "Vendor specific"; these are (mostly) the
 * actual SMART attributes themselves. Example:
 *
 * Offset   Size (B)  Description
 * -------- --------- -----------
 * 0        2         SMART attribute revision (16-bit, big endian)
 * 2        12        SMART attribute data entry # 0
 * 14       12        SMART attribute data entry # 1
 * .....
 * 350      12        SMART attribute data entry # 29
 * .....
 * -------- --------- -----------
 *
 * The SMART attribute data format is completely undocumented. It
 * consists of 12 bytes per attribute in the following format:
 *
 * Offset   Size (B)  Description
 * -------- --------- -----------
 * 0        1         Attribute ID number
 * 1        2         Attribute flags
 * 3        1         Attribute CURRENT value (adjusted)
 * 4        1         Attribute WORST value (adjusted)
 * 5        6         Attribute data
 * 11       1
 * -------- --------- -----------
 */
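The layout in that comment can be walked with a short Python sketch. It is not the atacontrol code: the function names are hypothetical, and it assumes little-endian flags and raw data and that an ID of 0 marks an unused slot. It shows that a caller has to look attribute 194 up and handle the case where the drive never lists it.

```python
import struct

ENTRY_SIZE = 12    # bytes per attribute entry, per the layout above
ENTRY_COUNT = 30   # entries 0..29
TABLE_OFFSET = 2   # entries start after the 2-byte revision field


def parse_smart_attributes(sector):
    """Return {attribute_id: fields} for a 512-byte SMART READ DATA result."""
    if len(sector) != 512:
        raise ValueError("expected a 512-byte sector")
    attributes = {}
    for index in range(ENTRY_COUNT):
        start = TABLE_OFFSET + index * ENTRY_SIZE
        # Assumption: multi-byte fields are little-endian.
        attr_id, flags, current, worst, raw, _last = struct.unpack_from(
            "<BHBB6sB", sector, start)
        if attr_id == 0:  # assumption: ID 0 marks an unused slot
            continue
        attributes[attr_id] = {
            "flags": flags,
            "current": current,
            "worst": worst,
            "raw": int.from_bytes(raw, "little"),
        }
    return attributes


def temperature_raw(sector):
    """Return the raw data of attribute 194, or None if the drive omits it."""
    entry = parse_smart_attributes(sector).get(194)
    return None if entry is None else entry["raw"]


# Synthetic sector that lists only attribute 9, as a drive without 194 might.
sector = bytearray(512)
struct.pack_into("<BHBB6sB", sector, TABLE_OFFSET,
                 9, 0x0032, 100, 100, (1234).to_bytes(6, "little"), 0)
print(temperature_raw(bytes(sector)))  # prints None
```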

Simply put, the iLO device *is not* getting back 0xff from the SSD itself. There is probably a broken piece of code in their iLO firmware or iLO BIOS which makes the assumption SMART attribute 194 exists, tries to refer to it in some in-memory buffer, and gets back whatever the contents of that buffer are (which haven't been populated/filled because there's nothing there). Lots of buffers, especially on embedded hardware (which the iLO is considered), are initialised with value 0xff rather than 0x00. There are a lot of reasons for this initialisation value, and the value chosen IS NOT the bug. The bug is almost certainly that HP *assumes* all drives installed in their chassis support SMART attribute 194, and that is absolutely wrong/false/broken.
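To make that hypothesis concrete, here is a purely illustrative Python sketch. It is NOT HP's firmware code; the function names, the one-byte cache and the 0xff fill value are all assumptions made to show how a missing attribute could surface as 255.

```python
# Hypothetical illustration only: this is NOT HP's firmware code.
UNSET = 0xFF  # assumed fill value for an unpopulated buffer


def read_temperature_naive(attributes, cache):
    """Copy attribute 194 into the cache if present, then return the cache.

    If the drive never lists attribute 194, the cache keeps its fill value.
    """
    if 194 in attributes:
        cache[0] = attributes[194]
    return cache[0]


def read_temperature_checked(attributes):
    """Return the temperature, or None when the drive does not report one."""
    return attributes.get(194)


wd_attributes = {194: 0x1C}  # drive that reports attribute 194
ssd_attributes = {}          # drive that does not list attribute 194

print(read_temperature_naive(wd_attributes, bytearray([UNSET])))   # 28
print(read_temperature_naive(ssd_attributes, bytearray([UNSET])))  # 255
print(read_temperature_checked(ssd_attributes))                    # None
```

The checked variant treats "not reported" as its own state instead of a temperature, which is the behaviour a controller would need in order to tolerate drives without attribute 194.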

Your Western Digital WD6000BLHX or WD6000HLHX drives do have thermistors, and WD populates SMART attribute 194 with appropriate thermistor data. 

So again: please send all of your complaints/concerns to your support reps at HP, because this situation is exactly why you pay for support contracts. The more pressure you put on HP the better. There's absolutely nothing that requires a hard disk vendor (mechanical or SSD; doesn't matter which) to support SMART attribute 194, nor do they have to install a thermistor in their drives at all. HP therefore is making a very bad assumption, and the only persons suffering from it are their customers.

idata
Esteemed Contributor III

Just a little more information about temperature and Intel SSDs.

We have both the G2 and the 320 SSD drives from Intel. The G2s are working fine on an HP DL380G6. We have problems with the 320s on a DL380G7.

But the way both drives manage the temperature is different.

On the G2 :

smartctl -i -c -l scttempsts -d sat+cciss,0 /dev/cciss/c0d0
smartctl 5.40 2010-10-16 r3189 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: Intel X18-M/X25-M/X25-V G2 SSDs
Device Model: INTEL SSDSA2M160G2GC
Serial Number: CVPO0073011D160AGN
Firmware Version: 2CV102HD
User Capacity: 160,041,885,696 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 1
Local Time is: Fri Jun 24 18:03:29 2011 CEST

==> WARNING: This drive may require a firmware update to
fix possible drive hangs when reading SMART self-test log:
http://downloadcenter.intel.com/Detail_Desc.aspx?DwnldID=18363

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
Warning: device does not support SCT Commands

On the new 320 :

smartctl -i -c -l scttempsts -d sat+cciss,0 /dev/cciss/c0d0
smartctl 5.40 2010-10-16 r3189 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family: Intel 320 Series SSDs
Device Model: INTEL SSDSA2CW300G3
Serial Number: CVPR12140100300EGN
Firmware Version: 4PC10302
User Capacity: 300,069,052,416 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Fri Jun 24 18:03:05 2011 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
...
SCT capabilities: (0x003d) SCT Status supported.
                           SCT Error Recovery Control supported.
                           SCT Feature Control supported.
                           SCT Data Table supported.
...
SCT Status Version: 3
SCT Version (vendor specific): 1 (0x0001)
SCT Support Level: 0
Device State: Active (0)
Current Temperature: ? Celsius
Power Cycle Min/Max Temperature: ?/ ? Celsius
Lifetime Min/Max Temperature: ?/ ?
...
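A note on the "?" entries in the 320 listing: as far as can be told, smartctl prints "?" when an SCT temperature byte holds 0x80, which the SCT status format uses as a "no valid reading" marker. Treat that as an assumption to check against ATA8-ACS and the smartmontools source. Under that assumption, a minimal Python sketch of the rendering (function and constant names are illustrative):

```python
SCT_TEMP_INVALID = 0x80  # assumed "no valid reading" marker


def format_sct_temperature(raw_byte):
    """Render one SCT temperature byte the way the listing above shows it."""
    if raw_byte == SCT_TEMP_INVALID:
        return "?"
    # Interpret the byte as a signed 8-bit value in degrees Celsius.
    value = raw_byte - 256 if raw_byte > 127 else raw_byte
    return str(value)


print(format_sct_temperature(0x80))  # ?
print(format_sct_temperature(0x1C))  # 28
```

If that reading is right, the 320 is answering the SCT status request but marking every temperature field as invalid, rather than returning a bogus number.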

idata
Esteemed Contributor III

I'm in agreement with Fabrice's analysis here. Fabrice, please make sure you're using smartmontools 5.41 or newer, as SSD support has greatly improved in that version. Regardless, I hadn't noticed until now that SCT isn't available on the X18/X25 series, while the 320 series (and possibly the 510) drives do claim to support SCT -- yet do not.

An example on a 320-series drive:

# smartctl -a /dev/ada0

...

=== START OF INFORMATION SECTION ===

Model Family: Intel 320 Series SSDs
Device Model: INTEL SSDSA2CW080G3
Serial Number: X
LU WWN Device Id: X
Firmware Version: 4PC10302
User Capacity: 80,026,361,856 bytes [80.0 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 4
Local Time is: Sat Jun 25 08:43:10 2011 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

General SMART Values:

Offline data collection status: (0x00) Offline data collection activity
                                       was never started.
                                       Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
                                 without error or no self-test has ever
                                 been run.
Total time to complete Offline
data collection: ( 1) seconds.
Offline data collection
capabilities: (0x71) SMART execute Offline immediate.
                     No Auto Offline data collection support.
                     Suspend Offline collection upon new command.
                     No Offline surface scan supported.
                     Self-test supported.
                     Conveyance Self-test supported.
                     Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
                             power-saving mode.
                             Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
                                 General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 1) minutes.
Conveyance self-test routine
recommended polling time: ( 1) minutes.
SCT capabilities: (0x003d) SCT Status supported.
                           SCT Error Recovery Control supported.
                           SCT Feature Control supported.
                           SCT Data Table supported.
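The SCT capabilities value 0x003d in that output can be broken into its individual bits. The bit table below is an assumption based on a reading of IDENTIFY DEVICE word 206 in ATA8-ACS and should be checked against the spec; the names are illustrative.

```python
# Assumed bit meanings for the SCT capabilities word (IDENTIFY DEVICE
# word 206 in ATA8-ACS). Verify against the specification before relying on it.
SCT_CAPABILITY_BITS = {
    0: "SCT Command Transport supported",
    1: "SCT Long Sector Access supported",
    2: "SCT Write Same supported",
    3: "SCT Error Recovery Control supported",
    4: "SCT Feature Control supported",
    5: "SCT Data Tables supported",
}


def decode_sct_capabilities(word):
    """Return the labels of all capability bits set in the word."""
    return [label for bit, label in sorted(SCT_CAPABILITY_BITS.items())
            if word & (1 << bit)]


for label in decode_sct_capabilities(0x003D):
    print(label)
```

With this table, 0x003d has bits 0, 2, 3, 4 and 5 set, which lines up with the four capabilities smartctl lists plus Write Same.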

Note that the drive is indeed claiming SCT is supported, which is absolutely not possible given the lack of thermistor on these drives. The -x flag to smartctl gives further confirmation, including the SCT table itself:

# smartctl -x /dev/ada0

...

SMART Log Directory Version 1 [multi-sector log support]

...

SMART Log at address 0xe0 has 1 sectors [SCT Command/Status]

SMART Log at address 0xe1 has 1 sectors [SCT Data Transfer]

...

(pass0:ahcich0:0:0:0): SMART. ACB: b0 d5 e1 4f c2 40 00 00 00 00 01 00

(pass0:ahcich0:0:0:0): CAM status: ATA Status Error
(pass0:ahcich0:0:0:0): ATA status: 51 (DRDY SERV ERR), error: 04 (ABRT )
(pass0:ahcich0:0:0:0): RES: 51 04 00 00 00 40 00 00 00 11 00
Error Read SCT Data Table failed: No error: 0
SCT Error Recovery Control:
           Read: 57344 (5734.4 seconds)
          Write: 57344 (5734.4 seconds)
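The ACB and status bytes in that output can be picked apart with a short Python sketch. The field order is an assumption about how FreeBSD CAM prints the ATA command block, and the bit-name tables are assumptions taken from the names the kernel printed above; only the SMART command (0xb0), the SMART READ LOG feature (0xd5) and the 0x4f/0xc2 signature are standard ATA values.

```python
# Assumed field order of the 12 bytes FreeBSD CAM prints after "ACB:".
ACB_FIELDS = ("command", "features", "lba_low", "lba_mid", "lba_high",
              "device", "lba_low_exp", "lba_mid_exp", "lba_high_exp",
              "features_exp", "sector_count", "sector_count_exp")

# Assumed bit names, matching the labels in the kernel message above.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "SERV",
               0x08: "DRQ", 0x04: "CORR", 0x02: "IDX", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x20: "MC", 0x10: "IDNF",
              0x08: "MCR", 0x04: "ABRT", 0x02: "NM", 0x01: "ILI"}


def parse_acb(text):
    """Map a space-separated hex string onto the assumed ACB field names."""
    return dict(zip(ACB_FIELDS, bytes.fromhex(text)))


def decode_bits(value, table):
    """Return the names of all bits in the table that are set in value."""
    return [name for mask, name in table.items() if value & mask]


acb = parse_acb("b0 d5 e1 4f c2 40 00 00 00 00 01 00")
print(hex(acb["command"]), hex(acb["features"]), hex(acb["lba_low"]))
print(decode_bits(0x51, STATUS_BITS))  # ['DRDY', 'SERV', 'ERR']
print(decode_bits(0x04, ERROR_BITS))   # ['ABRT']
```

Under those assumptions, the failing command is SMART (0xb0) with feature 0xd5 (SMART READ LOG) against log address 0xe1, and the drive aborted it.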

The CAM-translated ATA errors you see above are coming from the FreeBSD kernel, indicating the drive doesn't support that particular ATA command byte. Which byte? My guess is the attempt to read SCT log 0xe1, which will be proven below. So let's see what SMART logs 0xe0 and 0xe1 contain, if anything:

# smartctl -l smartlog,0xe0 /dev/ada0

smartctl 5.41 2011-06-09 r3365 [FreeBSD 8.2-STABLE amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

SMART Log 0xe0 [SCT Command/Status], Page 0-0 (of 1)

0000000: 03 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
0000010: 03 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
0000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
0000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
0000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
0000050: 00 00 00 00 00 00 00...