Solved: Intel Xeon Gold ADDDC Memory RAS Feature Clarification

Boomerang · ‎09-19-2020

I'm struggling to find any detailed information about how the ADDDC memory RAS technique actually works. I'm specifically interested in the memory integrity guarantees that it provides.

Would ADDDC be able to cope with a DIMM device failing completely, without throwing correctable errors first (e.g. if it's snapped off the memory card)? Alternatively, maybe ADDDC acts like device sparing, followed by SDDC? In that case it cannot deal with a sudden DIMM device failure.

I struggle to understand how ADDDC would be able to achieve the same result as DDDC without any performance impact unless it lowers the protection guarantees...

Thanks in advance for any clarifications!

SergioS_Intel · ‎09-20-2020

Hello Boomerang,

Thank you for contacting Intel Customer Support.

Regarding your question: with the advent of ADDDC, the memory subsystem is always configured to operate in performance mode. When the number of corrections on a DRAM device reaches the targeted threshold value, the identified failing DRAM region is adaptively placed in lockstep mode with help from the UEFI runtime code, and the identified failing region of the DRAM device is mapped out of ECC. Once in ADDDC, cache line ECC continues to detect single DRAM device (x4) errors and applies a correction algorithm to the nibble.
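To make the sequence above easier to follow, here is a minimal sketch of the threshold-driven transition as a toy state machine. It is only an illustration of the description in this post, not Intel's firmware logic: the `Region` class, the `on_corrected_error` function, and the `THRESHOLD` value are all hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class Mode(Enum):
    INDEPENDENT = "independent (performance) mode"
    VIRTUAL_LOCKSTEP = "virtual lockstep (ADDDC)"


# Hypothetical threshold; the real value is platform/firmware specific.
THRESHOLD = 3


@dataclass
class Region:
    """A bank/rank-sized region tracked by an imaginary firmware handler."""
    name: str
    mode: Mode = Mode.INDEPENDENT
    corrected: Dict[int, int] = field(default_factory=dict)  # device -> count
    mapped_out_device: Optional[int] = None


def on_corrected_error(region: Region, device: int) -> Mode:
    """Count a corrected error on one DRAM device of the region.

    While the region is in independent mode and the per-device count
    reaches THRESHOLD, the device is recorded as mapped out and the
    region moves to virtual lockstep. Later errors do not change the mode.
    """
    count = region.corrected.get(device, 0) + 1
    region.corrected[device] = count
    if region.mode is Mode.INDEPENDENT and count >= THRESHOLD:
        region.mapped_out_device = device
        region.mode = Mode.VIRTUAL_LOCKSTEP
    return region.mode


if __name__ == "__main__":
    rank = Region("rank0")
    for _ in range(THRESHOLD):
        mode = on_corrected_error(rank, device=5)
    print(rank.name, mode.value, "mapped-out device:", rank.mapped_out_device)
```

The sketch only models the bookkeeping (count corrections, compare against a threshold, switch mode once). It says nothing about how the ECC itself works in either mode.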

You will be able to find more detailed information here:

https://software.intel.com/content/www/us/en/develop/articles/new-reliability-availability-and-serviceability-ras-features-in-the-intel-xeon-processor.html

Best regards,

Sergio S.

Intel Customer Support Technician

For firmware updates and troubleshooting tips, visit: https://intel.com/support/serverbios

Boomerang · ‎09-20-2020

Hello Sergio,

Thank you for the reply.

In that case would it be correct to describe ADDDC as memory sparing followed by SDDC?

Since the memory operates in Performance mode, the words are not split between two channels, so not all errors within a single DIMM chip can be corrected through ECC. However, once a certain threshold of errors is reached, the memory layout changes to Lockstep mode and becomes redundant. Is that true?

I've read the document you linked, but I can interpret it in two different ways, so it'd be great to hear it explained differently... The line that's giving me trouble is "where the identified failing region of the DRAM device is mapped out of ECC". How can the failing region be mapped out of ECC if the memory is in Performance mode?

SergioS_Intel · ‎09-21-2020

Hello Boomerang,

Please allow us to check on your question and we will get back to you as soon as possible.

Best regards,

Sergio S.

Intel Customer Support Technician

Boomerang · ‎09-24-2020

Hello Sergio,

Thank you. Looking forward to your reply.

Kind regards,

Boomerang

IntelSupport · ‎10-01-2020

Hello Boomerang

I would like to let you know that we are still investigating your forum question; thank you for waiting. In the meantime, I have sent you a private email to collect your contact information.

Regards,

Leonardo C.

Intel Customer Support Technician

IntelSupport · ‎10-05-2020

Hello Boomerang

I am following up on this thread. I would like to know whether you received the private email that I sent you to collect your contact details.

Regards,

Leonardo C.

Intel Customer Support Technician

Boomerang · ‎10-06-2020

Hello Leonardo,

Just sent all the details.

Kind regards,

Boomerang

IntelSupport · ‎10-06-2020

Hello Boomerang

Thank you for waiting. Please see the information below:

Q1: Would it be correct to describe ADDDC as memory sparing followed by SDDC?

A1: When we transition from SDDC to ADDDC, a memory bank/rank gets mapped out, and the memory region that entered Virtual lockstep will be using ADDDC ECC code.

Q2: Since the memory operates in Performance mode, the words are not split between two channels, so not all errors within a single DIMM chip can be corrected through ECC. However, once a certain threshold of errors is reached, the memory layout changes to Lockstep mode and becomes redundant. Is that true?

A2: Yes

Q3: I've read the [whitepaper], but I can interpret it in two different ways, so it'd be great to hear it explained differently... The line that's giving me trouble is "where the identified failing region of the DRAM device is mapped out of ECC". How can the failing region be mapped out of ECC if the memory is in Performance mode?

A3: ADDDC enables the platform to dynamically map out the failing DRAM device. After map out occurs, cache lines in the bank/rank are re-arranged from independent mode to virtual lockstep utilizing ADDDC ECC.

Hope this helps.

Regards,

Leonardo C.

Intel Customer Support Technician

Boomerang · ‎10-08-2020

Hello Leonardo,

Thank you for the response. I had mistakenly assumed that SDDC and ADDDC used different methods for error detection and correction, based on this application note: https://www.intel.com/content/dam/doc/application-note/e7500-chipset-mch-x4-single-device-data-correction-note.pdf. I see now that it is a document for an old platform (the E7500 chipset).

Kind regards,

Boomerang
