Intel Xeon Gold ADDDC Memory RAS Feature Clarification

Boomerang
Beginner

I'm struggling to find any detailed information about how the ADDDC memory RAS technique actually works. I'm specifically interested in the memory integrity guarantees that it provides.

Would ADDDC be able to cope with a DRAM device on a DIMM failing completely, without throwing correctable errors first (e.g. if it's snapped off the DIMM)? Alternatively, maybe ADDDC acts like device sparing followed by SDDC? In that case, it could not deal with a sudden device failure.

I struggle to understand how ADDDC could achieve the same result as DDDC without any performance impact unless it lowers the protection guarantees...

Thanks in advance for any clarifications!

SergioS_Intel
Moderator

Hello Boomerang,


Thank you for contacting Intel Customer Support.

Regarding your question: with the advent of ADDDC, the memory subsystem is always configured to operate in performance mode. When the number of corrections on a DRAM device reaches the targeted threshold value, with help from the UEFI runtime code, the identified failing DRAM region is adaptively placed in lockstep mode where the identified failing region of the DRAM device is mapped out of ECC. Once in ADDDC, cache line ECC continues to cover single DRAM (x4) device error detection and applies a correction algorithm to the nibble.
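
As a rough illustration of the sequence described above, the threshold-triggered transition can be sketched as a small state machine. This is only a toy model, not Intel's firmware logic: the `Region` class, the `record_corrected_error` function, and the `CORRECTION_THRESHOLD` value are all hypothetical names chosen for this example, and the real threshold is a platform setting.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class Mode(Enum):
    INDEPENDENT = "independent (performance) mode"
    VIRTUAL_LOCKSTEP = "virtual lockstep (ADDDC)"


# Hypothetical value; the real threshold is configured by the platform firmware.
CORRECTION_THRESHOLD = 3


@dataclass
class Region:
    """Toy model of a bank- or rank-sized memory region (hypothetical)."""
    name: str
    mode: Mode = Mode.INDEPENDENT
    corrections: Dict[int, int] = field(default_factory=dict)
    mapped_out_device: Optional[int] = None


def record_corrected_error(region: Region, device: int) -> Mode:
    """Count one corrected error on a DRAM device and update the region mode."""
    region.corrections[device] = region.corrections.get(device, 0) + 1
    # Only a region still in independent mode can make the transition.
    if (region.mode is Mode.INDEPENDENT
            and region.corrections[device] >= CORRECTION_THRESHOLD):
        region.mapped_out_device = device      # failing device is mapped out
        region.mode = Mode.VIRTUAL_LOCKSTEP    # region now relies on ADDDC ECC
    return region.mode


region = Region("rank0/bank2")
for _ in range(CORRECTION_THRESHOLD):
    record_corrected_error(region, device=5)
print(region.mode.value, region.mapped_out_device)
```

Note that in this model the transition is driven purely by the count of corrected errors, which mirrors the description above; it says nothing about what happens when a device fails outright before any corrections are logged.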


You will be able to find more detailed information here:


https://software.intel.com/content/www/us/en/develop/articles/new-reliability-availability-and-serviceability-ras-features-in-the-intel-xeon-processor.html



Best regards,

Sergio S.

Intel Customer Support Technician

For firmware updates and troubleshooting tips, visit: https://intel.com/support/serverbios


Boomerang
Beginner

Hello Sergio,

Thank you for the reply.

In that case, would it be correct to describe ADDDC as memory sparing followed by SDDC?

Since the memory operates in Performance mode, the words are not split between two channels, so not all errors within a single DIMM chip can be corrected through ECC. However, once a certain threshold of errors is reached, the memory layout changes to Lockstep mode and becomes redundant. Is that true?

I've read the document you linked, but I can interpret it in two different ways, so it'd be great to hear it explained differently... The line that's giving me trouble is "where the identified failing region of the DRAM device is mapped out of ECC". How can the failing region be mapped out of ECC if the memory is in Performance mode?

SergioS_Intel
Moderator

Hello Boomerang,


Please allow us to check on your question and we will get back to you as soon as possible.


Best regards,

Sergio S.

Intel Customer Support Technician



Boomerang
Beginner

Hello Sergio,

Thank you. Looking forward to your reply.

Kind regards,

Boomerang

IntelSupport
Community Manager

Hello Boomerang


We are still investigating your question; thank you for waiting. In the meantime, I have sent you a private email to collect your contact information.


Regards,

Leonardo C.


Intel Customer Support Technician


IntelSupport
Community Manager

Hello Boomerang


I am following up on this thread. Could you confirm whether you received the private email that I sent you to collect your contact details?


Regards,

Leonardo C.


Intel Customer Support Technician


Boomerang
Beginner

Hello Leonardo,

Just sent all the details.

Kind regards,

Boomerang

IntelSupport
Community Manager

Hello Boomerang


Thank you for waiting. Please see the information below:


Q1: Would it be correct to describe ADDDC as memory sparing followed by SDDC?


A1: When the system transitions from SDDC to ADDDC, a memory bank/rank gets mapped out, and the memory region that entered virtual lockstep uses the ADDDC ECC.


Q2: Since the memory operates in Performance mode, the words are not split between two channels, so not all errors within a single DIMM chip can be corrected through ECC. However, once a certain threshold of errors is reached, the memory layout changes to Lockstep mode and becomes redundant. Is that true?

A2: Yes


Q3: I've read the [whitepaper], but I can interpret it in two different ways, so it'd be great to hear it explained differently... The line that's giving me trouble is "where the identified failing region of the DRAM device is mapped out of ECC". How can the failing region be mapped out of ECC if the memory is in Performance mode?

A3: ADDDC enables the platform to dynamically map out the failing DRAM device. After map out occurs, cache lines in the bank/rank are re-arranged from independent mode to virtual lockstep utilizing ADDDC ECC.


Hope this helps.


Regards,

Leonardo C.


Intel Customer Support Technician


Boomerang
Beginner

Hello Leonardo,

Thank you for the response. I had mistakenly assumed that SDDC and ADDDC used different methods for error detection and correction (https://www.intel.com/content/dam/doc/application-note/e7500-chipset-mch-x4-single-device-data-correction-note.pdf), and I see now that this document covers an old chipset.

Kind regards,

Boomerang
