Software Storage Technologies
Virtual RAID, RSTe, and Memory Drive Technology

Recommended Utility for Monitoring Intel® VROC RAID Status on Linux (Red Hat and Debian)

Samadhan
New Contributor I

Hello Intel Support,

We are using 5th Gen Intel® Xeon® machines with Red Hat Enterprise Linux as the host OS and Debian running as a guest in a KVM virtual machine. Intel® VROC is available on this Intel Xeon family.

 

I have reviewed the documentation but could not find a dedicated Intel software RAID utility for Linux (Red Hat). Currently, I am using mdadm (Linux software RAID utility), but it does not provide comprehensive status updates for Intel® VROC RAID.

[root@localhost ~]# mdadm --detail-platform
mdadm: imsm capabilities not found for controller: /sys/devices/pci0000:00/0000:00:17.0 (type SATA)
mdadm: imsm capabilities not found for controller: /sys/devices/pci0000:00/0000:00:19.0 (type SATA)
Platform : Intel(R) Virtual RAID on CPU
Version : 8.5.0.1096
RAID Levels : raid0 raid1 raid10
Chunk Sizes : 4k 8k 16k 32k 64k 128k
2TB volumes : supported
2TB disks : supported
Max Disks : 96
Max Volumes : 2 per array, 24 per controller
3rd party NVMe : supported
I/O Controller : /sys/devices/pci0000:9a/0000:9a:00.5 (VMD)
NVMe under VMD : /dev/nvme0n1 (S64FNN0X507593)
Encryption(Ability|Status): SED|Unencrypted
NVMe under VMD : /dev/nvme2n1 (S64FNN0X507591)
Encryption(Ability|Status): SED|Unencrypted
NVMe under VMD : /dev/nvme1n1 (S64FNN0X507583)
Encryption(Ability|Status): SED|Unencrypted
I/O Controller : /sys/devices/pci0000:c6/0000:c6:00.5 (VMD)

 

[root@localhost ~]# sudo mdadm --detail /dev/md126
/dev/md126:
Container : /dev/md/imsm, member 0
Raid Level : raid1
Array Size : 890806272 (849.54 GiB 912.19 GB)
Used Dev Size : 890806272 (849.54 GiB 912.19 GB)
Raid Devices : 2
Total Devices : 2

State : active
Active Devices : 2
Working Devices : 2
Failed Devices : 0

Consistency Policy : resync
UUID : 991e034f:faaee5eb:fc8c7c50:8482898a
    Number   Major   Minor   RaidDevice State
       1     259        0        0      active sync   /dev/nvme0n1
       0     259        2        1      active sync   /dev/nvme1n1

I am specifically looking to monitor the following RAID statuses (a rough sketch of how I plan to catch them follows this list):

  • FailSpare: The spare drive being rebuilt has failed.
  • DeviceDisappeared: A RAID volume disappeared or was removed.
  • DegradedArray: A RAID array is running in degraded mode.
  • RebuildStarted: Rebuilding or recovery of a degraded RAID has started.
  • RebuildNN: Notification of rebuild progress (e.g., 20%, 40%).
  • RebuildFinished: RAID rebuild completed or aborted.
  • SparesMissing: One or more spare drives are missing.
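
For reference, these appear to be the event names that mdadm's monitor mode reports, so my current plan is to start monitor mode with an alert program, roughly as sketched below. This is only a sketch of my intent; the handler path /usr/local/bin/vroc-raid-alert.sh is a placeholder of my own, not something taken from the VROC documentation.

# Run mdadm in monitor mode as a daemon; it invokes the alert program with
# the event name, the md device, and (for some events) the affected member device.
mdadm --monitor --scan --daemonise --delay=60 \
      --program=/usr/local/bin/vroc-raid-alert.sh

I am also considering setting the PROGRAM and MAILADDR keywords in /etc/mdadm.conf instead of passing command-line options, if that is the recommended approach.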

Could you please suggest the recommended Intel utility or solution to monitor and retrieve these details on Red Hat? 

Thank you for your assistance!

Ragulan_Intel
Employee

Hello Samadhan,


I hope this message finds you well.


Thank you for reaching out to the Intel Community!


Regarding your query, please refer to the Linux Intel® VROC user guide: https://www.intel.com/content/dam/support/us/en/documents/memory-and-storage/linux-intel-vroc-userguide-333915.pdf. In Section 2, you will find the mdadm commands you need for managing and monitoring Intel® VROC RAID.


We hope this information is helpful.


Thank You & Best regards,


Ragulan_Intel


Samadhan
New Contributor I

Hi @Ragulan_Intel 

Thank you for your reply.

I have been referring to the User Guide for Intel® Virtual RAID on CPU (Intel® VROC) for Linux, particularly the following sections:

  • Intel® VROC RAID Management in Linux
    • 4.3.2 Retrieve RAID Status through /proc/mdstat
    • 4.3.3 Extracting detailed RAID information
    • 4.3.4 Reading Intel® VROC RAID Metadata
  • 5.3 Intel® VROC RAID Alerts in Linux
  • 5.4 Develop a Program to Handle Intel® VROC Alerts

From the documentation, I understand that the mdadm utility is used for managing and monitoring software RAID devices in Linux.
For my task, I have identified the following useful commands:

  1. cat /proc/mdstat

    • This command helps identify the active RAID devices.
    • For example, in the output of cat /proc/mdstat below, I can see several RAID arrays in states such as "active" and "inactive."
    • My plan is to simply grep for the active RAID device and check its status from there.

 

 

[root@localhost ~]# cat /proc/mdstat
Personalities : [raid1]
md125 : inactive nvme2n1[0](S)
      1105 blocks super external:imsm
md126 : active raid1 nvme0n1[1] nvme1n1[0]
      890806272 blocks super external:/md127/0 [2/2] [UU]
md127 : inactive nvme1n1[1](S) nvme0n1[0](S)
      10402 blocks super external:imsm
unused devices: <none>

 

 

  2. mdadm --detail /dev/md126

    • This command provides detailed information about a RAID volume.
    • For example, in the output of mdadm --detail /dev/md126 below, the overall "State : active" line tells me the RAID volume's status/health is active (OK).
    • Member devices listed with the state "active sync" indicate that the RAID members are active and in sync.

 

 

[root@localhost ~]# mdadm --detail /dev/md126
/dev/md126:
         Container : /dev/md/imsm, member 0
        Raid Level : raid1
        Array Size : 890806272 (849.54 GiB 912.19 GB)
     Used Dev Size : 890806272 (849.54 GiB 912.19 GB)
      Raid Devices : 2
     Total Devices : 2
             State : active
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
Consistency Policy : resync
              UUID : 991e034f:faaee5eb:fc8c7c50:8482898a
    Number   Major   Minor   RaidDevice State
       1     259        0        0      active sync   /dev/nvme0n1
       0     259        2        1      active sync   /dev/nvme1n1

 

 

I am currently designing a shell script to monitor specific RAID statuses, and for this purpose, I need to understand the possible RAID status values beyond just "active" and "active sync."

Specifically, I need to confirm what the possible statuses are in cases of failure, degradation, missing drives, and offline devices.
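
To make my approach concrete, here is a minimal sketch of the check I am scripting so far. The grep/awk patterns are only my assumptions based on the outputs shown above (for example, that "[UU]" means both mirror members are up and "[U_]" would indicate a degraded mirror), so please correct me if these are not reliable indicators.

#!/bin/bash
# Minimal status check for the Intel VROC RAID1 volume /dev/md126 (sketch only).
MD_DEV=/dev/md126

# /proc/mdstat: "[UU]" = both members up; "[U_]" or "[_U]" = degraded mirror.
grep -A 1 '^md126' /proc/mdstat

# mdadm --detail: pull the overall array state and the failed-device count.
STATE=$(mdadm --detail "$MD_DEV" | awk -F' : ' '/^ *State :/ {print $2; exit}')
FAILED=$(mdadm --detail "$MD_DEV" | awk -F' : ' '/Failed Devices :/ {print $2}')
echo "md126 state: ${STATE}, failed devices: ${FAILED}"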

 

My main questions are:

  1. If a hard disk is missing:
    - What will be the output and RAID status in the `mdadm --detail /dev/md126` and `cat /proc/mdstat` commands?
    - Specifically, how will the RAID array reflect the missing disk's status?
  2. If a hard disk is offline:
    - How will this affect the RAID status shown by these commands? What would the status be (e.g., `offlinesyncing`)?
  3. If a RAID array is degraded:
    - What status will be reflected in the output?
    - Specifically, how does "degraded" appear in terms of RAID state in both `mdadm` output and the `/proc/mdstat` file?
  4. If a RAID is in a FailSpare state:
    - What is the output in `mdadm --detail`?
    - How is this state reflected in terms of the status of the RAID array and the individual drives?


I am not fully aware of all possible RAID status values, so it would be very helpful if you could list any other values I should expect as well.

I have found the following alerts/statuses in the manual:
- Critical Severity : (Fail, FailSpare, DeviceDisappeared, DegradedArray)
- Warning Severity : (RebuildStarted, RebuildNN, RebuildFinished, SparesMissing)

My goal is to monitor only the Critical/Failure and Warning statuses. Could you please explain what the status would be for the above severity types?
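
In the meantime, this is the kind of alert program I am planning to attach to mdadm --monitor via its --program option. The Critical/Warning grouping below simply mirrors the severity lists from the manual quoted above, and logging to syslog via logger is my own choice, so please let me know if Intel recommends a different handling approach.

#!/bin/bash
# Placeholder alert program (path chosen by me): /usr/local/bin/vroc-raid-alert.sh
# mdadm --monitor invokes it as: <event> <md device> [<member device>]
EVENT="$1"
MD_DEVICE="$2"
MEMBER="${3:-}"

case "$EVENT" in
    Fail|FailSpare|DeviceDisappeared|DegradedArray)
        SEVERITY="CRITICAL" ;;
    RebuildStarted|Rebuild[0-9][0-9]|RebuildFinished|SparesMissing)
        SEVERITY="WARNING" ;;
    *)
        SEVERITY="INFO" ;;
esac

logger -t vroc-raid "${SEVERITY}: ${EVENT} on ${MD_DEVICE} ${MEMBER}"

Does this event-to-severity mapping look correct, and are there other event names I should handle?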

Additionally:
Is it better to use the `mdadm --detail /dev/md<number>` command to check the complete RAID status, or should I use `mdadm --examine /dev/<nvme>` on any RAID member disk?
For example (a short sketch comparing the two approaches follows the outputs below):

 

[root@localhost ~]# blkid
/dev/mapper/rhel-swap: UUID="320f9b91-f1e8-4b95-962e-447bce6a5a41" TYPE="swap"
/dev/nvme0n1: TYPE="isw_raid_member"
/dev/nvme2n1: TYPE="isw_raid_member"
/dev/mapper/rhel-home: UUID="faff47ee-0fe9-4e7b-9c0e-644f8bf795c6" TYPE="xfs"
/dev/mapper/rhel-root: UUID="f5aa3497-1fc6-43a1-b4b4-be64c30a56da" TYPE="xfs"
/dev/nvme1n1: TYPE="isw_raid_member"
[root@localhost ~]# mdadm --examine /dev/nvme1n1
/dev/nvme1n1:
Magic : Intel Raid ISM Cfg Sig.
Version : 1.1.00
Orig Family : ba9034a3
Family : ba9034a3
Generation : 0002423c
Creation Time : Mon Nov 11 16:01:35 2024
Attributes : 80000000 (supported)
UUID : 929b5f70:e88572b1:2fa10274:3ec134c4
Checksum : 2512605a correct
MPB Sectors : 1
Disks : 2
RAID Devices : 1

Disk01 Serial : S64FNN0X507583
State : active
Id : 00000001
Usable Size : 1875374606 (894.25 GiB 960.19 GB)

[Volume0]:
Subarray : 0
UUID : 991e034f:faaee5eb:fc8c7c50:8482898a
RAID Level : 1
Members : 2
Slots : [UU]
Failed disk : none
This Slot : 1
Sector Size : 512
Array Size : 1781612544 (849.54 GiB 912.19 GB)
Per Dev Size : 1781614592 (849.54 GiB 912.19 GB)
Sector Offset : 0
Num Stripes : 6959424
Chunk Size : 64 KiB
Reserved : 0
Migrate State : idle
Map State : normal
Dirty State : dirty
RWH Policy : off
Volume ID : 1
Disk00 Serial : S64FNN0X507593
State : active
Id : 00000000
Usable Size : 1875374606 (894.25 GiB 960.19 GB)
[root@localhost ~]# mdadm --examine /dev/nvme2n1
/dev/nvme2n1:
Magic : Intel Raid ISM Cfg Sig.
Version : 1.0.00
Orig Family : c11cc5b9
Family : c11cc5b9
Generation : 00000000
Creation Time : Mon Nov 11 16:01:35 2024
Attributes : 80000000 (supported)
UUID : 00000000:00000000:00000000:00000000
Checksum : a155a7e5 correct
MPB Sectors : 1
Disks : 1
RAID Devices : 0

Disk00 Serial : S64FNN0X507591
State : spare
Id : 00000002
Usable Size : 1875382798 (894.25 GiB 960.20 GB)

Disk Serial : S64FNN0X507591
State : spare
Id : 00000002
Usable Size : 1875382798 (894.25 GiB 960.20 GB)
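
For what it is worth, after comparing the two outputs above, my current thinking is to rely on mdadm --detail for the volume-level state and to use mdadm --examine only as a cross-check of the IMSM metadata on a member disk. The fields I would grep for are sketched below; they are taken from the outputs above, so please correct me if they are not the right indicators.

# Volume-level view (md device): overall array state and failed-device count.
mdadm --detail /dev/md126 | grep -E '^ *State :|Failed Devices'

# Member-level view (IMSM metadata on a member disk): per-disk state and
# whether the metadata records a failed disk for the volume.
mdadm --examine /dev/nvme1n1 | grep -E '^ *State :|Failed disk'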

 

Please excuse my limited experience with this; I am new to the topic, and your support would be greatly appreciated.

Thank you for your time and assistance. I look forward to your clarification.


Best regards,
Samadhan

ManoranjanDas
Employee

Hello Samadhan,

  

Greetings for the day! 

  

Thank you for sharing this information. We are actively working on this case and will share updates as soon as possible.


Thank you for using Intel products and services.


Regards,  

Manoranjan. 

Intel Customer Support Technician  

Intel.com/vroc 

