Solved: how to integrate two DCPMM regions into one?

huangwentao · ‎03-02-2021

Hi all,

I have a machine with 2 sockets and 24 DIMMs. I have 8 128GB DCPMM modules, and each socket is equipped with 4 of them, so each socket has 512GB of NVM. I would like to configure all these modules in App Direct Mode. I followed the official guide to configure the machine:

ndctl create-namespace --region region0 --mode fsdax

ndctl create-namespace --region region1 --mode fsdax

mkfs.ext4 /dev/pmem0

mkfs.ext4 /dev/pmem1

mount -o dax /dev/pmem0 /pmemfs0

mount -o dax /dev/pmem1 /pmemfs1

So now I have two mount points with 512GB each. 512GB is not very large for my use case, so I am wondering if there is a way to combine these two mount points into one?

Many thanks for the help.

AdrianM_Intel · ‎03-02-2021

Hello huangwentao,

I was investigating more about your question and here is our answer:

Based on your system configuration:

One Server System with two CPUs and 24 DIMMs.

(8 x 128GB DCPMM modules / each socket equipped with 4 DCPMM).

Your goal is to unify both mount points of 512GB each.

If you keep this setup with two mount points, you may gain a NUMA advantage, provided your database or application understands and supports NUMA; however, configuring a single mount point with no NUMA awareness may cause some performance degradation.

Nevertheless, there are several ways to accomplish this implementation:

1. Creating dm-linear Devices:

Device-Mapper’s “linear” target maps a linear range of the Device-Mapper device onto a linear range of another device. For example, if two 512GiB devices are linearly mapped, the resulting virtual device is 1TiB.

Note: If the physical devices have already been configured within interleaved sets, dm-stripe devices could potentially stripe across Non-Uniform Memory Architecture Nodes (NUMA Nodes).

In this case, the two pmem devices (/dev/pmem0 and /dev/pmem1) would be combined into one larger mapped device.
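As a rough sketch of the dm-linear approach, the table that dmsetup expects can be generated from the device sizes. The helper name linear_table, the mapped-device name pmem-linear, and the mount point /pmemfs are made up for illustration; the sizes in the demo call are placeholders, not the real device sizes.

```shell
#!/bin/sh
# Print a dm-linear table that concatenates devices back to back.
# Arguments are "device:size_in_512_byte_sectors" pairs, in order.
# (linear_table is a hypothetical helper, not part of dmsetup.)
linear_table() {
    start=0
    for pair in "$@"; do
        dev=${pair%%:*}       # text before the first colon
        sectors=${pair##*:}   # text after the last colon
        # Table line format: <start> <length> linear <device> <offset>
        echo "$start $sectors linear $dev 0"
        start=$((start + sectors))
    done
}

# Demo with placeholder sizes (2048 sectors = 1 MiB per device).
linear_table /dev/pmem0:2048 /dev/pmem1:2048

# On the real machine (as root), something along these lines:
#   linear_table "/dev/pmem0:$(blockdev --getsz /dev/pmem0)" \
#                "/dev/pmem1:$(blockdev --getsz /dev/pmem1)" \
#       | dmsetup create pmem-linear
#   mkfs.ext4 /dev/mapper/pmem-linear
#   mount -o dax /dev/mapper/pmem-linear /pmemfs
```

The demo call prints one table line per device, with the second line starting where the first device ends. The namespaces should be unmounted before they are handed to the device mapper.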

2. Creating dm-striped Devices:

Device-Mapper’s “striped” target is used to create a striped (i.e. RAID-0) device across one or more underlying devices. Data is written in “chunks”, with consecutive chunks rotating among the underlying devices. The “chunk” size should match the page size discussed in the “IO Alignment Considerations” section of the pmem.io article linked below. This can potentially provide improved I/O throughput by utilizing several physical devices in parallel.

Note: If the physical devices have already been configured within interleaved sets, dm-stripe devices could potentially stripe across Non-Uniform Memory Architecture Nodes (NUMA Nodes).

If the HugePage size (2 MiB) is used as the ‘chunk size’, it’ll end up using PMDs for optimal efficiency and performance. Unlike dm-raid*, dm-striped doesn’t have an option for a separate metadata device, so the alignment will always work out.
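A minimal sketch of a dm-striped table with a 2 MiB chunk (4096 sectors of 512 bytes) might look like the following. The helper name striped_table and the device name pmem-striped are hypothetical, and the sizes in the demo call are placeholders. The helper rounds the smallest device down to a whole number of chunks so the total length divides evenly by both the chunk size and the stripe count.

```shell
#!/bin/sh
# Print a dm-striped (RAID-0) table across the given devices.
# Usage: striped_table <chunk_sectors> <device:size_in_sectors>...
# (striped_table is a hypothetical helper, not part of dmsetup.)
striped_table() {
    chunk=$1; shift
    count=$#
    min=""
    devs=""
    for pair in "$@"; do
        dev=${pair%%:*}
        sectors=${pair##*:}
        # Track the smallest device; every stripe uses the same amount.
        if [ -z "$min" ] || [ "$sectors" -lt "$min" ]; then min=$sectors; fi
        devs="$devs $dev 0"
    done
    # Round the per-device share down to a whole number of chunks.
    per_dev=$(( min / chunk * chunk ))
    # Format: <start> <length> striped <#stripes> <chunk> <dev> <offset> ...
    echo "0 $(( per_dev * count )) striped $count $chunk$devs"
}

# Demo with placeholder sizes; 4096 sectors = 2 MiB chunk.
striped_table 4096 /dev/pmem0:10000 /dev/pmem1:9000

# On the real machine (as root), something along these lines:
#   striped_table 4096 "/dev/pmem0:$(blockdev --getsz /dev/pmem0)" \
#                      "/dev/pmem1:$(blockdev --getsz /dev/pmem1)" \
#       | dmsetup create pmem-striped
```

With the placeholder sizes, the smaller device holds two full chunks (8192 sectors), so the demo prints a 16384-sector striped target across both devices.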

For more details, please check the following link: pmem.io: Using Persistent Memory Devices with the Linux Device Mapper

So, your choices are to have one interleave set per socket (the system will not interleave persistent memory across sockets) or turn off interleaving completely and put each memory module in its own set. Additionally, the Linux term "region" is really the same thing as what we call "interleave set" at the HW level. They always match up 1:1. An interleave set (or region) can be sub-divided into namespaces. The namespaces cannot be used to combine multiple interleave sets together. To do that, you need to either use the device-mapper or a library like PMDK, which connects them via the "poolset" feature where you create a PMDK pool from multiple files.

Software RAID is one option, but it may not be the best one; that depends on the application's requirements. The optimal solution is usually not to use software RAID at all, whether through Device Mapper, mdadm, or LVM. Note that only the stripe and linear Device Mapper targets have DAX enablement, so more complex RAID levels (for example, on 4- or 8-socket systems) will not work with DAX.

Similar to the SW RAID solution are 'poolsets'. PMDK supports persistent memory pool sets, where a larger pool can be created from two or more smaller pools. Specifically, those smaller pools may be created on different NUMA nodes/CPU Sockets. The following example poolset config file creates a 400GiB memory pool using 2 x 200GiB files, one from CPU0 and one from CPU1 (assumes you have created the regions, namespaces, and mounted them on /pmemfs0 and /pmemfs1). At this time, we do lose NUMA locality information as PMDK will concatenate the smaller pools so you can lose performance depending on which CPU the thread is running when accessing the data.

PMEMPOOLSET
OPTION NOHDRS
200G /pmemfs0/myfile.part0
200G /pmemfs1/myfile.part1

See the poolset(5) man page for more info.
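If you have more mount points or want different part sizes, the poolset file can be generated rather than typed by hand. The sketch below uses a hypothetical helper, write_poolset, and deliberately leaves out the OPTION line from the example above, since the accepted option names may differ between PMDK versions (check poolset(5) for the one you have installed).

```shell
#!/bin/sh
# Write a poolset file with one equally sized part per path.
# Usage: write_poolset <output-file> <part-size> <part-path>...
# (write_poolset is a hypothetical helper, not part of PMDK.)
write_poolset() {
    out=$1; size=$2; shift 2
    {
        echo "PMEMPOOLSET"
        # No OPTION line here on purpose; add one per poolset(5) if needed.
        for part in "$@"; do
            echo "$size $part"
        done
    } > "$out"
}

# Demo: write the two-part layout to a temporary file and show it.
set_file=$(mktemp)
write_poolset "$set_file" 200G /pmemfs0/myfile.part0 /pmemfs1/myfile.part1
cat "$set_file"
rm -f "$set_file"

# With PMDK installed and both filesystems mounted, the pool could then
# be created from a saved set file (path is a placeholder), e.g.:
#   pmempool create obj /path/to/app.set
```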

Intel platforms are NUMA (Non-Uniform Memory Architecture). To have CPUs access any address with the same latency requires UMA (Uniform Memory Architecture).

Intel CPUs have an integrated memory controller (IMC) that manages the memory physically attached to that CPU socket. If a CPU receives a request for an address outside the range it manages, it has a directory it can look up to forward the request to the CPU that does manage the requested memory address. The request then goes over the UPI interface to be satisfied. To learn more about how the UPI and directory updates work, take a look at these videos:

When you introduce RAID, it loses the NUMA locality information. For this reason, it is recommended to make the application NUMA aware, if not already, meaning it can memory map files (persistent memory pools) from any of the CPU sockets on the host and ensure threads run on the appropriate CPU. There are several techniques to achieve this including, but not limited to, some of the following:

The app creates pools of worker threads to handle the data requests. Each thread pool is assigned to a specific CPU socket using a local persistent memory pool
The app assigns specific data structures to specific persistent memory pools
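One simple way to approximate the first technique without changing the application is to start one worker process per socket under numactl. This is only a sketch: pinned_cmd is a hypothetical helper, ./worker and its --pool flag stand in for your own application, and it assumes /pmemfs0 sits on NUMA node 0 and /pmemfs1 on node 1 (to my knowledge, ndctl list -R -v reports a numa_node field you can use to confirm this).

```shell
#!/bin/sh
# Build a numactl command line that keeps a worker's CPUs and DRAM
# allocations on a single NUMA node.
# Usage: pinned_cmd <numa-node> <command> [args...]
# (pinned_cmd is a hypothetical helper; it only prints the command.)
pinned_cmd() {
    node=$1; shift
    echo "numactl --cpunodebind=$node --membind=$node $*"
}

# One worker per socket, each using the persistent memory pool that is
# local to that socket (paths and flags are placeholders).
pinned_cmd 0 ./worker --pool /pmemfs0/pool0
pinned_cmd 1 ./worker --pool /pmemfs1/pool1
```

The helper only prints the command lines so you can review them first; pipe the output to sh, or call numactl directly, once the node numbers are confirmed.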

In summary, apps should be NUMA aware and designed to access data optimally by running threads on the CPUs closest to the data they need to access (read or write).

Regards,

Adrian M.

Intel Customer Support Technician

AdrianM_Intel · ‎03-02-2021

Hello huangwentao,

Thank you for posting on the Intel® communities.

To better assist you, could you please let us know the model of the motherboard that you are using?

Regards,

Adrian M.

Intel Customer Support Technician

huangwentao · ‎03-03-2021

Many thanks for the detailed reply, Adrian. It helps me a lot.

AdrianM_Intel · ‎03-03-2021

Hello huangwentao,

Thank you for your response.

I am glad to know that the answer helped you. Let me know if you have more questions or if we can close this thread.

Regards,

Adrian M.

Intel Customer Support Technician

AdrianM_Intel · ‎03-05-2021

Hello huangwentao,

Were you able to check the previous post?

Let me know if you need more assistance.

Regards,

Adrian M.

Intel Customer Support Technician

huangwentao · ‎03-05-2021

Thank you Adrian, will accept your answer as the solution.