Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

NUMA node specific

Inspur
Beginner

Dear all,

We have a cluster node equipped with two CXL memory cards (each containing 8 x DDR5 4800 MHz memory modules). In the OS, each CXL card's memory appears as an additional NUMA node, so the system exposes 6 NUMA nodes in total (4 DDR NUMA nodes + 2 CXL NUMA nodes).

For memory-bandwidth-limited MPI applications, we want to apply memory interleaving for some MPI processes to achieve higher aggregate memory bandwidth. In our tests, combining `mpirun` with `numactl` works well for pure MPI applications. For example, the CFD application OpenFOAM can run on the DDR NUMA nodes with the following command.

mpirun -n 64 simpleFoam -parallel
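(The DDR-only baseline can also be pinned explicitly with numactl; a minimal sketch, assuming the four DDR NUMA nodes are numbered 0-3:)

mpirun -n 64 numactl --membind=0-3 simpleFoam -parallel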

To use the CXL memory together with the DDR memory, we put the `numactl` command either before or after the `mpirun` command, as follows.

numactl --weighted-interleave=0-5 mpirun -n 64 simpleFoam -parallel

or 

mpirun -n 64 numactl --weighted-interleave=0-5 simpleFoam -parallel

With the combined CXL + DDR NUMA nodes, OpenFOAM achieves higher performance, because the aggregate bandwidth is higher than the DDR memory bandwidth alone.
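As far as we understand, the per-node weights that --weighted-interleave applies come from the kernel's global sysfs settings rather than from the numactl command line. A minimal sketch of setting weight 4 for the DDR nodes and weight 1 for the CXL nodes (assuming DDR is nodes 0-3, CXL is nodes 4-5, and a kernel new enough to expose the weighted-interleave sysfs interface, Linux 6.9 or later):

# run as root; adjust node numbers to match the actual topology
for n in 0 1 2 3; do echo 4 > /sys/kernel/mm/mempolicy/weighted_interleave/node$n; done
for n in 4 5; do echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node$n; done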

 

However, we would like to specify the CXL NUMA nodes through the Intel MPI Memory Policy Control environment variables, such as I_MPI_BIND_NUMA and I_MPI_BIND_ORDER. Oddly, the performance does not change when we set these variables as follows.

export I_MPI_BIND_NUMA=0-5
export I_MPI_BIND_ORDER=scatter
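(The same settings can also be passed per launch through Hydra's -genv option instead of exporting them:)

mpirun -genv I_MPI_BIND_NUMA 0-5 -genv I_MPI_BIND_ORDER scatter -n 64 simpleFoam -parallel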

We would like to know how to specify the NUMA nodes for each process, and whether we can define interleave weight parameters across multiple NUMA nodes.

For example:

Rank 0-15  : NUMA node 0, 4; Weight 4, 1
Rank 16-31 : NUMA node 1, 4; Weight 4, 1
Rank 32-47 : NUMA node 2, 5; Weight 4, 1
Rank 48-63 : NUMA node 3, 5; Weight 4, 1
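A layout like the one above could in principle be approximated with a per-rank numactl wrapper, since the weighted-interleave weights are global sysfs settings while the node set can be chosen per process. A minimal, untested sketch (the script name numa_wrap.sh is just an example; it assumes Intel MPI's Hydra launcher exports the global rank in PMI_RANK and that the sysfs weights were set as above):

#!/bin/bash
# numa_wrap.sh -- hypothetical per-rank numactl wrapper (sketch, untested)
# Assumes: the launcher exports the global rank in PMI_RANK,
# NUMA nodes 0-3 are DDR and 4-5 are CXL, and the weighted-interleave
# weights (4 for DDR, 1 for CXL) were already set via sysfs.
rank=${PMI_RANK:?PMI_RANK is not set}
case $(( rank / 16 )) in
    0) nodes=0,4 ;;
    1) nodes=1,4 ;;
    2) nodes=2,5 ;;
    3) nodes=3,5 ;;
esac
exec numactl --weighted-interleave=$nodes "$@"

It would then be launched as:

mpirun -n 64 ./numa_wrap.sh simpleFoam -parallel

However, we would prefer to express this mapping directly with the I_MPI_* variables if that is possible.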

 

TobiasK
Moderator

@Inspur please use the priority support channel for such requests. At the moment, the numactl approach you propose is the preferred way.
