- Marquer comme nouveau
- Marquer
- S'abonner
- Sourdine
- S'abonner au fil RSS
- Surligner
- Imprimer
- Signaler un contenu inapproprié
Hello everyone and @McCalpinJohn ,
I'm currently experimenting with and optimizing the use of CXL memory, both locally and remotely across NUMA nodes.
My test platform consists of:
- 2 × Intel(R) Xeon(R) Platinum 8468
- 2 NUMA nodes, 48 cores per socket
- Hyper-Threading and Turbo Boost disabled
- 256 GB DDR memory attached to each NUMA node
- 1 × 256 GB CXL memory expansion card attached to NUMA node 0
- The CXL device is configured as a CPU-less NUMA node (node 2)
For this experiment, I do not use DDR memory. Instead, I force allocation from the CXL device while executing on the remote socket:
numactl -N 1 -m 2 ./a.outAnd I always compile with:
gcc -fopenmp -O3 cxl.memset.c
I started with a simple multi-threaded memset benchmark:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <x86intrin.h>
int main() {
char *mem;
const size_t size = 1073741824;
const size_t gran = 64;
if (posix_memalign((void **)&mem, gran, size))
return 1;
memset(mem, 0, size);
omp_set_num_threads(48);
double st = omp_get_wtime();
for (int r = 0; r <= 100; r++) {
#pragma omp parallel for
for (size_t i = 0; i < size; i += gran) {
memset(mem + i, (int)i, gran);
asm volatile("clflushopt %0" : "+m"(*(volatile char *)(mem + i)));
}
}
double elapsed = omp_get_wtime() - st;
printf("elapsed: %.2fs\n", elapsed);
}The observation I do not yet understand is that adding clflushopt after each 64-byte write significantly improves performance.
- Without clflushopt: 9.61 s
- With clflushopt: 4.61 s
Interestingly, this effect only appears when the CXL memory is accessed remotely from another NUMA node. When the same experiment is run locally, clflushopt provides little to no benefit. There is also no benefit when accessing remote DRAM. So it is unique to remote CXL memory.
I investigated this using perf stat, including Topdown metrics and lower-level groups such as:
- tma_backend_bound_group
- tma_memory_bound_group
- tma_store_bound_group
The counters mostly confirm what I already know (e.g., fewer store stalls, fewer cycles with outstanding RFOs, etc.), but they do not explain the underlying mechanism responsible for such a large speedup.
Has anyone encountered a similar phenomenon, particularly with remote CXL-attached memory? Is there a microarchitectural explanation for why aggressively flushing cache lines would improve throughput so dramatically in this scenario?
Thanks in advance for any insights.
Lien copié
- S'abonner au fil RSS
- Marquer le sujet comme nouveau
- Marquer le sujet comme lu
- Placer ce Sujet en tête de liste pour l'utilisateur actuel
- Marquer
- S'abonner
- Page imprimable