Why does CLFLUSHOPT improve remote CXL memory write performance?

XiaoxiangWu · ‎06-03-2026

Hello everyone and @McCalpinJohn ,

I'm currently experimenting with and optimizing the use of CXL memory, both locally and remotely across NUMA nodes.

My test platform consists of:

2 × Intel(R) Xeon(R) Platinum 8468
2 NUMA nodes, 48 cores per socket
Hyper-Threading and Turbo Boost disabled
256 GB DDR memory attached to each NUMA node
1 × 256 GB CXL memory expansion card attached to NUMA node 0
The CXL device is configured as a CPU-less NUMA node (node 2)

For this experiment, I do not use DDR memory. Instead, I force allocation from the CXL device while executing on the remote socket:

numactl -N 1 -m 2 ./a.out

And I always compile with:

gcc -fopenmp -O3 cxl.memset.c

I started with a simple multi-threaded memset benchmark:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <x86intrin.h>

int main() {

  char *mem;
  const size_t size = 1073741824;
  const size_t gran = 64;
  if (posix_memalign((void **)&mem, gran, size))
    return 1;
  memset(mem, 0, size);

  omp_set_num_threads(48);
  double st = omp_get_wtime();

  for (int r = 0; r <= 100; r++) {

#pragma omp parallel for
    for (size_t i = 0; i < size; i += gran) {
      memset(mem + i, (int)i, gran);
      asm volatile("clflushopt %0" : "+m"(*(volatile char *)(mem + i)));
    }
  }

  double elapsed = omp_get_wtime() - st;

  printf("elapsed: %.2fs\n", elapsed);
}

The observation I do not yet understand is that adding clflushopt after each 64-byte write significantly improves performance.

Without clflushopt: 9.61 s
With clflushopt: 4.61 s

Interestingly, this effect only appears when the CXL memory is accessed remotely from another NUMA node. When the same experiment is run locally, clflushopt provides little to no benefit. There is also no benefit when accessing remote DRAM. So it is unique to remote CXL memory.

I investigated this using perf stat, including Topdown metrics and lower-level groups such as:

tma_backend_bound_group
tma_memory_bound_group
tma_store_bound_group

The counters mostly confirm what I already know (e.g., fewer store stalls, fewer cycles with outstanding RFOs, etc.), but they do not explain the underlying mechanism responsible for such a large speedup.

Has anyone encountered a similar phenomenon, particularly with remote CXL-attached memory? Is there a microarchitectural explanation for why aggressively flushing cache lines would improve throughput so dramatically in this scenario?

Thanks in advance for any insights.