Why does CLFLUSHOPT improve remote CXL memory write performance?

XiaoxiangWu · ‎06-03-2026

Hello everyone and @McCalpinJohn ,

I'm currently experimenting with and optimizing the use of CXL memory, both locally and remotely across NUMA nodes.

My test platform consists of:

2 × Intel(R) Xeon(R) Platinum 8468
2 NUMA nodes, 48 cores per socket
Hyper-Threading and Turbo Boost disabled
256 GB DDR memory attached to each NUMA node
1 × 256 GB CXL memory expansion card attached to NUMA node 0
The CXL device is configured as a CPU-less NUMA node (node 2)

For this experiment, I do not use DDR memory. Instead, I force allocation from the CXL device while executing on the remote socket:

numactl -N 1 -m 2 ./a.out

And I always compile with:

gcc -fopenmp -O3 cxl.memset.c

I started with a simple multi-threaded memset benchmark:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <x86intrin.h>

int main() {

  char *mem;
  const size_t size = 1073741824;
  const size_t gran = 64;
  if (posix_memalign((void **)&mem, gran, size))
    return 1;
  memset(mem, 0, size);

  omp_set_num_threads(48);
  double st = omp_get_wtime();

  for (int r = 0; r <= 100; r++) {

#pragma omp parallel for
    for (size_t i = 0; i < size; i += gran) {
      memset(mem + i, (int)i, gran);
      asm volatile("clflushopt %0" : "+m"(*(volatile char *)(mem + i)));
    }
  }

  double elapsed = omp_get_wtime() - st;

  printf("elapsed: %.2fs\n", elapsed);
}

The observation I do not yet understand is that adding clflushopt after each 64-byte write significantly improves performance.

Without clflushopt: 9.61 s
With clflushopt: 4.61 s

Interestingly, this effect only appears when the CXL memory is accessed remotely from another NUMA node. When the same experiment is run locally, clflushopt provides little to no benefit. There is also no benefit when accessing remote DRAM. So it is unique to remote CXL memory.

I investigated this using perf stat, including Topdown metrics and lower-level groups such as:

tma_backend_bound_group
tma_memory_bound_group
tma_store_bound_group

The counters mostly confirm what I already know (e.g., fewer store stalls, fewer cycles with outstanding RFOs, etc.), but they do not explain the underlying mechanism responsible for such a large speedup.

Has anyone encountered a similar phenomenon, particularly with remote CXL-attached memory? Is there a microarchitectural explanation for why aggressively flushing cache lines would improve throughput so dramatically in this scenario?

Thanks in advance for any insights.

Steve_Jerome22 · ‎06-03-2026

Hi XiaoxiangWu,

Greetings for the day!

Thanks for contacting Intel Customer Support with your query:

Please refer to the below Intel documents which is related to microarchitecture, cache behavior, and CXL memory architecture

4th Gen Xeon Scalable overview:

4th Gen Intel Xeon Processor Scalable Family, sapphire rapids

CXL Memory Device Software Guide:

CXL* Memory Device Software Guide

Intel® 64 and IA-32 Architectures Software Developer’s Manual

Intel® 64 and IA-32 Architectures Software Developer’s Manual Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D, and 4

Please let us know if you have any further questions.

Regards

Jerome

Intel Customer Support Technician

XiaoxiangWu · ‎06-04-2026

Hello Jerome,

Thank you for your reply.

Do these documents discuss or help explain the performance behavior I observed when accessing the CXL device from a remote NUMA node?

Regards,
Xiaoxiang

Subhashish · ‎06-04-2026

Hello XiaoxiangWu,

Thank you for your response.

The three documents shared to you earlier merely gives you an impression on your subjected question in the support ticket - Why does CLFLUSHOPT improve remote CXL memory write performance?

The documented weakly-ordered nature of CLFLUSHOPT combined with CXL's cache coherent architecture suggests that immediate cache line eviction should reduce coherency protocol overhead and create more predictable access patterns to CXL memory devices.

As of your question based on the test performed, whether there a microarchitectural explanation for why aggressively flushing cache lines would improve throughput so dramatically in this scenario, we do not have any exact documentation available in public to explain this but we will check this in our end and try to assist you in best possible manner within our scope of support.

Kindly await our next response.

Regards,

Subhashish_Intel.

XiaoxiangWu · ‎06-04-2026

Hello Subhashish,

Thank you very much for the follow-up.

My apologies if my question goes somewhat beyond the typical scope of a product support forum. I have also posted the same question on the Developer Software Forums.

I am actively investigating this issue myself. So far, I have used several publicly available tools, including Intel PCM and Linux perf. However, the measurements they provide seem relatively high-level. I can observe improvements in a number of expected counters, but none of them appear to explain the root cause of the performance difference.

Previously, I studied a similar throughput improvement on Intel Persistent Memory and published the work as *Pre-Stores: Proactive Software-guided Movement of Data Down the Memory Hierarchy*. In that case, aggressive use of `CLFLUSHOPT` significantly reduced write amplification. Interestingly, that conclusion could not be reached using PCM or perf counters alone; it required measurements from Intel's `ipmctl` tool. The situation here feels somewhat similar. I believe I need visibility into lower-level microarchitectural behavior, but I am currently unsure which tools or measurements would help me get there.

Any insights would be greatly appreciated, as would suggestions for profiling tools or methodologies that I may have overlooked.

Regards,
Xiaoxiang

Poojitha · ‎06-04-2026

Hi XiaoxiangWu,

Thank you for your patience. We have observed performance improvement when using CLFLUSHOPT is consistent with how the CPU handles cache lines and memory access over CXL.

The CLFLUSHOPT instruction forces modified cache lines to be written back to memory and invalidated from the CPU cache hierarchy.

Remote CXL memory (accessed from another NUMA socket) has higher latency and lower bandwidth compared to both local memory and local CXL access.

In this configuration, without CLFLUSHOPT, cache lines remain modified and owned by the core longer, which increases coherence traffic and causes more stalls when writing to remote CXL memory.

By using CLFLUSHOPT after each write, cache lines are pushed out of the cache hierarchy earlier, which:

Reduces cache coherency and ownership overhead

Lowers the number of outstanding write-related stalls (e.g., RFOs)

Improves throughput on the high-latency remote CXL path

This is why the performance gain is significant for remote CXL memory, while the impact is minimal for local memory or standard DRAM.

Note: There is currently no specific public microarchitectural document that fully explains this exact behavior for remote CXL scenarios, but the observed results align with known cache and CXL characteristics.

Please refer the below reference document on RDC,

https://www.intel.com/content/www/us/en/secure/content-details/678513/cxl-mem-controller-architecture-reference.html?DocID=678513

To access the above document required CNDA, If you do not have access to RDC, please refer to below link for How to Apply for an Intel® Resource and Documentation Center (RDC) and/or Intel® Developer Zone (Intel® DevZone) Account.

https://www.intel.com/content/www/us/en/support/articles/000058073/programs/resource-and-documentation-center.html?wapkw=000058073

Additionally, please note that standard Corporate Non-Disclosure Agreement (CNDA) is required to access the RDC. If you do not have an active CNDA, we recommend contacting one of our Intel® Representative / Distributors from your organization, who will be able to assist you in getting the CNDA signed with Intel.

We recommend that you contact the CXL vendor and the OS vendor for validation, as this falls outside the scope of our support.

Best regards,

Poojitha N

Intel Customer Support Technician