
Concurrency support of CLFLUSH


As far as I know, the new CLFLUSHOPT instruction is non-blocking while CLFLUSH is blocking, as shown on page 20 of Ref 1.

To my surprise, with the micro-benchmark shown below, I did see a performance gain when I changed OMP_NUM_THREADS from 1 to 32.

I just want to know whether CLFLUSH is a blocking instruction and, in an OpenMP environment, whether we need to replace CLFLUSH with CLFLUSHOPT when the processor supports it.

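#include <stdio.h>
#include <stdlib.h>

#define MM 900000   /* assumed: the original definition was not shown; matches the size of addr[] */

/* Flush the cache line containing *p. */
static inline void cflush_yh(volatile int *p) {
    asm volatile ("clflush (%0)" :: "r"(p) : "memory");
}

/* Read the time-stamp counter; EDX:EAX are read separately so this also works on x86-64. */
static __inline__ unsigned long long rdtsc(void) {
    unsigned int lo, hi;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)hi << 32) | lo;
}

void omp_flush(int *src, int *dst, size_t n) {
    long i;
    static int *addr[MM];   /* static: about 7 MB is too much to rely on the stack */
    unsigned long long cycle, elapsed;
    for (i = 0; i < (long)n; i++) {
        dst[i] = src[i];
        addr[i] = &dst[i];
    }
    cycle = rdtsc();
    #pragma omp parallel for
    for (i = 0; i < (long)n; i++) {
        cflush_yh(addr[i]);
    }
    elapsed = rdtsc() - cycle;
    printf("flush cycles: %llu\n", elapsed / MM);
}

int main() {
    int *a, *b;
    int i;
    a = (int *) malloc(MM*sizeof(int));
    b = (int *) malloc(MM*sizeof(int));
    for (i = 0; i < MM; ++i) {
        a[i] = i;
        b[i] = MM - i;
    }
    omp_flush(a, b, MM);   /* assumed call: the original call and its timing code were lost from the post */
    free(a);
    free(b);
    return 0;
}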



1 Reply

I am not sure that this is measuring what you think it is measuring....

It is a bit hard to tell what you are doing -- the array being flushed, "addr", is private to the "omp_flush()" function and is never used again after the "clflush" instruction, so the compiler does not need to actually perform either the assignment to the array or the "clflush" operation.

You don't provide any examples of the timings from the function, so it is not clear whether the reported time is significant compared to the overhead of the "omp parallel for" region.
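One way to put a number on that overhead is to time a parallel loop whose body does almost nothing. The following is only a sketch: the helper name region_overhead, the 64-iteration loop, and the use of the GCC/Clang __rdtsc intrinsic are assumptions for illustration, not part of the original benchmark.

```c
#include <assert.h>
#include <x86intrin.h>   /* __rdtsc (GCC/Clang, x86 only) */

/* Hypothetical helper: average TSC ticks spent per entry/exit of a
   parallel loop with a trivial body. The loop iteration total is
   returned through *iterations_run so the work cannot be discarded. */
static unsigned long long region_overhead(int reps, long *iterations_run) {
    assert(reps > 0 && iterations_run != 0);
    long total = 0;
    unsigned long long t0 = __rdtsc();
    for (int r = 0; r < reps; r++) {
        #pragma omp parallel for reduction(+:total)
        for (int i = 0; i < 64; i++)
            total += 1;
    }
    unsigned long long elapsed = __rdtsc() - t0;
    *iterations_run = total;
    return elapsed / (unsigned long long)reps;
}
```

If the per-region figure measured this way is of the same order as the total time reported by the flush loop, the benchmark is mostly measuring OpenMP startup rather than the flushes. Build with -fopenmp; without it the pragma is ignored and the figure reflects only the serial loop.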

You don't need to flush every array element. Cache lines are 64 Bytes on every recent Intel processor, so for 64-bit pointers you only need to flush every 8th element of addr[] to ensure that you have referenced every cache line.
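The stride depends on the size of the elements being flushed: with 64-byte lines it is 8 for 8-byte pointers and 16 for 4-byte ints. A minimal sketch of flushing once per cache line follows; the names flush_stride and flush_range are hypothetical, and the 64-byte line size is assumed rather than queried from the processor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>   /* _mm_clflush (SSE2, x86 only) */

#define CACHE_LINE 64    /* assumed line size */

/* Number of elements that share one cache line. */
static size_t flush_stride(size_t elem_size) {
    assert(elem_size != 0 && elem_size <= CACHE_LINE);
    return CACHE_LINE / elem_size;
}

/* Flush every cache line that overlaps [p, p + bytes).
   Returns the number of CLFLUSH instructions issued. */
static size_t flush_range(const void *p, size_t bytes) {
    if (bytes == 0)
        return 0;
    uintptr_t mask  = ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t first = (uintptr_t)p & mask;                /* round down */
    uintptr_t last  = ((uintptr_t)p + bytes - 1) & mask;  /* line of last byte */
    size_t lines = 0;
    for (uintptr_t a = first; a <= last; a += CACHE_LINE) {
        _mm_clflush((const void *)a);
        lines++;
    }
    return lines;
}
```

For a 900000-element int array this issues roughly 56250 flushes instead of 900000, since 16 ints fit in each line.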

The term "blocking" is not appropriate for describing the ordering of CLFLUSH. As I explained a few months ago, CLFLUSH is *ordered* with respect to other CLFLUSH operations, but not with respect to CLFLUSHOPT operations to different addresses. The *ordering* property only applies within each logical processor's instruction stream, so OpenMP parallel loops will be able to run concurrently. Within each OpenMP thread, the CLFLUSH instructions will execute in program order, but there may be a great deal of concurrency even within that single thread.

Again (as I also explained in October), CLFLUSH instructions on dirty data are expected to be lightweight -- if the data is dirty in the cache, then no other cache in the system can hold the data, and the execution of the CLFLUSH instruction simply requires initiation of a writeback operation from the cache to memory. Since we know that the processor can execute stores (which are strongly ordered) with significant concurrency using a single logical processor, it should certainly be able to execute the writeback portion of the store operation with at least as much concurrency.
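The dirty-line case described above can be set up explicitly by storing to each line just before flushing it. This is a rough sketch under stated assumptions: flush_dirty_lines is a hypothetical helper, the line size is taken to be 64 bytes, and the intrinsics are the GCC/Clang ones. Each OpenMP thread issues its own in-order stream of CLFLUSH instructions over its share of the lines.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence (SSE2, x86 only) */
#include <x86intrin.h>   /* __rdtsc (GCC/Clang) */

#define CACHE_LINE 64    /* assumed line size */

/* Hypothetical helper: dirty one byte in every line of buf, then flush
   the lines from an OpenMP loop. Returns the number of lines flushed;
   the elapsed TSC ticks for the flush loop are stored in *ticks. */
static long flush_dirty_lines(char *buf, size_t bytes,
                              unsigned long long *ticks) {
    assert(buf != NULL && ticks != NULL);
    long lines = (long)(bytes / CACHE_LINE);
    for (long i = 0; i < lines; i++)
        buf[i * CACHE_LINE] = 1;        /* the store leaves the line dirty */
    _mm_mfence();                        /* order the stores before timing */
    unsigned long long t0 = __rdtsc();
    #pragma omp parallel for
    for (long i = 0; i < lines; i++)
        _mm_clflush(&buf[i * CACHE_LINE]);
    _mm_mfence();                        /* fence this thread's flushes before the second read */
    *ticks = __rdtsc() - t0;
    return lines;
}
```

Running this with OMP_NUM_THREADS set to 1 and then to a larger value, on a buffer large enough to dwarf the parallel-region overhead, would show how much of the speedup comes from concurrency across threads. CLFLUSH writes dirty data back to memory without changing it, so the buffer contents are unchanged afterwards.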
