Solved: CFLUSH overhead

Steven_H_2 · ‎10-17-2016

Hello everyone, I want to do a data copy and flush the cache line before each data copy. My implementation are as follow.
I have 3 questions:
1. If my implementation is correct.
2. Can i use openmp directives to implement "parallel" cflush operations? Or is there any methods to implement it?
3. What's performance like, if i cflushing a cache line that not existing in the cache.
Thanks.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>
#include <stdint.h>

static size_t MM = 900;
static int size_int = 4;


inline void cflush_yh(volatile int *p) {
    asm volatile ("clflush (%0)" :: "r"(p));

}
void data_copy (int *src, int *dst, size_t n) {
	int i;
	int flush_range = sizeof(int);
	#pragma omp parallel for
	for (i = 0; i<n; i++) {
		dst = src  ;
		cflush_yh(src);
		//cflush_yh(src);
	}
}

int main() {
    int *a, *b;
    int i;
    clock_t start,end;
    double time_elapsed;
 
    a = (int *) malloc(MM*sizeof(int));
		b = (int *) malloc(MM*sizeof(int));


    for(i = 0; i < MM; ++i){
        a = i;
        b = MM-i;
    }

    start = clock();
    data_copy(a,b,MM);
    end = clock();
    time_elapsed = (end - start)/(double)CLOCKS_PER_SEC;

    printf("Time elapsed = %lf\n",time_elapsed);

    free(a);
    free(b);

    return 0;
  }

McCalpinJohn · ‎10-17-2016

The overhead of the CLFLUSH instruction depends on both the implementation and on the use case.

The example above is potentially a very bad idea for performance, since the CLFLUSH on src may cause up to 15 subsequent elements of src[] to be evicted from the cache before they are used. This could cause each cache line of the src[] array to be loaded from memory multiple times. The details depend on the processor generation and the optimization level used for the compilation of the code.

Because the CLFLUSH instruction is not ordered with respect to reads of the same cache line, any loads that are delayed (by a cache miss, for example) could still be pending when the CLFLUSH executes. This could cause the cache line to be evicted before it is actually used. The hardware will guarantee that the load will actually succeed, but it may require multiple memory accesses to do so -- even if there is only one CLFLUSH per cache line.

There is nothing wrong with using CLFLUSH in OpenMP parallel regions -- especially if the target addresses are non-overlapping.

The overhead of CLFLUSH is generally quite low -- it requires at least one issue slot to a read/write port, and may require additional micro-ops. This is one of the few instructions that Agner Fog does not track in his (otherwise) comprehensive collection of performance data at http://www.agner.org/optimize/instruction_tables.pdf. ; The CLFLUSH instruction is required to remove the cache line from *all* processor caches in the entire system, so it will require many of the same resources that are used to track a store that misses in all levels of the cache. If these resources are already busy, then the CLFLUSH may extend the overall program execution time, but the effect is indirect and difficult to quantify.

If the goal is to minimize the "cache pollution" caused by storing the source array, a much safer approach (from the performance perspective) would be to prefetch each cache line of the source array using the "PREFETCHNTA" instruction. The behavior is implementation-dependent (and I have not tested what this does on recent Intel processors), but the semantics are cleaner -- load the data, but treat it as a low priority for holding in the caches because I do not expect to use it again soon.

I use CLFLUSH when I am investigating the mapping of physical addresses to L3 cache slices or to memory controllers (i.e., so that I can repeatedly access a single address without the line being cached in the L1, L2, etc.). The global scope of the instruction suggests that it is intended as an aid to correctness in certain special cases, rather than as an aid to performance optimization. An alternative "local cache flush" makes more sense for performance optimization. This is discussed in my U.S. Patent 7,194,587 (https://www.google.com/patents/US7194587) and is the approach used in the CLEVICT* instructions in the first-generation Xeon Phi ("Knights Corner") architecture (https://software.intel.com/sites/default/files/forum/278102/327364001en.pdf)

View solution in original post

McCalpinJohn · ‎10-17-2016

The overhead of the CLFLUSH instruction depends on both the implementation and on the use case.

The example above is potentially a very bad idea for performance, since the CLFLUSH on src may cause up to 15 subsequent elements of src[] to be evicted from the cache before they are used. This could cause each cache line of the src[] array to be loaded from memory multiple times. The details depend on the processor generation and the optimization level used for the compilation of the code.

Because the CLFLUSH instruction is not ordered with respect to reads of the same cache line, any loads that are delayed (by a cache miss, for example) could still be pending when the CLFLUSH executes. This could cause the cache line to be evicted before it is actually used. The hardware will guarantee that the load will actually succeed, but it may require multiple memory accesses to do so -- even if there is only one CLFLUSH per cache line.

There is nothing wrong with using CLFLUSH in OpenMP parallel regions -- especially if the target addresses are non-overlapping.

The overhead of CLFLUSH is generally quite low -- it requires at least one issue slot to a read/write port, and may require additional micro-ops. This is one of the few instructions that Agner Fog does not track in his (otherwise) comprehensive collection of performance data at http://www.agner.org/optimize/instruction_tables.pdf. ; The CLFLUSH instruction is required to remove the cache line from *all* processor caches in the entire system, so it will require many of the same resources that are used to track a store that misses in all levels of the cache. If these resources are already busy, then the CLFLUSH may extend the overall program execution time, but the effect is indirect and difficult to quantify.

If the goal is to minimize the "cache pollution" caused by storing the source array, a much safer approach (from the performance perspective) would be to prefetch each cache line of the source array using the "PREFETCHNTA" instruction. The behavior is implementation-dependent (and I have not tested what this does on recent Intel processors), but the semantics are cleaner -- load the data, but treat it as a low priority for holding in the caches because I do not expect to use it again soon.

I use CLFLUSH when I am investigating the mapping of physical addresses to L3 cache slices or to memory controllers (i.e., so that I can repeatedly access a single address without the line being cached in the L1, L2, etc.). The global scope of the instruction suggests that it is intended as an aid to correctness in certain special cases, rather than as an aid to performance optimization. An alternative "local cache flush" makes more sense for performance optimization. This is discussed in my U.S. Patent 7,194,587 (https://www.google.com/patents/US7194587) and is the approach used in the CLEVICT* instructions in the first-generation Xeon Phi ("Knights Corner") architecture (https://software.intel.com/sites/default/files/forum/278102/327364001en.pdf)

Steven_H_2 · ‎10-21-2016

Hi John, It is my great honor to get prompt reply from you.

According to the results that i get, clflush really introduced larger overhead, and thanks for you advice.

But there are another two question that come to my mind and i wonder if you can answer it or not.

1. Do we need to add an MFENCE/SFENCE instruction or other instruction after each CLFLUSH, if i want to make sure that the cache is flushed to memory.

e.g. i want to load a[0] to b[0], and i want to make sure that the next time that i read b[0], i am reading it from memory instead of cache. What instructions should i insert between this two operations.

b[0] = a[0]

// instructions to be inserted.

c = b[0]

2. The overhead of evict a non-existing cache line is much larger than dirty cache line. Why?

e.g.

b[0] = a[0];

cflush(&b[0]); //The first time i flush the cache line.

cflush(&b[0]); // The second time i flush the cache line.

It is interesting that the overhead second cflush is much larger that the first one.

John McCalpin wrote:

The overhead of the CLFLUSH instruction depends on both the implementation and on the use case.

The example above is potentially a very bad idea for performance, since the CLFLUSH on src may cause up to 15 subsequent elements of src[] to be evicted from the cache before they are used. This could cause each cache line of the src[] array to be loaded from memory multiple times. The details depend on the processor generation and the optimization level used for the compilation of the code.

Because the CLFLUSH instruction is not ordered with respect to reads of the same cache line, any loads that are delayed (by a cache miss, for example) could still be pending when the CLFLUSH executes. This could cause the cache line to be evicted before it is actually used. The hardware will guarantee that the load will actually succeed, but it may require multiple memory accesses to do so -- even if there is only one CLFLUSH per cache line.

There is nothing wrong with using CLFLUSH in OpenMP parallel regions -- especially if the target addresses are non-overlapping.

The overhead of CLFLUSH is generally quite low -- it requires at least one issue slot to a read/write port, and may require additional micro-ops. This is one of the few instructions that Agner Fog does not track in his (otherwise) comprehensive collection of performance data at http://www.agner.org/optimize/instruction_tables.pdf.    The CLFLUSH instruction is required to remove the cache line from *all* processor caches in the entire system, so it will require many of the same resources that are used to track a store that misses in all levels of the cache. If these resources are already busy, then the CLFLUSH may extend the overall program execution time, but the effect is indirect and difficult to quantify.

If the goal is to minimize the "cache pollution" caused by storing the source array, a much safer approach (from the performance perspective) would be to prefetch each cache line of the source array using the "PREFETCHNTA" instruction. The behavior is implementation-dependent (and I have not tested what this does on recent Intel processors), but the semantics are cleaner -- load the data, but treat it as a low priority for holding in the caches because I do not expect to use it again soon.

I use CLFLUSH when I am investigating the mapping of physical addresses to L3 cache slices or to memory controllers (i.e., so that I can repeatedly access a single address without the line being cached in the L1, L2, etc.).    The global scope of the instruction suggests that it is intended as an aid to correctness in certain special cases, rather than as an aid to performance optimization. An alternative "local cache flush" makes more sense for performance optimization.   This is discussed in my U.S. Patent 7,194,587 (https://www.google.com/patents/US7194587) and is the approach used in the CLEVICT* instructions in the first-generation Xeon Phi ("Knights Corner") architecture (https://software.intel.com/sites/default/files/forum/278102/327364001en.pdf)

McCalpinJohn · ‎10-21-2016

The first question is a difficult one because there is essentially no guarantee that the CLFLUSH instruction will do what you expect it to do. The description of the operation of the CLFLUSH instruction in Volume 2 of the Intel Architectures SW Developer's Manual notes:

[...] data can be speculatively loaded into a cache line just before, during, or after the execution of a CLFLUSH instruction that references the cache line [....]

So the instruction will cause the processor to flush that line from the caches, but you don't know exactly when it will happen, or whether the processor will decide to speculatively fetch the line back into the cache after the CLFLUSH executes.

As far as ordering is concerned, the CLFLUSH instruction is ordered with respect to other CLFLUSH instructions, even if they are to different addresses, but not with respect to reads, and not with respect to writes to other cache lines. So the only way to ensure that the CLFLUSH does not attempt to execute before the corresponding read has completed is to add a fence. You do NOT want to execute a memory fence or a flush for every element. If you want the maximum chance of correctness, then one fence per cache line is sufficient, but the performance penalty will be high. A better approach is to operate on "blocks" of some integral number of cache lines. Execution of "block j" reads the j-th block of src, writes the j-th block of dst, and executes CLFLUSH on the (j-1)-th block of src. Then you only need one fence per "block" to guarantee that the CLFLUSH cannot be executed ahead of the corresponding read.

The second question is much easier! The first flush is fast because the data is dirty in the cache. If a cache line is dirty, no other cache in the system is allowed to have a copy of the cache line. In this case, the hardware knows that it only needs to initiate a writeback of the line and invalidate it in the local cache, and it does not need to send any messages to any other caches (or wait for responses to those messages). After the first flush, the data is not in the local cache, so the hardware has to do a global cache invalidation and wait for the responses from all the other caches.