The right way to use _mm_clflush, _mm_clwb

huangwentao

Hi all, I am starting to use intrinsics such as _mm_clflush, _mm_clflushopt, and _mm_clwb.

Say I have defined a struct variable called 'mystruct' whose size is 512 bytes.

If I want to flush 'mystruct' from the cache, which of the following is the right way to do it:

_mm_clflush(&mystruct);

or

for (int i = 0; i < sizeof(mystruct)/64; i++) {
     _mm_clflush( ((char *)&mystruct) + i);
}

Can anybody tell me which is the right way to flush?

Many thanks for the help.

huangwentao

I realize the loop in my second option above has a mistake: it advances the address by one byte per iteration rather than by one cache line.

It should be amended as follows (I would like to flush every cache line that mystruct occupies):

for (int i = 0; i < sizeof(mystruct)/64; i++) {
     _mm_clflush( ((char *)&mystruct) + i*64);
}

 

McCalpinJohn

These intrinsics accept a single address and perform the requested operation on the cache line containing that address.  So you will need 8 or 9 CLFLUSH operations for a 512 Byte structure (depending on its alignment).  
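
To illustrate the 8-versus-9 count, here is a minimal sketch in C. The helper name lines_touched and the CACHE_LINE constant are made up for this example, and the 64-byte line size is an assumption. It only does the address arithmetic and issues no flushes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* assumed cache line size in bytes */

/* Hypothetical helper: number of cache lines overlapped by the
 * byte range [addr, addr + len). */
static size_t lines_touched(uintptr_t addr, size_t len)
{
    if (len == 0)
        return 0;
    assert(len - 1 <= UINTPTR_MAX - addr);          /* range must not wrap */
    uintptr_t first = addr / CACHE_LINE;             /* line of first byte */
    uintptr_t last  = (addr + len - 1) / CACHE_LINE; /* line of last byte  */
    return (size_t)(last - first + 1);
}
```

Under these assumptions, a 512-byte object that starts on a 64-byte boundary overlaps 8 lines, and one that starts 32 bytes past a boundary overlaps 9.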

huangwentao

Thanks for the clarification, John.

McCalpinJohn

If the struct is not 64-Byte-aligned, this loop will miss the cache line that holds the final, partial portion of the struct.

There are several approaches to implementing the more general code -- I almost always have to re-create and test these from scratch to make sure I got the logic right.  

I *think* that all you need to do is add:

if ( ((uintptr_t)&mystruct) % 64 != 0 ) {      /* uintptr_t is declared in <stdint.h> */
    _mm_clflush( ((char *)&mystruct) + 511);   /* last byte of the 512-byte struct */
}

 

huangwentao

Thanks John.

So, to keep the code simple, I can just add this snippet after the loop that flushes every 64 bytes and let it decide whether an additional cache line flush is needed for the struct.

McCalpinJohn

This extra code is not always required -- it depends on the specific combination of the length of the struct and its alignment relative to cache line boundaries.  There are other ways to structure the logic -- you could take the floor of the starting address/64Bytes and the ceiling of the ending address/64Bytes and use those as the loop bounds.  

CLFLUSH is a relatively low-overhead operation, so flushing the highest address in the structure every time (rather than working through the logic) won't have a noticeable performance impact.
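
The floor/ceiling variant could look roughly like the sketch below. The function name flush_range and the CACHE_LINE constant are invented for illustration, the 64-byte line size is an assumption, and any memory-ordering fences the caller may need are left out. I have not benchmarked it.

```c
#include <assert.h>
#include <emmintrin.h>  /* _mm_clflush (SSE2) */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* assumed cache line size in bytes */

/* Hypothetical helper: flush every cache line overlapped by
 * [p, p + len) and return how many CLFLUSH operations were issued. */
static size_t flush_range(const void *p, size_t len)
{
    assert(p != NULL);
    if (len == 0)
        return 0;

    /* Floor: round the start address down to a line boundary. */
    uintptr_t a   = (uintptr_t)p & ~(uintptr_t)(CACHE_LINE - 1);
    /* One past the last byte; stepping while a < end covers the
     * same lines as rounding the end address up (the ceiling). */
    uintptr_t end = (uintptr_t)p + len;

    size_t n = 0;
    for (; a < end; a += CACHE_LINE) {
        _mm_clflush((const void *)a);
        n++;
    }
    return n;
}
```

For the struct in this thread the call would be flush_range(&mystruct, sizeof(mystruct)), which issues 8 flushes when the struct is 64-Byte-aligned and 9 otherwise, with no separate check for the tail.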
