The right way to use _mm_clflush, _mm_clwb

huangwentao

Hi all, I am starting to use intrinsics such as _mm_clflush, _mm_clflushopt, and _mm_clwb.

Suppose I have defined a struct variable called 'mystruct' whose size is 512 bytes.

If I want to flush the cache lines that hold 'mystruct', which is the right way to do it:

_mm_clflush(&mystruct);

or

for (int i = 0; i < sizeof(mystruct)/64; i++) {
     _mm_clflush( ((char *)&mystruct) + i);
}

Can anybody tell me which is the right way to flush?

Many thanks for the help.

McCalpinJohn

These intrinsics accept a single address and perform the requested operation on the cache line containing that address.  So you will need 8 or 9 CLFLUSH operations for a 512 Byte structure (depending on its alignment).  
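
To see where the "8 or 9" comes from, here is a minimal sketch in C that counts how many cache lines a byte range touches. The 64-byte line size and the helper name lines_spanned are assumptions made for illustration; nothing here is specific to the flush intrinsics themselves.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64u  /* assumed cache line size in bytes */

/* Number of cache lines touched by the byte range [addr, addr + len).
 * Hypothetical helper, used only to illustrate the arithmetic. */
static size_t lines_spanned(uintptr_t addr, size_t len)
{
    if (len == 0)
        return 0;
    uintptr_t first = addr / CACHE_LINE;              /* line holding the first byte */
    uintptr_t last  = (addr + len - 1) / CACHE_LINE;  /* line holding the last byte  */
    return (size_t)(last - first + 1);
}
```

A 512-byte struct that starts on a 64-byte boundary spans 8 lines; the same struct starting at any other offset within a line spans 9, so that is the number of flushes it needs.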

huangwentao

I realize that the second option in my code above has a mistake.

It should be amended as follows (I would like to flush every cache line that mystruct occupies):

for (size_t i = 0; i < sizeof(mystruct)/64; i++) {
     _mm_clflush( ((char *)&mystruct) + i*64);
}

 

huangwentao

Thanks for the clarification, John.

McCalpinJohn

If the struct is not 64-Byte-aligned, this loop will miss the cache line that holds the final, partial portion of the struct.

There are several approaches to implementing the more general code -- I almost always have to re-create and test these from scratch to make sure I got the logic right.  

I *think* that all you need to do is add:

if ( ((uintptr_t)&mystruct) % 64 != 0 ) {    /* uintptr_t is declared in <stdint.h> */
    _mm_clflush( ((char *)&mystruct) + 511);  /* last byte of the 512-byte struct */
}

 

huangwentao

Thanks John.

So, to keep the code simple, I can just add this snippet after the loop that flushes every 64 bytes and let it decide whether an additional cache line flush is needed for the struct.

McCalpinJohn

This extra code is not always required -- it depends on the specific combination of the length of the struct and its alignment relative to cache line boundaries.  There are other ways to structure the logic -- you could take the floor of the starting address/64Bytes and the ceiling of the ending address/64Bytes and use those as the loop bounds.  
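
The floor/ceiling idea can be sketched roughly as follows. This is one possible way to write it, not tested on every compiler: the helper name flush_range, the 64-byte line size, and the returned count are assumptions for illustration. It needs an x86 target; _mm_clflush comes from <emmintrin.h>.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* _mm_clflush (SSE2) */

#define CACHE_LINE 64u  /* assumed cache line size in bytes */

/* Flush every cache line overlapping [p, p + len).
 * Returns the number of lines flushed (handy for checking the logic).
 * Hypothetical helper; does not guard against address wrap-around. */
static size_t flush_range(const void *p, size_t len)
{
    if (len == 0)
        return 0;

    const uintptr_t mask = ~(uintptr_t)(CACHE_LINE - 1);
    /* floor of the start address to a line boundary */
    uintptr_t start = (uintptr_t)p & mask;
    /* ceiling of the end address (one past the last byte) to a line boundary */
    uintptr_t end = ((uintptr_t)p + len + CACHE_LINE - 1) & mask;

    size_t flushed = 0;
    for (uintptr_t a = start; a < end; a += CACHE_LINE) {
        _mm_clflush((const void *)a);
        flushed++;
    }
    return flushed;
}
```

With this shape, flush_range(&mystruct, sizeof(mystruct)) covers both the aligned and the unaligned case without a separate check for the last line. If _mm_clflushopt or _mm_clwb is substituted for _mm_clflush, an _mm_sfence() after the loop is typically needed to order the flushes; that is left out of this sketch.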

CLFLUSH is a relatively low-overhead operation, so flushing the highest address in the structure every time (rather than working through the logic) won't have a noticeable performance impact.
