Hi, Dr Bandwidth,
I extracted a few critical comments from your previous replies, https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring... and https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring....
I just want to ask if my conclusion at the end of this post is correct or not.
"The *ordering* property only applies within each logical processor's instruction stream, so OpenMP parallel loops will be able to run concurrently. Within each OpenMP thread, the CLFLUSH instructions will execute in program order, but there may be a great deal of concurrency even within that single thread. "
"There is nothing wrong with using CLFLUSH in OpenMP parallel regions -- especially if the target addresses are non-overlapping."
"As far as ordering is concerned, the CLFLUSH instruction is ordered with respect to other CLFLUSH instructions, even if they are to different addresses, but not with respect to reads, and not with respect to writes to other cache lines"
"The overhead of CLFLUSH is generally quite low -- it requires at least one issue slot to a read/write port, and may require additional micro-ops."
"The CLFLUSH instruction is required to remove the cache line from *all* processor caches in the entire system, so it will require many of the same resources that are used to track a store that misses in all levels of the cache. If these resources are already busy, then the CLFLUSH may extend the overall program execution time, but the effect is indirect and difficult to quantify."
"The instruction will cause the processor to flush that line from the caches, but you don't know exactly when it will happen, or whether the processor will decide to speculatively fetch the line back into the cache after the CLFLUSH executes."
Based on your answers, my impression is that if multiple CLFLUSH instructions flush different cache lines and come from different threads (in other words, from different logical processors' instruction streams), then they should be able to run in parallel. If multiple CLFLUSH instructions come from the same thread, then they cannot run in parallel. The point of CLFLUSHOPT is to allow multiple cache lines to be flushed in parallel within a single logical processor's instruction stream.
Looking forward to your reply, Dr Bandwidth. Thanks.
There is a difference between instructions being "ordered" and instructions that "cannot run in parallel".
The CLFLUSH instruction is "ordered" with respect to other CLFLUSH instructions in the same logical processor's instruction stream. That means that the effects of the instructions have to appear to occur in program order. In most cases, cache coherence is maintained automatically by the hardware, so a CLFLUSH has no effect on program state: it only changes what data is held in the caches, so its only effect is on performance. Performance is generally *not* part of the program state.
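To make the "same logical processor" scope concrete, here is a minimal sketch of flushing a buffer from an OpenMP parallel loop. It assumes x86-64, GCC or Clang, and 64-byte cache lines; `flush_range` and `LINE_BYTES` are names invented for this illustration, not part of any library. Each thread gets a disjoint set of lines, so the ordering rule only constrains the CLFLUSHes issued within one thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>   /* _mm_clflush (SSE2, baseline on x86-64) */

#define LINE_BYTES 64    /* assumed cache line size */

/* Hypothetical helper: CLFLUSH every cache line that overlaps
   [buf, buf + len). Returns the number of lines flushed. */
size_t flush_range(const void *buf, size_t len)
{
    assert(buf != NULL);
    if (len == 0)
        return 0;

    uintptr_t mask  = ~(uintptr_t)(LINE_BYTES - 1);
    uintptr_t first = (uintptr_t)buf & mask;               /* first line */
    uintptr_t last  = ((uintptr_t)buf + len - 1) & mask;   /* last line  */
    size_t nlines   = (size_t)((last - first) / LINE_BYTES) + 1;

    /* Every iteration targets a different line. With -fopenmp the
       iterations are split across threads; without it the pragma is
       ignored and the loop runs serially. Within any one thread the
       CLFLUSHes are issued in program order. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < nlines; i++)
        _mm_clflush((const void *)(first + i * LINE_BYTES));

    return nlines;
}
```

Calling `flush_range(array, n * sizeof array[0])` after the array has been written would flush all of its lines; whether the threads actually overlap their flushes in time depends on the hardware, not on this code.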
It is often possible to guarantee that the appearance of the effect of the instructions is consistent with program order while still allowing substantial overlap. This is the whole point of pipelining. It may take hundreds of cycles to complete the execution of a CLFLUSH instruction on a "clean" address (since this requires a global invalidation snoop), but (depending on the implementation) it may not be necessary to wait for the first CLFLUSH to be "complete" before the second CLFLUSH can begin execution. It is only necessary to wait long enough to ensure that no other processor in the system can see the transactions out of order before beginning the execution of the next CLFLUSH. Depending on the details of the implementation, the delay may be as short as a single cycle.
For CLFLUSH on a "dirty" cache line, no global invalidation is needed. Only one cache can hold the line in Modified state, so the execution of the CLFLUSH only requires that the writeback of the dirty cache line be initiated. The details will differ depending on which level of the cache holds the Modified copy of the line, as well as what transaction types are available on the internal bus. For Intel processors, this is the "ring bus", and as far as I know there is no public documentation of the protocol used by this interface.
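If you want to see how much overlap your own processor allows, one option is to time a run of back-to-back CLFLUSH instructions and divide by the number of lines. The following is only a rough sketch under my own assumptions (x86-64, GCC or Clang, 64-byte lines, an invariant TSC); `time_clflush_run` is a made-up name, and a real measurement would need repetitions and thread pinning. The `dirty` flag selects whether the lines are Modified or clean when the timed flushes start.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_clflush, _mm_mfence, _mm_lfence */

#define LINE_BYTES 64    /* assumed cache line size */

/* Hypothetical helper: returns the TSC ticks spent issuing nlines
   back-to-back CLFLUSHes over buf (which must span nlines lines). */
uint64_t time_clflush_run(char *buf, size_t nlines, int dirty)
{
    assert(buf != NULL && nlines > 0);
    volatile char *p = buf;

    /* Start from a known state: flush everything first. */
    for (size_t i = 0; i < nlines; i++)
        _mm_clflush(buf + i * LINE_BYTES);
    _mm_mfence();

    /* Bring each line back: a store leaves it Modified ("dirty"),
       a load leaves it clean. */
    for (size_t i = 0; i < nlines; i++) {
        if (dirty)
            p[i * LINE_BYTES] = (char)i;
        else
            (void)p[i * LINE_BYTES];
    }
    _mm_mfence();
    _mm_lfence();

    uint64_t t0 = __rdtsc();
    for (size_t i = 0; i < nlines; i++)
        _mm_clflush(buf + i * LINE_BYTES);
    _mm_mfence();            /* fence ordered after the CLFLUSHes */
    _mm_lfence();            /* keep the second RDTSC from running early */
    uint64_t t1 = __rdtsc();

    return t1 - t0;
}
```

If the ticks per line come out far below the latency of a single isolated CLFLUSH, that would suggest the flushes are being overlapped within the one thread. I have not included numbers because they will vary by processor model and system configuration.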
The existence of the CLFLUSHOPT instruction implies that something about the ordering of CLFLUSH instructions creates some overhead, but it is not clear to me whether the benefit comes from relaxing the ordering with respect to (1) CLFLUSH operations to different addresses, (2) CLFLUSHOPT operations to different addresses, or (3) write operations to different addresses. My guess is that (3) is the most important, but the real answer is a combination of very low-level implementation details and Intel's priorities for optimization of their target workloads. Neither of these is likely to become public in any useful detail.
In reviewing this topic, I noticed that none of the discussions of CLFLUSH or CLFLUSHOPT make any reference to the processor store buffer (as discussed in Section 8.2 of Volume 3 of the Intel Architectures SW Developer's Manual). The ordering requirement between CLFLUSH and writes suggests that a side effect of the CLFLUSH instruction is flushing the store buffers (i.e., writing whatever is in the store buffers to the L1 Data Cache, so that all prior stores become visible on the coherence fabric). Flushing partially filled store buffers prevents them from effectively reducing the number of L1 Data Cache write cycles. This costs both time and energy, and is often not required to obtain the semantic behavior desired by the user. With the CLFLUSHOPT instruction, a store buffer only needs to be flushed if it holds data from the same cache line that the CLFLUSHOPT is accessing. Store buffers holding any other addresses can continue to "cache" their stored data until some other mechanism forces them to push that data to the L1 Data Cache (where it becomes visible to all agents in the coherence fabric).
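Because CLFLUSHOPT is only weakly ordered, code that needs the flushes to be ordered before later stores typically follows a batch of them with a single SFENCE. Here is a minimal sketch of that pattern, assuming x86-64, GCC or Clang, and 64-byte lines. The function names are invented for this example; the CPUID check uses leaf 7, subleaf 0, EBX bit 23, which is where CLFLUSHOPT support is reported, and the code falls back to CLFLUSH when the bit is clear.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <cpuid.h>       /* __get_cpuid_count (GCC/Clang) */
#include <immintrin.h>   /* _mm_clflush, _mm_clflushopt, _mm_sfence */

#define LINE_BYTES 64    /* assumed cache line size */

/* CPUID.(EAX=7,ECX=0):EBX bit 23 reports CLFLUSHOPT support. */
static int cpu_has_clflushopt(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (int)((ebx >> 23) & 1u);
}

/* Compiled with CLFLUSHOPT enabled; only called after the CPUID check. */
__attribute__((target("clflushopt")))
static void flush_lines_opt(uintptr_t first, size_t nlines)
{
    for (size_t i = 0; i < nlines; i++)
        _mm_clflushopt((void *)(first + i * LINE_BYTES));
}

static void flush_lines_legacy(uintptr_t first, size_t nlines)
{
    for (size_t i = 0; i < nlines; i++)
        _mm_clflush((const void *)(first + i * LINE_BYTES));
}

/* Hypothetical helper: flush every line overlapping [buf, buf + len),
   then fence once. Returns the number of lines flushed. */
size_t flush_range_weak(const void *buf, size_t len)
{
    assert(buf != NULL);
    if (len == 0)
        return 0;

    uintptr_t mask  = ~(uintptr_t)(LINE_BYTES - 1);
    uintptr_t first = (uintptr_t)buf & mask;
    uintptr_t last  = ((uintptr_t)buf + len - 1) & mask;
    size_t nlines   = (size_t)((last - first) / LINE_BYTES) + 1;

    if (cpu_has_clflushopt())
        flush_lines_opt(first, nlines);    /* unordered among themselves */
    else
        flush_lines_legacy(first, nlines); /* ordered among themselves   */

    _mm_sfence();  /* one fence orders the whole batch before later stores */
    return nlines;
}
```

The design choice here is to pay for ordering once per batch rather than once per line. Whether that translates into a measurable gain depends on the implementation details discussed above.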