Which processor are you targeting, and which optimization options did you use? Is it the default -mSSE2?
Besides the clflush cycles, have you accounted for the rdtsc latency?
The compiler's optimizer may reorder instructions according to the instruction latency/throughput of the micro-architecture it is targeting, so different targets can produce different instruction orders.
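To make the reordering concern concrete, here is a minimal sketch of one way to pin a timed region in place. It assumes a GCC-compatible compiler (GCC, Clang, MinGW) that provides `<x86intrin.h>`, and SSE2 so that `_mm_lfence` is available. `COMPILER_BARRIER`, `touch_byte` and `time_region` are names made up for this illustration, not something from the original test program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_lfence (GCC-compatible header, assumption) */

/* Emits no instruction; it only stops the compiler from moving memory
   accesses across this point. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

/* Hypothetical workload: writes one byte so the call has a visible effect. */
static void touch_byte(void *p)
{
    *(volatile unsigned char *)p = 1;
}

/* Hypothetical helper: returns the TSC ticks spent around fn(arg). */
static uint64_t time_region(void (*fn)(void *), void *arg)
{
    assert(fn != NULL);
    _mm_lfence();              /* earlier instructions complete first */
    uint64_t t0 = __rdtsc();
    _mm_lfence();              /* timed work does not start before the read */
    COMPILER_BARRIER();
    fn(arg);
    COMPILER_BARRIER();
    _mm_lfence();              /* timed work completes before the second read */
    uint64_t t1 = __rdtsc();
    _mm_lfence();
    return t1 - t0;
}
```

Calling `time_region(touch_byte, &some_byte)` returns a tick count that still includes the cost of the fences and of rdtsc itself. The compiler barrier constrains memory accesses only, so it is still worth checking the generated assembly.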
Comparing #4 and #5: regardless of the instruction reordering, I notice that subtracting the address of the first rdtsc from the address of the second gives an instruction byte count (hex) of 0x4A for Intel and 0x47 for MinGW. IOW there are 3 extra bytes not accounted for.
There is an option to display the instruction byte codes; can you enable that?
Also, there may be a minor flaw in your test program. Prior to your first rdtsc, you are issuing a series of prefetches. From your code, it is not clear:
a) if an alignment issue causes the array to spill over an extra cache line in one scenario and not the other(s).
b) (possibly more important) if the prefetches are still in flight when the clflush is issued in one scenario and not the other.
As for b), I suggest you manipulate the prefetched data in a manner that ensures the data has reached L1 before you start your timed run of clflushes.
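Points a) and b) can be sketched roughly as follows. This is an illustration under assumptions, not the original test program: it assumes 64-byte cache lines, a GCC-compatible `<x86intrin.h>`, and SSE2. `lines_spanned` and `timed_clflush` are hypothetical helper names. The idea is to count how many lines the array really covers, and to load one byte from every line after the prefetches so that the data must have arrived before the first rdtsc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_prefetch, _mm_clflush, fences */

#define CACHE_LINE 64    /* assumption: 64-byte cache lines */

/* a) Number of cache lines covered by the byte range [p, p + bytes). */
static size_t lines_spanned(const void *p, size_t bytes)
{
    uintptr_t first = (uintptr_t)p / CACHE_LINE;
    uintptr_t last  = ((uintptr_t)p + bytes - 1) / CACHE_LINE;
    return (size_t)(last - first + 1);
}

/* b) Prefetch, then read each line, then time the clflush run. */
static uint64_t timed_clflush(const unsigned char *buf, size_t bytes)
{
    assert(bytes > 0);
    assert((uintptr_t)buf % CACHE_LINE == 0);   /* no spill into an extra line */

    for (size_t i = 0; i < bytes; i += CACHE_LINE)
        _mm_prefetch((const char *)(buf + i), _MM_HINT_T0);

    /* A volatile load cannot be dropped by the compiler and cannot
       complete until the line has actually been fetched. */
    for (size_t i = 0; i < bytes; i += CACHE_LINE)
        (void)*(const volatile unsigned char *)(buf + i);

    _mm_mfence();
    _mm_lfence();
    uint64_t t0 = __rdtsc();
    for (size_t i = 0; i < bytes; i += CACHE_LINE)
        _mm_clflush(buf + i);
    _mm_mfence();              /* wait for the flushes */
    _mm_lfence();
    uint64_t t1 = __rdtsc();
    return t1 - t0;
}
```

For a 256-byte array, `lines_spanned` gives 4 when the array starts on a line boundary and 5 when it starts one byte later, which is exactly the kind of difference that could separate two builds.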
If you place the first rdtsc at the end of a cache line (so that the clflushes begin in the next line), and if you place your performance test code in a loop, what is the timing excluding the first trip through the test code? And what is the timing of, say, the 10th iteration, IOW after you are assured the code sequence is in the L1 instruction cache? Note that code preceding and following the timed interval must not evict the instructions from the L1 instruction cache.
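The loop part of this suggestion might look like the sketch below. It records every trip so that the first one can be dropped and a later one (for example the 10th) inspected. It does not attempt to place the rdtsc at the end of a cache line, which would need control over code alignment. The buffer size, the 64-byte line size and the names `one_trip`, `run_trips` and `min_excluding_first` are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_clflush, fences (GCC-compatible header) */

#define CACHE_LINE 64    /* assumption: 64-byte cache lines */
#define N_LINES    8     /* assumption: arbitrary small test size */

static _Alignas(CACHE_LINE) unsigned char data[N_LINES * CACHE_LINE];

/* One trip: bring the lines into cache, then time flushing them. */
static uint64_t one_trip(void)
{
    for (size_t i = 0; i < sizeof data; i += CACHE_LINE)
        (void)*(volatile unsigned char *)(data + i);
    _mm_mfence();
    _mm_lfence();
    uint64_t t0 = __rdtsc();
    for (size_t i = 0; i < sizeof data; i += CACHE_LINE)
        _mm_clflush(data + i);
    _mm_mfence();
    _mm_lfence();
    uint64_t t1 = __rdtsc();
    return t1 - t0;
}

/* Run the same code `trips` times, keeping each measurement. */
static void run_trips(uint64_t *ticks, size_t trips)
{
    assert(ticks != NULL && trips > 0);
    for (size_t i = 0; i < trips; ++i)
        ticks[i] = one_trip();
}

/* Smallest measurement, ignoring trip 0 (cold instruction cache). */
static uint64_t min_excluding_first(const uint64_t *ticks, size_t trips)
{
    assert(trips > 1);
    uint64_t best = ticks[1];
    for (size_t i = 2; i < trips; ++i)
        if (ticks[i] < best)
            best = ticks[i];
    return best;
}
```

With `uint64_t t[10]; run_trips(t, 10);`, `t[0]` is the cold first trip, `t[9]` is the 10th iteration, and `min_excluding_first(t, 10)` gives a warm-cache figure to compare between compilers.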
I am pretty sure that the overheads of multiple calls to RDTSC can't "cancel out" as suggested in message 14 above (https://software.intel.com/en-us/forums/intel-c-compiler/topic/697062#comment-1885846). This would require that the first call to RDTSC return the cycle count at the end of its execution, while the second call to RDTSC would have to return the cycle count at the beginning of its execution. This does not make sense.
I looked at the overlap of RDTSC and RDTSCP instructions with user code in some detail in a new post at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring...
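One simple way to see that the RDTSC overhead does not cancel is to time an empty interval. The sketch below is only an illustration under assumptions (GCC-compatible `<x86intrin.h>`, SSE2 for `_mm_lfence`; `rdtsc_baseline` is a made-up name). It takes the minimum over many back-to-back readings, and the actual value depends on the processor.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_lfence (GCC-compatible header, assumption) */

/* Smallest tick difference seen between two consecutive TSC reads
   with nothing but a fence in between. */
static uint64_t rdtsc_baseline(int trials)
{
    assert(trials > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < trials; ++i) {
        _mm_lfence();
        uint64_t t0 = __rdtsc();
        _mm_lfence();          /* keep the two reads from overlapping */
        uint64_t t1 = __rdtsc();
        _mm_lfence();
        uint64_t d = t1 - t0;
        if (d < best)
            best = d;
    }
    return best;
}
```

If the overheads cancelled, `rdtsc_baseline(1000)` would be expected to come out as zero. A nonzero minimum is the part of every measured interval that belongs to the measurement itself, and it can be subtracted as a rough correction.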