Hi, Sergey
Which processor are you targeting, and which optimization options did you use? Is it the default -mSSE2?
Besides the clflush cycles, have you accounted for the rdtsc latency?
Compiler optimization may reorder instructions based on the instruction latency and throughput of the targeted microarchitecture.
Thanks.
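One way to put a number on the rdtsc latency asked about above is to read the counter twice in a row and keep the smallest difference seen. This is only a sketch: `rdtsc_min_overhead` is a hypothetical helper name, not something from Sergey's test program, and it assumes GCC, Clang, or the Intel compiler on x86 (`<x86intrin.h>`; MSVC provides `__rdtsc` in `<intrin.h>` instead).

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc (GCC/Clang/ICC; MSVC uses <intrin.h>) */

/* Hypothetical helper, not taken from the thread: smallest difference
 * observed between two adjacent RDTSC reads, in TSC ticks. Taking the
 * minimum filters out interrupts and other one-off disturbances. */
static uint64_t rdtsc_min_overhead(int trials)
{
    assert(trials > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < trials; ++i) {
        uint64_t t0 = __rdtsc();
        uint64_t t1 = __rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Calling `rdtsc_min_overhead(100000)` gives a rough lower bound on what the two reads contribute to any interval they bracket. The value depends on the processor and on whether the code runs in a virtual machine, so it should be measured on the target system rather than assumed.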
Sergey,
Comparing #4 and #5: regardless of the instruction reordering, I notice that subtracting the address of the first rdtsc from the address of the second gives an instruction byte count (hex) of 0x4A for Intel and 0x47 for MinGW. IOW there are 3 extra bytes not accounted for.
There is an option to display the instruction bytes; can you enable it?
Also, there may be a minor flaw in your test program. Prior to your first rdtsc, you issue a series of prefetches. From your code, it is not clear:
a) whether an alignment issue causes the array to spill over into an extra cache line in one scenario and not the other(s).
b) (possibly more important) whether the prefetches are still in flight when the clflush is issued in one scenario and not the other.
As for b), I suggest you manipulate the prefetched data in a manner that assures the data has reached L1 before you start your timed run of clflushes.
Jim Dempsey
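Both points can be sketched in a few lines of C. This is an illustration under stated assumptions, not Sergey's test program: the 64-byte line size, the buffer `buf`, its size, and the helper names `cache_lines_spanned` and `time_clflush_after_touch` are all hypothetical. For a), the helper counts how many cache lines a byte range covers, so the same check can be run on the array address in each build. For b), each line is read with an ordinary load (instead of a prefetch hint) and a fence is issued before the timed region, so the loads have completed before the first clflush.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_clflush, _mm_mfence (GCC/Clang/ICC) */

#define CACHE_LINE 64u   /* assumed cache line size in bytes */
#define NUM_LINES  16u   /* hypothetical buffer size, in lines */

/* Hypothetical test buffer, aligned so it starts on a line boundary. */
static _Alignas(CACHE_LINE) unsigned char buf[CACHE_LINE * NUM_LINES];

/* (a) Number of cache lines covered by the byte range [addr, addr + nbytes). */
static size_t cache_lines_spanned(uintptr_t addr, size_t nbytes)
{
    assert(nbytes > 0);
    return (size_t)((addr + nbytes - 1) / CACHE_LINE - addr / CACHE_LINE + 1);
}

/* (b) Read every line with a real load, fence, then time the clflushes. */
static uint64_t time_clflush_after_touch(void)
{
    /* Demand loads through a volatile pointer cannot be dropped by the
     * compiler, unlike a prefetch hint that the hardware may ignore. */
    for (size_t i = 0; i < sizeof buf; i += CACHE_LINE)
        (void)*(volatile unsigned char *)&buf[i];
    _mm_mfence();                      /* loads complete before timing starts */

    uint64_t t0 = __rdtsc();
    for (size_t i = 0; i < sizeof buf; i += CACHE_LINE)
        _mm_clflush(&buf[i]);
    _mm_mfence();                      /* wait for the flushes to finish */
    uint64_t t1 = __rdtsc();
    return t1 - t0;
}
```

For example, `cache_lines_spanned(60, 8)` is 2, because an 8-byte object starting at offset 60 straddles a line boundary, while the aligned `buf` covers exactly `NUM_LINES` lines. On 32-bit targets the intrinsics need SSE2 enabled at compile time.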
Sergey,
If you place the first rdtsc at the end of a cache line (so that the clflushes begin in the next line), and if you run your performance test code in a loop, what is the timing excluding the first trip through the test code? And what is the timing of, say, the 10th iteration, IOW after you are assured the code sequence is in the L1 instruction cache? Note that code preceding and following the timed interval must not evict the instructions from the L1 instruction cache.
Jim Dempsey
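The loop part of this suggestion might look like the sketch below. The buffer `data`, its size, and the helper names `time_clflush_iterations` and `min_excluding_first` are hypothetical, and the 64-byte line size is an assumption. Placing the first rdtsc at the end of a cache line is not attempted here, since that needs control over code alignment (for example in the assembler) that plain C does not offer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_clflush, _mm_mfence (GCC/Clang/ICC) */

#define CACHE_LINE 64u   /* assumed cache line size in bytes */
#define NUM_LINES  16u   /* hypothetical buffer size, in lines */

static _Alignas(CACHE_LINE) unsigned char data[CACHE_LINE * NUM_LINES];

/* Run the timed clflush sequence iters times, recording each trip.
 * The same instructions are executed on every trip, so from the second
 * trip on they should already be in the L1 instruction cache. */
static void time_clflush_iterations(uint64_t *cycles, int iters)
{
    assert(iters >= 2);
    for (int it = 0; it < iters; ++it) {
        /* Bring the lines back into the cache before each timed run. */
        for (size_t i = 0; i < sizeof data; i += CACHE_LINE)
            (void)*(volatile unsigned char *)&data[i];
        _mm_mfence();

        uint64_t t0 = __rdtsc();
        for (size_t i = 0; i < sizeof data; i += CACHE_LINE)
            _mm_clflush(&data[i]);
        _mm_mfence();
        uint64_t t1 = __rdtsc();

        cycles[it] = t1 - t0;
    }
}

/* Smallest timing, ignoring the first (cold) trip through the code. */
static uint64_t min_excluding_first(const uint64_t *cycles, int iters)
{
    assert(iters >= 2);
    uint64_t best = cycles[1];
    for (int it = 2; it < iters; ++it)
        if (cycles[it] < best)
            best = cycles[it];
    return best;
}
```

With `uint64_t cycles[10]; time_clflush_iterations(cycles, 10);`, the first-trip timing is `cycles[0]`, the 10th iteration is `cycles[9]`, and `min_excluding_first(cycles, 10)` summarizes the warm trips. Comparing `cycles[0]` against the rest shows how much of the first measurement is cold-start cost.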
I am pretty sure that the overheads of multiple calls to RDTSC can't "cancel out" as suggested in message 14 above (https://software.intel.com/en-us/forums/intel-c-compiler/topic/697062#comment-1885846). This would require that the first call to RDTSC return the cycle count at the end of its execution, while the second call to RDTSC would have to return the cycle count at the beginning of its execution. This does not make sense.
I looked at the overlap of RDTSC and RDTSCP instructions with user code in some detail in a new post at https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/697093#comment-1886115
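A small experiment makes the "does not cancel" point visible: time an interval that contains no user code at all. The sketch below is not taken from the linked post. The helper names `tsc_begin`, `tsc_end`, and `empty_interval_min` are hypothetical, and the fencing pattern (LFENCE around RDTSC at the start, RDTSCP followed by LFENCE at the end) is one common way to limit overlap with surrounding instructions on Intel processors. It assumes a CPU that supports RDTSCP and a GCC, Clang, or Intel compiler.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, __rdtscp, _mm_lfence (GCC/Clang/ICC) */

/* Start of a timed interval: the fences keep earlier instructions from
 * still being in flight and later ones from starting before the read. */
static inline uint64_t tsc_begin(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* End of a timed interval: RDTSCP waits for preceding instructions to
 * execute; the LFENCE keeps later instructions from starting early. */
static inline uint64_t tsc_end(void)
{
    unsigned int aux;                  /* receives IA32_TSC_AUX, unused here */
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}

/* Smallest tick count measured for an interval with nothing inside it. */
static uint64_t empty_interval_min(int trials)
{
    assert(trials > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < trials; ++i) {
        uint64_t t0 = tsc_begin();
        uint64_t t1 = tsc_end();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

If the overheads cancelled, `empty_interval_min` would return zero. In practice it returns a positive baseline, which can be recorded and subtracted from measurements of short code sequences, keeping in mind that the baseline itself varies between processors.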