SergeyKostrov · ‎10-05-2016

*** Minimal Averaged Delta of Intel RDTSC and RDTSCP instructions ***

SergeyKostrov · ‎10-07-2016

[ CPU: Ivy Bridge - Watcom C++ compiler - 32-bit ]
[ Sub-Test002.01.A - RDTSC ] - Started
TSC Minimal Averaged Delta is 25.00 clock cycles
TSC Minimal Averaged Delta is 25.40 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
TSC Minimal Averaged Delta is 24.60 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
TSC Minimal Averaged Delta is 25.40 clock cycles
TSC Minimal Averaged Delta is 25.40 clock cycles
TSC Minimal Averaged Delta is 26.20 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
[ Sub-Test002.01.A - RDTSC ] - Completed
[ Sub-Test002.01.B - RDTSC ] - Not Supported
[ Sub-Test002.01.C - RDTSCP ] - Not Supported
[ Sub-Test002.01.D - RDTSCP ] - Not Supported

SergeyKostrov · ‎10-07-2016

[ CPU: Ivy Bridge - Watcom C++ compiler - 64-bit ]
[ Sub-Test002.01.A - RDTSC ] - Started
TSC Minimal Averaged Delta is 26.60 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
TSC Minimal Averaged Delta is 25.40 clock cycles
TSC Minimal Averaged Delta is 25.40 clock cycles
TSC Minimal Averaged Delta is 25.40 clock cycles
TSC Minimal Averaged Delta is 25.40 clock cycles
TSC Minimal Averaged Delta is 25.40 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
TSC Minimal Averaged Delta is 25.40 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
[ Sub-Test002.01.A - RDTSC ] - Completed
[ Sub-Test002.01.B - RDTSC ] - Not Supported
[ Sub-Test002.01.C - RDTSCP ] - Not Supported
[ Sub-Test002.01.D - RDTSCP ] - Not Supported
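The logs above report the RDTSCP sub-tests as Not Supported without saying whether the limit comes from the CPU, the compiler, or the test itself. One way to rule out the CPU is to query CPUID, which reports RDTSCP support in leaf 0x80000001, EDX bit 27. The following is only a sketch, assuming an x86 target and a GCC or Clang toolchain that ships `<cpuid.h>`; the function name `cpu_has_rdtscp` is made up for illustration and is not part of the test suite shown in this thread.

```c
#include <cpuid.h>    /* GCC/Clang helper header (toolchain assumption) */
#include <stdbool.h>

/* Hypothetical helper: returns true if CPUID reports RDTSCP support
   (extended leaf 0x80000001, EDX bit 27). */
static bool cpu_has_rdtscp(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* __get_cpuid returns 0 when the requested leaf is not available. */
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return false;

    return ((edx >> 27) & 1u) != 0;
}
```

A test harness could call `cpu_has_rdtscp()` once at startup and skip the RDTSCP sub-tests only when it returns false.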

SergeyKostrov · ‎10-07-2016

Examples of disassembled code for the RDTSC and RDTSCP instructions will be posted later.

SergeyKostrov · ‎10-10-2016

[ An example of disassembled codes for a test with RDTSC instruction - 32-bit ]
...
0024AA47 rdtsc
0024AA49 mov ecx, eax
0024AA4B rdtsc
0024AA4D rdtsc
0024AA4F rdtsc
0024AA51 rdtsc
0024AA53 rdtsc
0024AA55 rdtsc
0024AA57 rdtsc
0024AA59 rdtsc
0024AA5B rdtsc
0024AA5D rdtsc
0024AA5F sub eax, ecx
...

SergeyKostrov · ‎10-10-2016

[ An example of disassembled codes for a test with RDTSCP instruction - 64-bit ]
...
000000013F652A81 rdtscp
000000013F652A84 mov rbx, rax
000000013F652A87 rdtscp
000000013F652A8A rdtscp
000000013F652A8D rdtscp
000000013F652A90 rdtscp
000000013F652A93 rdtscp
000000013F652A96 rdtscp
000000013F652A99 rdtscp
000000013F652A9C rdtscp
000000013F652A9F rdtscp
000000013F652AA2 rdtscp
000000013F652AA5 sub rax, rbx
...
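The listings above show one start read, ten more back-to-back reads, and a final subtraction of the start value. The test source itself is not posted in this thread, so the following C sketch is only a guess at how such a sequence could be produced and reduced to a "minimal averaged delta". It assumes an x86 target with GCC or Clang (`__rdtsc()` from `<x86intrin.h>`; MSVC would use `<intrin.h>`), assumes the delta is divided by the 10 reads, and assumes the minimum is taken over repeated samples. The names `averaged_delta` and `min_averaged_delta` are hypothetical.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang (toolchain assumption) */

#define READS_PER_SAMPLE 10   /* the 10 reads after the start read in the listing */

/* One sample: a start read, 10 further back-to-back reads, then the
   difference between the last and the first value divided by 10. */
static double averaged_delta(void)
{
    uint64_t start = __rdtsc();
    uint64_t last;

    last = __rdtsc();
    last = __rdtsc();
    last = __rdtsc();
    last = __rdtsc();
    last = __rdtsc();
    last = __rdtsc();
    last = __rdtsc();
    last = __rdtsc();
    last = __rdtsc();
    last = __rdtsc();

    return (double)(last - start) / READS_PER_SAMPLE;
}

/* Assumed reduction: keep the smallest sample seen over `samples` runs,
   which filters out runs disturbed by interrupts or cache misses. */
static double min_averaged_delta(int samples)
{
    double best = averaged_delta();

    for (int i = 1; i < samples; ++i) {
        double d = averaged_delta();
        if (d < best)
            best = d;
    }
    return best;
}
```

Calling `min_averaged_delta(1000)` several times in a row and printing each result would give output in the same shape as the logs in this thread; the actual numbers depend on the CPU, the compiler, and whether the code runs inside a virtual machine.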

Bernard · ‎12-09-2016

Interesting results.

Agner Fog's manuals give a different result: his RDTSC throughput figure is a bit higher than your latency results.

Unfortunately, he does not provide any latency data for RDTSC in CPU clock cycles.

Bernard · ‎12-09-2016

Why don't you serialize the execution of the RDTSC uops?

Afaik RDTSC is not a serializing instruction, so in theory several of them can execute at the same time and their pipelined execution can at least partially overlap.
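For readers unfamiliar with the serialization point, here is a minimal sketch of one common way to order TSC reads. It is not what the test in this thread does. It relies on two behaviours documented in the Intel SDM: an LFENCE placed immediately before RDTSC makes the read wait until earlier instructions have executed, and RDTSCP itself waits for earlier instructions but does not stop later ones from starting early, hence the trailing LFENCE. The sketch assumes x86-64 with GCC or Clang and a CPU that supports RDTSCP; the function names are hypothetical.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc(), __rdtscp(), _mm_lfence() on GCC/Clang */

/* Start of a timed region: the first LFENCE keeps RDTSC from executing
   before earlier instructions have completed; the second keeps later
   instructions from starting before the counter has been read. */
static inline uint64_t tsc_begin(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* End of a timed region: RDTSCP waits for earlier instructions to
   execute; the trailing LFENCE keeps later ones from moving ahead of it. */
static inline uint64_t tsc_end(void)
{
    unsigned int aux;              /* receives IA32_TSC_AUX; unused here */
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}

/* Cost in TSC ticks of an empty fenced region, i.e. the overhead that a
   serialized measurement adds on its own. */
static uint64_t fenced_pair_delta(void)
{
    uint64_t a = tsc_begin();
    uint64_t b = tsc_end();
    return b - a;
}
```

With fences in place the reads can no longer overlap, so the per-read cost measured this way is expected to be higher than the back-to-back figures reported above; the two numbers answer different questions (ordered latency versus sustained throughput).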

SergeyKostrov · ‎12-09-2016

>>...Agner Fog's manuals provide different result for RDTSC throughput a bit higher than your results of latency.

That is possible because it looks like he used a CPU of a different generation. Please post those RDTSC and RDTSCP numbers for review, together with the CPU information.

SergeyKostrov · ‎12-09-2016

>>Why do not you serialize uop of RDTSC execution?
>>
>>Afaik RDTSC is not serializing instruction so in theory multiple of them can be executed at the same
>>time and at least partially overlap pipelined execution.

That is why I tried to fill a CPU pipeline with at least 10 RDTSC or RDTSCP instructions.

Bernard · ‎12-11-2016

Here is the RDTSC reciprocal throughput as stated by Agner Fog.

CPU Arch: Ivy Bridge, RDTSC Reciprocal Throughput: 27 CPU clock cycles.

Reference: p. 175

http://www.agner.org/optimize/instruction_tables.pdf

Bernard · ‎12-11-2016

>>>That is why I tried to fill a CPU pipeline with at least 10 RDTSC or RDTSCP instructions.>>>

I am still puzzled by the probable (hardware-level) pipelined execution of at least some of the micro-ops from those 10 instructions. I will try to find some information in Google Patents that may shed some light on the proposed (patented) implementation of the RDTSC instruction.

Bernard · ‎12-11-2016

I have found an Intel patent titled "Apparatus for monitoring the performance of a microprocessor", but it contains no clear information about pipelined reads of the TSC.

Link to the aforementioned patent:

https://patents.google.com/patent/US5657253A/en?q=time+stamp+counter&assignee=Intel

SergeyKostrov · ‎12-12-2016

>>Have you tried this experiment on v4 or v3 cpus? In particular E5-2699 v3 and E5-2699 v4?

Here are results of my tests for Intel Xeon Phi Processor 7210:

http://ark.intel.com/products/94033/Intel-Xeon-Phi-Processor-7210-16GB-1_30-GHz-64-core

Intel Xeon Phi Processor 7210 ( 16GB, 1.30 GHz, 64 core )
Processor name : Intel(R) Xeon Phi(TM) 7210
Packages (sockets) : 1
Cores : 64
Processors (CPUs) : 256
Cores per package : 64
Threads per core : 4

[ Output for RDTSC instruction ]
...
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 37.70 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 37.70 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
...