Minimal Averaged Delta of Intel RDTSC and RDTSCP instructions

SergeyKostrov
Valued Contributor II
*** Minimal Averaged Delta of Intel RDTSC and RDTSCP instructions ***
SergeyKostrov
Valued Contributor II
[ CPU: Ivy Bridge - Watcom C++ compiler - 32-bit ]
[ Sub-Test002.01.A - RDTSC ] - Started
TSC Minimal Averaged Delta is 25.00 clock cycles
TSC Minimal Averaged Delta is 25.40 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
TSC Minimal Averaged Delta is 24.60 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
TSC Minimal Averaged Delta is 25.40 clock cycles
TSC Minimal Averaged Delta is 25.40 clock cycles
TSC Minimal Averaged Delta is 26.20 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
[ Sub-Test002.01.A - RDTSC ] - Completed
[ Sub-Test002.01.B - RDTSC ] - Not Supported
[ Sub-Test002.01.C - RDTSCP ] - Not Supported
[ Sub-Test002.01.D - RDTSCP ] - Not Supported
SergeyKostrov
Valued Contributor II
[ CPU: Ivy Bridge - Watcom C++ compiler - 64-bit ]
[ Sub-Test002.01.A - RDTSC ] - Started
TSC Minimal Averaged Delta is 26.60 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
TSC Minimal Averaged Delta is 25.40 clock cycles
TSC Minimal Averaged Delta is 25.40 clock cycles
TSC Minimal Averaged Delta is 25.40 clock cycles
TSC Minimal Averaged Delta is 25.40 clock cycles
TSC Minimal Averaged Delta is 25.40 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
TSC Minimal Averaged Delta is 25.40 clock cycles
TSC Minimal Averaged Delta is 27.00 clock cycles
[ Sub-Test002.01.A - RDTSC ] - Completed
[ Sub-Test002.01.B - RDTSC ] - Not Supported
[ Sub-Test002.01.C - RDTSCP ] - Not Supported
[ Sub-Test002.01.D - RDTSCP ] - Not Supported
SergeyKostrov
Valued Contributor II
Examples of disassembled code for the RDTSC and RDTSCP instructions will be posted later.
SergeyKostrov
Valued Contributor II
[ An example of disassembled code for a test with the RDTSC instruction - 32-bit ]
...
0024AA47 rdtsc
0024AA49 mov ecx, eax
0024AA4B rdtsc
0024AA4D rdtsc
0024AA4F rdtsc
0024AA51 rdtsc
0024AA53 rdtsc
0024AA55 rdtsc
0024AA57 rdtsc
0024AA59 rdtsc
0024AA5B rdtsc
0024AA5D rdtsc
0024AA5F sub eax, ecx
...
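For readers who want to reproduce a test like this, here is a minimal C sketch of the pattern in the listing above: one initial RDTSC, ten more back-to-back reads, and the difference divided by ten. It assumes GCC or Clang on x86 with the `__rdtsc()` intrinsic from `<x86intrin.h>`. The helper names, and the reading of "minimal averaged delta" as the smallest per-read average over many runs, are assumptions for illustration and not the original test source.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc() on GCC/Clang; MSVC declares it in <intrin.h> */

#define TIMED_READS 10  /* RDTSC reads after the initial one, as in the listing */

/* Hypothetical helper: average TSC ticks per read over one burst of reads. */
static double tsc_averaged_delta(void)
{
    uint64_t start = __rdtsc();
    uint64_t end = start;

    /* Straight-line reads, so no loop overhead sits between them. */
    end = __rdtsc();
    end = __rdtsc();
    end = __rdtsc();
    end = __rdtsc();
    end = __rdtsc();
    end = __rdtsc();
    end = __rdtsc();
    end = __rdtsc();
    end = __rdtsc();
    end = __rdtsc();

    return (double)(end - start) / TIMED_READS;
}

/* Hypothetical helper: smallest averaged delta seen over `runs` bursts. */
static double tsc_min_averaged_delta(int runs)
{
    assert(runs > 0);
    double best = tsc_averaged_delta();
    for (int i = 1; i < runs; ++i) {
        double d = tsc_averaged_delta();
        if (d < best)
            best = d;
    }
    return best;
}
```

Calling `tsc_min_averaged_delta(1000)` returns a value in TSC ticks. On processors with an invariant TSC these are reference ticks, which are not necessarily equal to core clock cycles.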
SergeyKostrov
Valued Contributor II
[ An example of disassembled code for a test with the RDTSCP instruction - 64-bit ]
...
000000013F652A81 rdtscp
000000013F652A84 mov rbx, rax
000000013F652A87 rdtscp
000000013F652A8A rdtscp
000000013F652A8D rdtscp
000000013F652A90 rdtscp
000000013F652A93 rdtscp
000000013F652A96 rdtscp
000000013F652A99 rdtscp
000000013F652A9C rdtscp
000000013F652A9F rdtscp
000000013F652AA2 rdtscp
000000013F652AA5 sub rax, rbx
...
Bernard
Valued Contributor I

Interesting results.

Agner Fog's manuals give a different figure for RDTSC throughput, a bit higher than your latency results.

Unfortunately, he does not provide any latency figure, in CPU clock cycles, for RDTSC.

Bernard
Valued Contributor I

Why don't you serialize the execution of the RDTSC micro-ops?

AFAIK RDTSC is not a serializing instruction, so in theory several of them can execute at the same time and at least partially overlap in the pipeline.
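One common way to act on this point is to fence each read. The sketch below is only an illustration, assuming GCC or Clang on an x86 target with SSE2, and relying on Intel's documented LFENCE behavior (it does not execute until prior instructions have completed locally, and later instructions do not begin until it completes). The helper names are hypothetical. It measures the whole LFENCE/RDTSC/LFENCE sequence rather than bare RDTSC, so the numbers should come out higher than in an unfenced test.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc() and _mm_lfence() on GCC/Clang */

/* Hypothetical helper: read the TSC with a fence on each side. */
static inline uint64_t tsc_read_fenced(void)
{
    _mm_lfence();            /* earlier instructions complete before the read */
    uint64_t t = __rdtsc();
    _mm_lfence();            /* later instructions wait until the read is done */
    return t;
}

/* Hypothetical helper: average TSC ticks per fenced read over `reads` reads. */
static double tsc_fenced_delta(int reads)
{
    assert(reads > 0);
    uint64_t start = tsc_read_fenced();
    uint64_t end = start;
    for (int i = 0; i < reads; ++i)
        end = tsc_read_fenced();
    return (double)(end - start) / reads;
}
```

With the fences in place consecutive reads cannot overlap, so `tsc_fenced_delta(10)` is closer to a per-read latency of the fenced sequence than to a throughput figure.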

SergeyKostrov
Valued Contributor II
>>...Agner Fog's manuals provide different result for RDTSC throughput a bit higher than your results of latency.
That is possible, because it looks like he used a CPU of a different generation. Please post those RDTSC and RDTSCP numbers for review, along with the CPU information.
SergeyKostrov
Valued Contributor II
>>Why do not you serialize uop of RDTSC execution?
>>
>>Afaik RDTSC is not serializing instruction so in theory multiple of them can be executed at the same
>>time and at least partially overlap pipelined execution.
That is why I tried to fill a CPU pipeline with at least 10 RDTSC or RDTSCP instructions.
Bernard
Valued Contributor I

Here is the RDTSC reciprocal throughput figure as stated by Agner Fog.

CPU architecture: Ivy Bridge. RDTSC reciprocal throughput: 27 CPU clock cycles.

Reference: p. 175

http://www.agner.org/optimize/instruction_tables.pdf
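
For a quick comparison with the 27-cycle figure, the ten values from the 32-bit Ivy Bridge log earlier in this thread can be summarized with a few lines of C. This is only a sketch: the array holds the posted numbers, while the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* "TSC Minimal Averaged Delta" values from the 32-bit Ivy Bridge log in this thread. */
static const double ivb32_deltas[] = {
    25.00, 25.40, 27.00, 24.60, 27.00,
    27.00, 25.40, 25.40, 26.20, 27.00
};

struct delta_summary { double min, max, mean; };

/* Hypothetical helper: min, max and arithmetic mean of n samples (n > 0). */
static struct delta_summary summarize(const double *v, size_t n)
{
    assert(n > 0);
    struct delta_summary s = { v[0], v[0], 0.0 };
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        if (v[i] < s.min) s.min = v[i];
        if (v[i] > s.max) s.max = v[i];
        sum += v[i];
    }
    s.mean = sum / (double)n;
    return s;
}
```

For these ten values, `summarize(ivb32_deltas, 10)` gives a minimum of 24.60, a maximum of 27.00 and a mean of 26.00, so the 27-cycle table value coincides with the top of the measured range.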

 

Bernard
Valued Contributor I

>>>That is why I tried to fill a CPU pipeline with at least 10 RDTSC or RDTSCP instructions.>>>

I am still puzzled by the probable (hardware-level) pipelined execution of at least some of those 10 micro-ops. I will try to find information on Google Patents that may shed some light on the proposed (patented) implementation of the RDTSC instruction.

 

Bernard
Valued Contributor I

I found an Intel patent titled "Apparatus for monitoring the performance of a microprocessor", but it contains no clear information about pipelined reads of the TSC.

Link to the aforementioned patent:

https://patents.google.com/patent/US5657253A/en?q=time+stamp+counter&assignee=Intel

SergeyKostrov
Valued Contributor II
>>Have you tried this experiment on v4 or v3 cpus? In particular E5-2699 v3 and E5-2699 v4?
Here are results of my tests for Intel Xeon Phi Processor 7210:
http://ark.intel.com/products/94033/Intel-Xeon-Phi-Processor-7210-16GB-1_30-GHz-64-core
Intel Xeon Phi Processor 7210 ( 16GB, 1.30 GHz, 64 core )
Processor name : Intel(R) Xeon Phi(TM) 7210
Packages (sockets) : 1
Cores : 64
Processors (CPUs) : 256
Cores per package : 64
Threads per core : 4
[ Output for RDTSC instruction ]
...
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 37.70 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 37.70 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
Access Time to TSC: 36.40 clock cycles
...