Solved: Re: Measure the execution time using RDTSC

AHoyle · ‎03-03-2022

Hi

I have been trying to use RDTSC and RDTSCP to measure the execution time of some code under test. I found the article “How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures”, September 2010 (www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf)

I have tried to implement the suggested test method and I have a few questions. The paper suggests the following steps:

    CPUID
    RDTSC
    mov edx, %0
    mov eax, %1
    ** Code under test **
    RDTSCP
    mov edx, %2
    mov eax, %3
    CPUID
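
For readers following along, here is a minimal sketch of these steps as GCC-style inline assembly. It assumes an x86-64 target, a GCC or Clang compiler, and a CPU that supports RDTSCP. The helper names tsc_begin, tsc_end and combine_tsc are illustrative and not taken from the paper, and the "xor" that selects CPUID leaf 0 is an addition to the listed steps.

```c
#include <stdint.h>

/* Combine the EDX (high) and EAX (low) halves into one 64-bit count. */
static inline uint64_t combine_tsc(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

/* Start of the window: CPUID serializes, then RDTSC reads the counter. */
static inline uint64_t tsc_begin(void)
{
    uint32_t hi, lo;
    __asm__ volatile("xor %%eax, %%eax\n\t"   /* CPUID leaf 0 (added here) */
                     "cpuid\n\t"
                     "rdtsc\n\t"
                     "mov %%edx, %0\n\t"
                     "mov %%eax, %1\n\t"
                     : "=r"(hi), "=r"(lo)
                     :
                     : "rax", "rbx", "rcx", "rdx", "cc", "memory");
    return combine_tsc(hi, lo);
}

/* End of the window: RDTSCP reads the counter, the result is saved,
   then CPUID serializes so later code cannot drift into the window. */
static inline uint64_t tsc_end(void)
{
    uint32_t hi, lo;
    __asm__ volatile("rdtscp\n\t"
                     "mov %%edx, %0\n\t"
                     "mov %%eax, %1\n\t"
                     "xor %%eax, %%eax\n\t"   /* CPUID leaf 0 (added here) */
                     "cpuid\n\t"
                     : "=r"(hi), "=r"(lo)
                     :
                     : "rax", "rbx", "rcx", "rdx", "cc", "memory");
    return combine_tsc(hi, lo);
}
```

Typical use would be to call tsc_begin(), run the code under test, call tsc_end(), and subtract the two 64-bit values.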

I understand that the CPUID instructions provide full serialization, preventing other instructions from being reordered across them. The “mov” instructions copy the registers to appropriate locations so that the difference between the two readings can be calculated later. When I test these steps I get quite a bit of variation in the measured timings. If I replace the RDTSC with RDTSCP I get much better results, which suggests that the RDTSC instruction is not waiting for the first CPUID to complete before it starts to read the TSC. Is this correct?

I understand that RDTSCP waits for the previous instructions to finish, hence replacing the RDTSC with an RDTSCP seems to work.

What is preventing the instructions in my code under test from moving between the first CPUID and RDTSCP instructions?

Does the RDTSCP prevent instructions moving before it?

And if it does, do I need the first CPUID instruction?

My processor is an Intel Atom E3950

Sorry for such a long question.

Kindest Regards.

McCalpinJohn · ‎03-04-2022

CPUID is a very heavy-weight instruction -- 200 cycles? It does serialize everything, but at a very high cost.

This can be a complicated subject, but I think this writeup is still accurate:

https://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timing-short-code-sections-on-intel-processors/
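
A rough way to check the CPUID cost on a particular machine is to time it directly. The following is only a sketch, assuming x86-64 with GCC or Clang; the names fenced_tsc, cpuid_leaf0 and cpuid_min_ticks are hypothetical. The result is in TSC ticks, which are not necessarily core clock cycles, and it includes the overhead of the two timer reads. Inside a virtual machine CPUID traps to the hypervisor, so the number can be far larger than on bare metal.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp, _mm_lfence (GCC/Clang) */

/* Read the TSC, then fence so later instructions wait for the read. */
static inline uint64_t fenced_tsc(void)
{
    unsigned int aux;                 /* receives IA32_TSC_AUX, unused */
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}

/* Execute CPUID leaf 0; volatile asm keeps the compiler from dropping it. */
static inline void cpuid_leaf0(void)
{
    uint32_t a = 0, b, c = 0, d;
    __asm__ volatile("cpuid"
                     : "+a"(a), "=b"(b), "+c"(c), "=d"(d)
                     :
                     : "memory");
}

/* Smallest TSC delta observed across one CPUID over `runs` trials. */
static uint64_t cpuid_min_ticks(int runs)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t t0 = fenced_tsc();
        cpuid_leaf0();
        uint64_t t1 = fenced_tsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Taking the minimum over many runs filters out interrupts and cache misses; calling cpuid_min_ticks(1000) and printing the result is enough for a first impression.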

AHoyle · ‎03-09-2022

Hi John

Thank you for your response. I am not too worried about the duration of the CPUID instruction, as I think it is outside the measurement window and the speed of the overall test is not that important. Reading the writeup you suggested does show that this subject is much more complex than I had appreciated. I think I will add an LFENCE instruction after the first RDTSCP instruction to try to prevent my code under test from executing before the RDTSCP has read the ‘start’ time. However, I don’t think I am going to have the time to study this problem in enough detail to fully understand it, so for now I will have to accept some variation in my test results. I just can’t justify spending much more time on this.

Thanks again for your help.

Alastair
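
The LFENCE idea mentioned above can be sketched with compiler intrinsics instead of hand-written assembly. The Intel SDM documents that RDTSCP waits for earlier instructions to execute but that later instructions may begin before the counter is read, and that no later instruction begins execution until an LFENCE completes. This is a sketch only, assuming x86-64 with GCC or Clang; tsc_read_fenced, code_under_test and min_ticks are hypothetical names, and code_under_test is just a stand-in workload.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp, _mm_lfence (GCC/Clang) */

/* RDTSCP waits for earlier instructions; the LFENCE after it keeps
   later instructions from starting before the counter has been read. */
static inline uint64_t tsc_read_fenced(void)
{
    unsigned int aux;                 /* receives IA32_TSC_AUX, unused */
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}

/* Hypothetical stand-in for the code under test. */
static void code_under_test(void)
{
    volatile int x = 0;
    for (int i = 0; i < 100; i++)
        x += i;
}

/* Smallest TSC delta observed for fn over `runs` trials. */
static uint64_t min_ticks(void (*fn)(void), int runs)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; i++) {
        uint64_t t0 = tsc_read_fenced();
        fn();
        uint64_t t1 = tsc_read_fenced();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Reporting the minimum (or a low percentile) of many runs is one common way to live with the remaining variation, since interrupts and cache effects only ever add time.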

nubfactor · ‎05-27-2025

Hi All,

Sorry to be bringing up old things, but does anyone have a copy of the PDF ia-32-ia-64-benchmark-code-execution-paper.pdf as linked in this original post?

Apart from the link not working, I've tried various ways of finding versions. I am quite interested to read this.

Thank you!

Matt/nubfactor

nubfactor · ‎05-28-2025

A version of the file has been received. Thank you A.H.