
Synchronizing Time Stamp Counter

Roman_Oderov
Beginner
4,683 Views
Hello everyone! I have to synchronize time between processors in a multicore system, i.e. I have to calculate the TSC differences of all processors relative to one of them. I tried rdtsc(), but it returns the TSC of the current processor. Is there any way to get the TSC of a specific processor? Or maybe I can specify a processor ID somewhere and read the corresponding time stamp counter value? Thanks in advance, Roman
76 Replies
SergeyKostrov
Valued Contributor II
813 Views
Let's assume the Win32 API function 'QueryPerformanceCounter' is called on a multi-core system. What value is it going to return? The TSC of CPU1, CPU2, CPU3, etc.? Let's also assume that I don't explicitly set a CPU for execution. I'll do another set of tests and try to predict a TSC value for CPU2, for example.
SergeyKostrov
Valued Contributor II
813 Views
'...Intel guarantees that the time-stamp counter will not wraparound within 10 years after being reset...' I've calculated that for a CPU with a 3 GHz clock speed a wraparound of the 64-bit counter would take ~194 years.
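The wraparound estimate can be checked with a few lines of arithmetic. This is only a sketch: it assumes the TSC is a 64-bit counter that increments once per clock cycle at a constant 3 GHz, and the names `CLOCK_HZ` and `wraparound_years` are made up for this illustration.

```python
# Assumption: the TSC is a 64-bit counter incrementing once per clock cycle.
TSC_BITS = 64
CLOCK_HZ = 3_000_000_000               # 3 GHz, as in the post above
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year

def wraparound_years(clock_hz, bits=TSC_BITS):
    """Years until a counter of the given width overflows at clock_hz."""
    return (2 ** bits) / clock_hz / SECONDS_PER_YEAR

years = wraparound_years(CLOCK_HZ)
print(f"{years:.1f} years")
```

The result lands just under 195 years, which agrees with the ~194 figure. A faster clock shortens the period proportionally.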
Bernard
Valued Contributor I
813 Views
>>>>Let's assume a call to a Win32 API function 'QueryPerformanceCounter' has to be done on a multi-core system. What value is it going to return? A TSC of CPU1, CPU2, CPU3, etc? Let's also assume that I don't set a CPU for execution explicitly.>>> Probably the TSC value of the CPU that is executing the current thread, which in turn is executing the machine code of "QueryPerformanceCounter". So it could be an arbitrary CPU.
SergeyKostrov
Valued Contributor II
813 Views
>>...Well, in older processors I can't still rely on TSC's of different cores without manual synchronization... What about a RESET signal that sets a TSC to 0? Does it mean that on a multi-core system the RESET signal occurs at different times for different CPUs? - T0 for CPU0 - T0+some-delay1 for CPU1 - T0+some-delay2 for CPU2, etc? How is it possible? Could Intel Hardware Engineers clearly explain it?
SergeyKostrov
Valued Contributor II
813 Views
>>Probably TSC value of the CPU which executes current context thread which is in turn executing machine code of >>"QueryPerformanceCounter" .So it could be an arbitrary CPU.
I agree with that. However, I can't find any explanation in the Intel manuals for the following:
1. How many TSC registers exist on a multi-core system with many logical CPUs? Is there just one, shared between all logical cores? ( in that case TSCs are synchronized by default )
2. Does every logical CPU have its own independent TSC register? Could different TSCs have different values at some time Tn?
3. What about the case when a system has many physical CPUs and every physical CPU has at least two logical CPUs?
TimP
Honored Contributor III
813 Views
The Intel CPUs we've seen share the TSC resource between hyperthreads and share the bus clock among cores. Synchronization between sockets depends on action taken by the OS and on bus clock accuracy. For what it's worth, http://download.intel.com/embedded/software/IA/324264.pdf presents some recommendations for Linux, but the authors detract from their credibility by introducing confusion, such as careless switching between IA-64 and Intel64 terminology. It's not at all clear how QueryPerformanceCounter is implemented, but it hides some annoying differences among CPU families and covers up synchronization problems, as well as eliminating the question of serialization, at a large cost in overhead.
Bernard
Valued Contributor I
813 Views
>>>. How many TSC registers exist on a multi-core system with many logical CPUs?>>> By "logical CPU", do you mean HT? >>>These ticks cannot be measured on a logical-processor basis.>>> You cannot sample HT logical cores. >>>What about a case when a system has many physical CPUs and every physical CPU has at least two logical CPUs?>>> If I remember correctly, a logical CPU is an HT logical core with reduced resources. Every HT core has its own APIC and GP registers, but does not have its own vector SIMD units or x87 FPU.
Bernard
Valued Contributor I
813 Views
>>>It's not at all clear how QueryPerformanceCounter is implemented,>>> QueryPerformanceCounter could be disassembled and analyzed statically or dynamically in order to understand its implementation. I suppose that this function could use the HPET timer.
SergeyKostrov
Valued Contributor II
813 Views
>>...The Intel CPUs we've seen share tsc resource between hyperthreads and share the bus time clock among cores... Thank you, Tim. This is what I wanted to understand. Unfortunately, Intel's manuals don't describe all TSC-related issues in a multi-core environment.
SergeyKostrov
Valued Contributor II
813 Views
>>>>These ticks cannot be measured on a logical-processor basis. >> >>You cannot sample HT logical cores. I've created another test-case ( #4 ) and the source code will be provided.
SergeyKostrov
Valued Contributor II
813 Views
To Roman Oderov: Roman, I didn't try to synchronize RDTSC values for different CPUs, but I tried to evaluate the delays during execution of two threads on two different logical CPUs. If you execute Test-Case #4 you will get different numbers. Take into account that it is a non-deterministic test and the results are different every time.
SergeyKostrov
Valued Contributor II
813 Views
Here is a detailed high-level description of Test-Case #4:
- a computer system with a 32-bit Windows OS has one physical CPU with two logical CPUs
- an Event synchronization object is created in the Non-Signaled state
- two threads, '1' and '2', are created in the Suspended state
- the threads are resumed, but as soon as processing starts they wait for 5 seconds until the Event synchronization object changes its state to Signaled
- thread affinity masks are set: Thread '1' is assigned to CPU1 and Thread '2' is assigned to CPU2
- priorities of the current process and of the threads are changed to Real-Time
- after a 5-second delay the state of the Event synchronization object is changed to Signaled
- both threads begin processing ( almost at the same time! ) and each records 16 RDTSC values
- for every RDTSC value an ID ( iteration number ) is stored as well
- when processing is completed all allocated resources ( handles ) are closed
- if there are no processing errors, some statistics are displayed
- even if both threads are executed with Real-Time priorities on different logical CPUs, there are always differences in RDTSC values for iterations with the same ID
- the smallest difference I was able to record is ~708 nano-seconds ( 0.708 micro-seconds )
- the smallest average difference I was able to record is ~768.75 nano-seconds ( 0.76875 micro-seconds )
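The control flow of the test-case can be sketched in portable Python. This is not the attached source code, only an illustration under several assumptions: `time.perf_counter_ns()` stands in for RDTSC, the threads are started immediately instead of being created suspended, the 5-second delay is shortened, and the affinity and Real-Time priority steps are omitted because they are OS-specific. The names `worker` and `run_test_case` are hypothetical.

```python
import threading
import time

ITERATIONS = 16        # same sample count as in the description
START_DELAY_S = 0.05   # shortened stand-in for the 5-second delay

def worker(start_event, samples):
    # Block until the event becomes signaled, then record timestamps.
    start_event.wait()
    for iteration in range(ITERATIONS):
        # perf_counter_ns() replaces RDTSC here (assumption, not the original).
        samples.append((iteration, time.perf_counter_ns()))

def run_test_case():
    start_event = threading.Event()   # created in the non-signaled state
    samples_1, samples_2 = [], []
    thread_1 = threading.Thread(target=worker, args=(start_event, samples_1))
    thread_2 = threading.Thread(target=worker, args=(start_event, samples_2))
    thread_1.start()
    thread_2.start()
    time.sleep(START_DELAY_S)
    start_event.set()                 # release both threads together
    thread_1.join()
    thread_2.join()
    # One row per iteration ID: (id, thread 1 value, thread 2 value, difference)
    return [(i, a, b, b - a) for (i, a), (_, b) in zip(samples_1, samples_2)]

rows = run_test_case()
for iteration, value_1, value_2, diff in rows:
    print(f"{iteration:02d} {value_1} {value_2} {diff}")
```

Because of the Python interpreter lock the two threads do not truly run in parallel, so the differences printed here will be much larger than in a native RDTSC test. The sketch shows only the structure: one shared event, two waiting threads, and per-iteration differences.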
SergeyKostrov
Valued Contributor II
813 Views
The source code for Test-Case #4 is attached.
SergeyKostrov
Valued Contributor II
813 Views
Processing report log-file attached.

PS: This is how it looks like:

Application - ScaLibTestApp - WIN32_MSC - Release
Tests: Start
> Test1017 Start <
Sub-Test 59 ...
Test-Case 4 - Retrieving RDTSC values for CPUs - 2 Threads 1 and 2 created

Iteration  Thread 1        Thread 2        Difference
00         11613344836800  11613344835972  -828
01         11613344836992  11613344836188  -804
02         11613344837112  11613344836404  -708  <= Smallest difference
03         11613344837232  11613344836500  -732
04         11613344837340  11613344836596  -744
05         11613344837460  11613344836704  -756
06         11613344837580  11613344836824  -756
07         11613344837688  11613344836932  -756
08         11613344837796  11613344837076  -720
09         11613344837904  11613344837172  -732
10         11613344838012  11613344837292  -720
11         11613344838156  11613344837412  -744
12         11613344838300  11613344837520  -780
13         11613344838444  11613344837640  -804
14         11613344838600  11613344837760  -840
15         11613344838744  11613344837868  -876

Statistics:
Thread 1 started at 11613344836644
Thread 2 started at 11613344835096
Difference 1548
Thread 1 completed at 11613344838924
Thread 2 completed at 11613344838012
Difference 912
dwThreadAMPrev[0]: 3 ( Processing Error if 0 )
dwThreadAMPrev[1]: 3 ( Processing Error if 0 )
Test Completed in 19172 ticks
> Test1017 End <
Tests: Completed
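The figures in the log can be reproduced from the two RDTSC columns. The sketch below recomputes each difference as Thread 2 minus Thread 1, then the smallest and the average magnitude. Note that these are raw TSC ticks: expressing them in nanoseconds requires the TSC frequency, which is not stated in this thread, so the `ticks_to_ns` helper and its frequency argument are assumptions for illustration only.

```python
# RDTSC values copied from the report: iteration, Thread 1, Thread 2.
LOG_ROWS = """\
00 11613344836800 11613344835972
01 11613344836992 11613344836188
02 11613344837112 11613344836404
03 11613344837232 11613344836500
04 11613344837340 11613344836596
05 11613344837460 11613344836704
06 11613344837580 11613344836824
07 11613344837688 11613344836932
08 11613344837796 11613344837076
09 11613344837904 11613344837172
10 11613344838012 11613344837292
11 11613344838156 11613344837412
12 11613344838300 11613344837520
13 11613344838444 11613344837640
14 11613344838600 11613344837760
15 11613344838744 11613344837868
"""

def differences(log_text):
    """Return Thread 2 minus Thread 1 for every row, in TSC ticks."""
    diffs = []
    for line in log_text.strip().splitlines():
        _iteration, thread_1, thread_2 = line.split()
        diffs.append(int(thread_2) - int(thread_1))
    return diffs

def ticks_to_ns(ticks, tsc_hz):
    """Convert ticks to nanoseconds for a given (assumed) TSC frequency."""
    return ticks * 1e9 / tsc_hz

diffs = differences(LOG_ROWS)
smallest = min(diffs, key=abs)
average_magnitude = sum(abs(d) for d in diffs) / len(diffs)
print(smallest)            # -708
print(average_magnitude)   # 768.75
```

The tick counts equal nanoseconds only if the counter runs at exactly 1 GHz. At a hypothetical 3 GHz, for example, `ticks_to_ns(708, 3_000_000_000)` gives 236 ns.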
Roman_Oderov
Beginner
813 Views
To Sergey Kostrov: Thanks for the detailed description! Yes, I was just going to measure delays.
SergeyKostrov
Valued Contributor II
813 Views
One more note regarding negative values for differences:
>>...
>>Iteration Thread 1 Thread 2 Difference
>>...
>>02 11613344837112 11613344836404 -708 <= Smallest difference
>>...
A negative value such as -708 means that Thread '2' started first and Thread '1' started second. The Windows scheduler starts threads one at a time.
Bernard
Valued Contributor I
813 Views
>>>different logical CPUs.>>> What do you mean by "Logical CPU"? I suppose that you are referring to the HT cores of a multicore processor, because logical processors can concurrently run threads which are not accessing the x87 FPU and SIMD vector units. These logical cores (HT) each have their own APIC, GP register, and control register state.
SergeyKostrov
Valued Contributor II
813 Views
>>What do you mean by "Logical CPU"? >>I suppose that you are referring to HT cores of multicore processor... My development computer has one physical CPU and Windows Task Manager shows two CPUs ( logical ). Is there something wrong here?
SergeyKostrov
Valued Contributor II
813 Views
To Roman Oderov: I wonder if you will be able to post results for the Test-Case #4. Also, in about 2-3 weeks I'll be able to execute these tests on a new computer with a 3rd generation Intel CPU.
Roman_Oderov
Beginner
803 Views
Sergey, I'll try to post my results as soon as possible
Bernard
Valued Contributor I
803 Views
>>> Is there something wrong here?>>> No, it's ok :) I was thinking about the newest Sandy Bridge CPUs, which have multiple cores with two HT "units" each. I thought that you had such a CPU. If you are interested, you can test HT scaling once you have a Sandy Bridge processor. Such a test could verify the inability to scale well when heavy floating-point calculation is executed on a single hyperthreaded core.