Hello everyone!
I need to synchronize time between processors in a multicore system, i.e. I have to calculate the TSC differences of all processors relative to one of them.
I tried rdtsc(), but it returns the TSC of the current processor. Is there any way to get the TSC of a specific processor? Or maybe I can determine the processor ID somewhere and use the corresponding time stamp counter value.
Thanks in advance,
Roman
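Roman's goal, the offset of each processor's TSC relative to one reference processor, is commonly estimated with a round-trip exchange (a general technique, not something proposed in this thread): a thread on the reference CPU reads its TSC (t0) and signals a thread pinned to the other CPU, which reads its own TSC (tr) and replies; the reference thread then reads its TSC again (t1). Assuming the two one-way delays are about equal, the offset is roughly tr - (t0 + t1)/2, give or take (t1 - t0)/2. The sketch below covers only that arithmetic; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result: estimated offset of a remote CPU's TSC relative to
   the reference CPU's TSC, and an error bound (both in ticks). */
typedef struct {
    int64_t offset;
    int64_t uncertainty;
} tsc_offset_t;

/* t0: reference TSC just before signalling the remote CPU
   tr: remote TSC read when the signal arrived
   t1: reference TSC just after the remote CPU's reply arrived
   Assumes symmetric one-way delays, so on the reference clock the remote
   read happened about halfway between t0 and t1. */
static tsc_offset_t estimate_tsc_offset(uint64_t t0, uint64_t tr, uint64_t t1)
{
    assert(t1 >= t0);
    int64_t half_rtt = (int64_t)((t1 - t0) / 2);
    int64_t midpoint = (int64_t)t0 + half_rtt;
    tsc_offset_t r = { (int64_t)tr - midpoint, half_rtt };
    return r;
}

/* Of n exchanges, keep the one with the shortest round trip (tightest bound). */
static tsc_offset_t best_offset(const uint64_t t0[], const uint64_t tr[],
                                const uint64_t t1[], int n)
{
    assert(n > 0);
    tsc_offset_t best = estimate_tsc_offset(t0[0], tr[0], t1[0]);
    for (int i = 1; i < n; i++) {
        tsc_offset_t cur = estimate_tsc_offset(t0[i], tr[i], t1[i]);
        if (cur.uncertainty < best.uncertainty)
            best = cur;
    }
    return best;
}
```

Repeating the exchange many times and keeping the sample with the smallest round trip, as best_offset does, tightens the bound; the readings themselves would come from RDTSC executed on threads pinned to the two CPUs.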
---
Let's assume a call to the Win32 API function 'QueryPerformanceCounter' is made on a multi-core system. What value is it going to return: the TSC of CPU1, CPU2, CPU3, etc.? Let's also assume that I don't explicitly set a CPU for execution.
I'll do another set of tests and try to predict the TSC value for CPU2, for example.
---
'...Intel guarantees that the time-stamp counter will not wraparound within 10 years after being reset...'
I've calculated that for a CPU with a 3GHz clock speed a wraparound would take ~195 years.
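The TSC is a 64-bit counter, so at a constant 3 GHz rate it wraps after 2^64 / (3 x 10^9) ≈ 6.15 x 10^9 seconds. A small C sketch of that arithmetic (the function name is hypothetical; it assumes 365-day years):

```c
#include <assert.h>

/* Years until a 64-bit counter incremented freq_hz times per second wraps.
   Uses 365-day years; 2^64 is written out as a double constant. */
static double tsc_wrap_years(double freq_hz)
{
    assert(freq_hz > 0.0);
    const double counter_range = 18446744073709551616.0; /* 2^64 ticks */
    const double seconds_per_year = 365.0 * 24.0 * 3600.0;
    return counter_range / freq_hz / seconds_per_year;
}
```

tsc_wrap_years(3.0e9) comes out at about 194.98 years, far beyond the 10-year guarantee quoted above.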
---
>>>>Let's assume a call to a Win32 API function 'QueryPerformanceCounter' has to be done on a multi-core system. What value is it going to return? A TSC of CPU1, CPU2, CPU3, etc? Let's also assume that I don't set a CPU for execution explicitly.>>>
Probably the TSC value of the CPU that executes the current thread, which in turn is executing the machine code of "QueryPerformanceCounter". So it could be an arbitrary CPU.
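To make the source CPU well defined, the thread can be pinned before reading the counter. A minimal sketch, assuming Linux, an x86 CPU, and GCC or Clang (it uses sched_setaffinity and the __rdtsc intrinsic); on Windows the analogous call is SetThreadAffinityMask. The function name is hypothetical:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdint.h>
#include <x86intrin.h>

/* Pin the calling thread to `cpu`, read the TSC there, then restore the
   thread's previous affinity mask. Returns 0 on success, -1 on failure. */
static int read_tsc_on_cpu(int cpu, uint64_t *out)
{
    assert(out != NULL);
    cpu_set_t old_set, new_set;
    if (sched_getaffinity(0, sizeof old_set, &old_set) != 0)
        return -1;
    CPU_ZERO(&new_set);
    CPU_SET(cpu, &new_set);
    if (sched_setaffinity(0, sizeof new_set, &new_set) != 0)
        return -1;
    /* Per sched_setaffinity(2), a thread not running on a CPU in the mask
       is migrated to one, so this RDTSC executes on `cpu`. */
    *out = __rdtsc();
    (void)sched_setaffinity(0, sizeof old_set, &old_set);
    return 0;
}
```

Calling it for each CPU in turn yields readings taken at different moments, so those values cannot be compared directly to get offsets; they only tell you which CPU produced each reading.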
---
>>...Well, in older processors I can't still rely on TSC's of different cores without manual synchronization...
What about the RESET signal that sets the TSC to 0? Does it mean that on a multi-core system the RESET signal occurs at different times for different CPUs?
- T0 for CPU0
- T0+some-delay1 for CPU1
- T0+some-delay2 for CPU2, etc?
How is that possible? Could Intel hardware engineers explain it clearly?
---
>>Probably TSC value of the CPU which executes current context thread which is in turn executing machine code of
>>"QueryPerformanceCounter" .So it could be an arbitrary CPU.
I agree with that. However, I can't find any explanation in the Intel manuals for:
1. How many TSC registers exist on a multi-core system with many logical CPUs? Is there just one, shared between all logical cores? ( in that case the TSCs are synchronized by default )
2. Does every logical CPU have its own independent TSC register? Could different TSCs have different values at some time Tn?
3. What about a case when a system has many physical CPUs and every physical CPU has at least two logical CPUs?
---
The Intel CPUs we've seen share the TSC resource between hyperthreads and share the bus time clock among cores. Synchronization between sockets depends on actions taken by the OS and on bus clock accuracy. For what it's worth, http://download.intel.com/embedded/software/IA/324264.pdf presents some recommendations for Linux, but the authors detract from their credibility with confusing details such as careless switching between IA-64 and Intel64 terminology.
It's not at all clear how QueryPerformanceCounter is implemented, but it hides some annoying differences among CPU families and covers up synchronization problems, as well as eliminating the question of serialization, at a large cost in overhead.
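The serialization issue comes from RDTSC not being a serializing instruction. When the TSC is read directly rather than through QueryPerformanceCounter, the read can be fenced as the Intel SDM describes for RDTSC. A sketch assuming GCC or Clang on x86 (function names hypothetical):

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>

/* RDTSC is not serializing, so it can execute before earlier instructions
   finish or after later ones start. The Intel SDM describes LFENCE before
   RDTSC (wait for prior instructions) and LFENCE after it (keep later
   instructions from starting early). */
static inline uint64_t rdtsc_fenced(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* Elapsed ticks between two readings. Only meaningful if both were taken
   on the same CPU or on CPUs whose TSCs are known to be synchronized. */
static uint64_t tsc_elapsed(uint64_t start, uint64_t end)
{
    assert(end >= start);
    return end - start;
}
```

Bracketing the code under test with two rdtsc_fenced() calls and passing both readings to tsc_elapsed gives the tick count for that region, provided the thread stays on one CPU.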
---
>>>. How many TSC registers exist on a multi-core system with many logical CPUs?>>>
By "logical CPU", do you mean HT?
>>>These ticks cannot be measured on a logical-processor basis.>>>
You cannot sample HT logical cores.
>>>What about a case when a system has many physical CPUs and every physical CPU has at least two logical CPUs?>>>
If I remember correctly, a logical CPU is an HT logical core with reduced resources. Every HT core has its own APIC and GP registers, but does not have its own vector SIMD units or x87 FPU.
---
>>>It's not at all clear how QueryPerformanceCounter is implemented,>>>
QueryPerformanceCounter could be disassembled and statically or dynamically analyzed in order to understand its implementation. I suppose that this function could use the HPET timer.
---
>>...The Intel CPUs we've seen share tsc resource between hyperthreads and share the buss time clock among cores...
Thank you, Tim. This is what I wanted to understand. Unfortunately, Intel's manuals don't describe all TSC-related issues in a multi-core environment.
---
>>>>These ticks cannot be measured on a logical-processor basis.
>>
>>You cannot sample HT logical cores.
I've created another test case ( #4 ), and the source code will be provided.
---
To Roman Oderov:
Roman, I didn't try to synchronize RDTSC values for different CPUs, but I tried to evaluate delays during the execution of two threads on two different logical CPUs. If you execute Test-Case #4 you will get different numbers. Keep in mind that it is a non-deterministic test, so the results are always different.
---
Here is a detailed high-level description of Test-Case #4:
- a computer system with a 32-bit Windows OS has one physical CPU with two logical CPUs
- an Event synchronization object is created in the Non-Signaled state
- two Threads '1' and '2' are created in the Suspended state
- execution of the Threads is resumed, but as soon as processing starts the Threads wait ( about 5 seconds ) until the Event synchronization object changes its state to Signaled
- thread affinity masks are set: Thread '1' is assigned to CPU1 and Thread '2' is assigned to CPU2
- priorities of the current Process and the Threads are changed to Real-Time
- after a 5-second delay the state of the Event synchronization object is changed to Signaled
- both threads begin processing ( almost at the same time! ) and each records 16 RDTSC values
- for every RDTSC value an ID ( the iteration number ) is stored as well
- when processing is completed all allocated resources ( handles ) are closed
- if there are no processing errors, some statistics are displayed
- even though both threads are executed with Real-Time priorities on different logical CPUs, there are always differences in RDTSC values for iterations with the same ID
- the smallest difference I was able to record is ~708 nanoseconds ( 0.708 microseconds )
- the smallest average difference I was able to record is ~768.75 nanoseconds ( 0.76875 microseconds )
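A rough Linux analogue of the steps above, assuming POSIX threads, GCC or Clang on x86, and glibc's pthread_setaffinity_np. A spinning atomic flag stands in for the Win32 Event, pinning is best effort, and the Real-Time priority step is omitted because it normally needs elevated privileges. All names are hypothetical; this is not the Test-Case #4 source.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <x86intrin.h>

#define SAMPLES 16 /* 16 RDTSC values per thread, as in the description */

/* Hypothetical per-thread context; the iteration ID is the array index. */
typedef struct {
    int cpu;           /* CPU to pin to (best effort) */
    atomic_int *ready; /* counts workers that reached the start gate */
    atomic_int *go;    /* start gate; plays the role of the Win32 Event */
    uint64_t tsc[SAMPLES];
} worker_t;

static void *worker(void *arg)
{
    worker_t *w = arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(w->cpu, &set);
    /* Ignore failure, e.g. when fewer CPUs are available than requested. */
    (void)pthread_setaffinity_np(pthread_self(), sizeof set, &set);

    atomic_fetch_add(w->ready, 1);
    while (!atomic_load_explicit(w->go, memory_order_acquire))
        ; /* busy-wait so both threads leave the gate nearly together */
    for (int i = 0; i < SAMPLES; i++)
        w->tsc[i] = __rdtsc();
    return NULL;
}

/* Runs two workers (Linux CPUs 0 and 1) and copies out their samples.
   Returns 0 on success, -1 if a thread could not be created. */
static int run_two_thread_rdtsc(uint64_t t1[SAMPLES], uint64_t t2[SAMPLES])
{
    assert(t1 != NULL && t2 != NULL);
    atomic_int ready = 0, go = 0;
    worker_t w[2] = {
        { .cpu = 0, .ready = &ready, .go = &go },
        { .cpu = 1, .ready = &ready, .go = &go },
    };
    pthread_t th[2];
    int created = 0;
    while (created < 2 &&
           pthread_create(&th[created], NULL, worker, &w[created]) == 0)
        created++;

    while (atomic_load(&ready) < created)
        sched_yield(); /* wait until every worker is at the gate */
    atomic_store_explicit(&go, 1, memory_order_release); /* "signal the Event" */
    for (int k = 0; k < created; k++)
        pthread_join(th[k], NULL);
    if (created < 2)
        return -1;

    for (int i = 0; i < SAMPLES; i++) {
        t1[i] = w[0].tsc[i];
        t2[i] = w[1].tsc[i];
    }
    return 0;
}
```

The per-iteration difference is then t2[i] - t1[i], the same quantity as the Difference column in Sergey's log.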
---
Processing report log file attached.
PS: This is what it looks like:
Application - ScaLibTestApp - WIN32_MSC - Release
Tests: Start
> Test1017 Start <
Sub-Test 59
...
Test-Case 4 - Retrieving RDTSC values for CPUs - 2
Threads 1 and 2 created
Iteration  Thread 1        Thread 2        Difference
00         11613344836800  11613344835972  -828
01         11613344836992  11613344836188  -804
02         11613344837112  11613344836404  -708  <= Smallest difference
03         11613344837232  11613344836500  -732
04         11613344837340  11613344836596  -744
05         11613344837460  11613344836704  -756
06         11613344837580  11613344836824  -756
07         11613344837688  11613344836932  -756
08         11613344837796  11613344837076  -720
09         11613344837904  11613344837172  -732
10         11613344838012  11613344837292  -720
11         11613344838156  11613344837412  -744
12         11613344838300  11613344837520  -780
13         11613344838444  11613344837640  -804
14         11613344838600  11613344837760  -840
15         11613344838744  11613344837868  -876
Statistics:
Thread 1 started at 11613344836644
Thread 2 started at 11613344835096
Difference 1548
Thread 1 completed at 11613344838924
Thread 2 completed at 11613344838012
Difference 912
dwThreadAMPrev[0]: 3 ( Processing Error if 0 )
dwThreadAMPrev[1]: 3 ( Processing Error if 0 )
Test Completed in 19172 ticks
> Test1017 End <
Tests: Completed
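The summary figures in the Test-Case #4 description can be recomputed from this log. A small C sketch (names hypothetical) that copies the 16 Thread 1 and Thread 2 values, takes Difference = Thread 2 - Thread 1 as the log does, and reports the smallest and the mean absolute difference:

```c
#include <assert.h>
#include <stdint.h>

#define N_ITER 16

/* RDTSC values copied from the log above, iterations 00..15. */
static const int64_t thread1[N_ITER] = {
    11613344836800, 11613344836992, 11613344837112, 11613344837232,
    11613344837340, 11613344837460, 11613344837580, 11613344837688,
    11613344837796, 11613344837904, 11613344838012, 11613344838156,
    11613344838300, 11613344838444, 11613344838600, 11613344838744 };
static const int64_t thread2[N_ITER] = {
    11613344835972, 11613344836188, 11613344836404, 11613344836500,
    11613344836596, 11613344836704, 11613344836824, 11613344836932,
    11613344837076, 11613344837172, 11613344837292, 11613344837412,
    11613344837520, 11613344837640, 11613344837760, 11613344837868 };

typedef struct {
    int64_t min_abs; /* smallest |difference| in ticks */
    int min_iter;    /* iteration where it occurred */
    double mean_abs; /* mean |difference| in ticks */
} diff_stats_t;

static diff_stats_t diff_stats(const int64_t *t1, const int64_t *t2, int n)
{
    assert(n > 0);
    diff_stats_t s = { INT64_MAX, -1, 0.0 };
    int64_t sum = 0;
    for (int i = 0; i < n; i++) {
        int64_t d = t2[i] - t1[i]; /* "Difference" column: Thread 2 - Thread 1 */
        int64_t a = d < 0 ? -d : d;
        sum += a;
        if (a < s.min_abs) {
            s.min_abs = a;
            s.min_iter = i;
        }
    }
    s.mean_abs = (double)sum / n;
    return s;
}
```

For this log it yields 708 ticks at iteration 02 and a mean of 768.75 ticks, the ~708 and ~768.75 figures quoted in the description.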
---
To Sergey Kostrov:
Thanks for the detailed description!
Yes, I was just going to measure delays.
---
One more note regarding negative values for differences:
>>...
>>Iteration Thread 1 Thread 2 Difference
>>...
>>02 11613344837112 11613344836404 -708 <= Smallest difference
>>...
A negative value ( -708 ) means that Thread '2' started first and Thread '1' started second. The Windows thread scheduler starts threads one at a time.
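The Difference column is in TSC ticks, so its sign only tells which thread read its counter first. Converting ticks to time needs the TSC frequency, which the log does not state: 708 ticks equal 708 ns only if the TSC runs at 1 GHz, and at a hypothetical 3 GHz they would be 236 ns. A minimal conversion sketch (function name hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Convert a (possibly negative) TSC tick count to nanoseconds for a TSC
   that increments tsc_hz times per second. tsc_hz must be measured or
   looked up for the actual machine; it is not part of the log. */
static double ticks_to_ns(int64_t ticks, double tsc_hz)
{
    assert(tsc_hz > 0.0);
    return (double)ticks * 1e9 / tsc_hz;
}
```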
---
>>>different logical CPUs.>>>
What do you mean by saying "logical CPU"?
I suppose that you are referring to the HT cores of a multicore processor, because a logical processor can concurrently run threads which are not accessing the x87 FPU and SIMD vector units. These logical (HT) cores have their own APIC, GP, and control register state.
---
>>What Do you mean by saying "Logical CPU"?
>>I suppose that you are reffering to HT cores of multicore processor...
My development computer has one physical CPU and Windows Task Manager shows two CPUs ( logical ). Is there something wrong here?
---
To Roman Oderov:
I wonder if you will be able to post results for Test-Case #4.
Also, in about 2-3 weeks I'll be able to execute these tests on a new computer with a 3rd generation Intel CPU.
---
Sergey, I'll try to post my results as soon as possible.
---
>>> Is there something wrong here?>>>
No, it's OK :)
I was thinking about the newest Sandy Bridge CPUs, which have multiple cores with two HT "units" each. I thought that you had such a CPU.
If you are interested, you can test HT scaling when you have a Sandy Bridge processor. Such a test could verify the inability to scale well when heavy floating-point calculations are executed on a single hyperthreaded core.
