- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi all,
I've done a bit of testing of the Threading Building Blocks library on my Core i7 and found that the queuing_mutex isn't any faster than either a simple exponential backoff lock or a ticket lock. So I wonder why that is and if there's any faster alternative out there (in particular for implementing fairly short but contested critical sections).
Is it no faster because it has a higher overhead and the effect of spinning on a local cache line is rather small for 8 threads and only becomes more significant with a higher number of cores? Or does the Core i7's snoop filter already compensate for not spinning on a local cache line for the other locks?
Any explanation and advice for creating a (very) high performance critical section would be much appreciated!
Thanks,
Nicolas
I've done a bit of testing of the Threading Building Blocks library on my Core i7 and found that the queuing_mutex isn't any faster than either a simple exponential backoff lock or a ticket lock. So I wonder why that is and if there's any faster alternative out there (in particular for implementing fairly short but contested critical sections).
Is it no faster because it has a higher overhead and the effect of spinning on a local cache line is rather small for 8 threads and only becomes more significant with a higher number of cores? Or does the Core i7's snoop filter already compensate for not spinning on a local cache line for the other locks?
Any explanation and advice for creating a (very) high performance critical section would be much appreciated!
Thanks,
Nicolas
Link Copied
26 Replies
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
If you want advice about TBB, the TBB forum is the place. I'm sure the people there have dealt with these questions. Does TBB have environment variables to adjust the mutex parameters?
Core i7 was designed to reduce the latency of locks, which may imply that it's not so critical how you go about it.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Quoting - c0d1f1ed
Hi all,
I've done a bit of testing of the Threading Building Blocks library on my Core i7 and found that the queuing_mutex isn't any faster than either a simple exponential backoff lock or a ticket lock. So I wonder why that is and if there's any faster alternative out there (in particular for implementing fairly short but contested critical sections).
Is it no faster because it has a higher overhead and the effect of spinning on a local cache line is rather small for 8 threads and only becomes more significant with a higher number of cores? Or does the Core i7's snoop filter already compensate for not spinning on a local cache line for the other locks?
Any explanation and advice for creating a (very) high performance critical section would be much appreciated!
Thanks,
Nicolas
I've done a bit of testing of the Threading Building Blocks library on my Core i7 and found that the queuing_mutex isn't any faster than either a simple exponential backoff lock or a ticket lock. So I wonder why that is and if there's any faster alternative out there (in particular for implementing fairly short but contested critical sections).
Is it no faster because it has a higher overhead and the effect of spinning on a local cache line is rather small for 8 threads and only becomes more significant with a higher number of cores? Or does the Core i7's snoop filter already compensate for not spinning on a local cache line for the other locks?
Any explanation and advice for creating a (very) high performance critical section would be much appreciated!
Thanks,
Nicolas
If you are talking more than 8 HW threads you are talking multi-socket (e.g. Xeon X5570).
Witheach L1 shared by two HT threads, andtwo L2 caches each shared by four HT threads, andone L3 shared by all eightHT threads within each processorbut each eight HT threads of each processor not sharing cache of other processor(s) it becomes clearly apparent that not all threads are equal in accesses to mutexor interference with other threadswhen it comes to manipulating or testing a spinlock.
Oneway to reduce the latencies is to sub-divide the tasks and their accompanying spinlocks such that the reach to the spinlock is short (through cache system) with respect to other threads accessing that spinlock.
In order to accomplish this you will need a threading environment that can classify the hardware threads by cache distance and by doing so you can create subsets of the thread pool. Then for any given set of tasks using a particular mutex it may be advantageous to pick a sub-set of the thread pool whose members are within a selected cache reach distance of each other(as opposed to permitting any thread to grab the task).
There arefew experimental threading tools available that permit this degree of thread scheduling. One such tool that I am involved with is QuickThread. If this issue is critical to your requirements, and if you are willing to work with Alpha test software, then email me with your requirements and programming environment and then I could ascertain if you would be suitable as an Alpha test candidate.
Jim Demspey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I'm not actually looking for advice on using TBB, but rather general information about various lock implementationsand why Intel processors behave they do. I only mentioned TBB because it has a queuing lock which is supposed to bea superioralgorithmbut I found no proof of that.But if you believe the TBB forum is a better place to get in-depth information on locking mechanisms and how cache coherency affects them then I'd be happy to post my question again there. Thanks.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Quoting - c0d1f1ed
I'm not actually looking for advice on using TBB, but rather general information about various lock implementationsand why Intel processors behave they do. I only mentioned TBB because it has a queuing lock which is supposed to bea superioralgorithmbut I found no proof of that.But if you believe the TBB forum is a better place to get in-depth information on locking mechanisms and how cache coherency affects them then I'd be happy to post my question again there. Thanks.
What kind of a test did you use to measure performance (or rather overhead) of different locks?
Queuing mutex algorithm is fair, and that might cause problems with some tests that do not take this into account. For example, if you update some global variable inside the critical section protected by the queuing lock, each time this variable will be accessed and changed by a different thread (due to lockfairness) and so each time you have a miss on a certain cache level. Unfair mutex, especially in a test setup with little amount of work _outside_ of critical section, often is re-acquired back by the same thread that just released it, thus causing much less cache misses.
If you want to measure pure overhead of a mutex without side effects due tothe nature of the work inside critical section, this workshould also be "pure", such as changing a local, not sharedvariable (make it volatile so that the compiler won't optimize out accesses to it). Another good idea is to measure throughput (the number of operations in certain time) rather than wallclock time for a given amount of work. Third, a good benchmark should vary the amount of work both inside and outside the critical section, to vary the degree of contention and evaluate different scenarios of usage.
Unfortunately, lock fairness might cause performance issues in real applications as well, as those usually do change some shared data in critical sections. On the other hand, real applications rarely have the same degree of contention as micro-benchmarks to test a mutex, so fairness might be less of an issue; but then local spinning would equally give less benefit, if any.
Another problem of fair locksimplementations is convoying,especiallyin case of system oversubscription. With an unfair lock, it does not matter which threads execute at this very moment, because each one could equally well get into the critical section next. With a fair lock, if the thread that is next to acquire the mutex was preempted, no other thread could make any progress as well. Therefore good implementations of fair mutexes should yield from time to time during spinning. But equally good implementations of non-fair spin mutexes should yield, to reduce contention on the shared flag(s) used for spinning.
The bottom line is probably that fairness of a mutex has its cost, and that I am not sure whether queuing mutex algorithm is really superior to others.
Sorry for not saying any word about Intel Core i7 processor, nor other Intel processors :).
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Good point on the first-come-first-serve (i.e. fair mutex) with respect to preemption while waiting for mutex. Although this can at times have severe degredation, the driving reason for a fair mutex is to make certain that no thread, or sub-set of threads, will be held off from acquiring the mutex for an inordinately long time (in some cases never get access to the resource).
Another type of mutex would be a sequencing (or pecking order) mutex where each thread obtains a pecking order number and the fairness method is set to sequence by sequence number. Example, if using OpenMP thread team member number and team members 0,1 and 3 are to share mutex (not 2) then the pecking order could be 0,1,3,0,1,3,... This type of sequencing (fairness) can improve the probability of convoying. Although it is subject to sensitivity to thread preemption.
The choice of mutex is dependent on probabilities and needs. There is no one "is faster" method when you really are looking for "is best for this application".
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Quoting - jimdempseyatthecove
If you are talking more than 8 HW threads you are talking multi-socket (e.g. Xeon X5570).
Witheach L1 shared by two HT threads, andtwo L2 caches each shared by four HT threads, andone L3 shared by all eightHT threads within each processorbut each eight HT threads of each processor not sharing cache of other processor(s) it becomes clearly apparent that not all threads are equal in accesses to mutexor interference with other threadswhen it comes to manipulating or testing a spinlock.
Oneway to reduce the latencies is to sub-divide the tasks and their accompanying spinlocks such that the reach to the spinlock is short (through cache system) with respect to other threads accessing that spinlock.
In order to accomplish this you will need a threading environment that can classify the hardware threads by cache distance and by doing so you can create subsets of the thread pool. Then for any given set of tasks using a particular mutex it may be advantageous to pick a sub-set of the thread pool whose members are within a selected cache reach distance of each other(as opposed to permitting any thread to grab the task).
There arefew experimental threading tools available that permit this degree of thread scheduling. One such tool that I am involved with is QuickThread. If this issue is critical to your requirements, and if you are willing to work with Alpha test software, then email me with your requirements and programming environment and then I could ascertain if you would be suitable as an Alpha test candidate.
Witheach L1 shared by two HT threads, andtwo L2 caches each shared by four HT threads, andone L3 shared by all eightHT threads within each processorbut each eight HT threads of each processor not sharing cache of other processor(s) it becomes clearly apparent that not all threads are equal in accesses to mutexor interference with other threadswhen it comes to manipulating or testing a spinlock.
Oneway to reduce the latencies is to sub-divide the tasks and their accompanying spinlocks such that the reach to the spinlock is short (through cache system) with respect to other threads accessing that spinlock.
In order to accomplish this you will need a threading environment that can classify the hardware threads by cache distance and by doing so you can create subsets of the thread pool. Then for any given set of tasks using a particular mutex it may be advantageous to pick a sub-set of the thread pool whose members are within a selected cache reach distance of each other(as opposed to permitting any thread to grab the task).
There arefew experimental threading tools available that permit this degree of thread scheduling. One such tool that I am involved with is QuickThread. If this issue is critical to your requirements, and if you are willing to work with Alpha test software, then email me with your requirements and programming environment and then I could ascertain if you would be suitable as an Alpha test candidate.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Quoting - Alexey Kukanov (Intel)
Queuing mutex algorithm is fair...
Thanks for pointing that out, I didn't takeit into account. My primary goal for the lock is to write asort oftask scheduler. I currently don't care which thread gets the task (although Jim made it clear to me that a cache aware choice might be better). I assumed that a first-come-first-serve lock would automatically be high performance since the ones that come later don't fight over the lock ownership. But that's probably not true because they still fight to get added to the queue first?
I've also read something about hierarchical locks. In case of 8 threads there would be three 'levels' of sublocks and at each level there's only one other thread that can attempt to acquire the sublock simultaneously, making it faster overall. So far I haven't had any luck with such approach either though. But maybe I'm just confused about cache coherency and the mechanisms behind atomic operations... Do you know of any resources that explains these things in detail from a practical programming point of view (instead of a purely hardware point of view)? Thanks.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Fair mutexes are always slower (well, ok, not faster) than unfair mutexes. Additional guarantees do require additional overheads. It's like real-time systems are always slower than non real-time, or QoS is always slower than Best Effort.
However smart distributed design is different thing. Unfair mutex can be smart and distributed too.
I think that for short critical sections primitive spin-mutex is the best you can find, because additional smartness (read overheads) will not be compensated provided short critical sections.
But the problem is that no mutex is really fast and scalable, centralized shared structures are just non scalable. Period.
If you need performance and scalability you better look not at the "fast" mutexes, but at the techniques like privatization, partitioning, immutability, distribution, amortization, copy-on-write, partial-copy-on-write, lock-free reader pattern, etc.
However smart distributed design is different thing. Unfair mutex can be smart and distributed too.
I think that for short critical sections primitive spin-mutex is the best you can find, because additional smartness (read overheads) will not be compensated provided short critical sections.
But the problem is that no mutex is really fast and scalable, centralized shared structures are just non scalable. Period.
If you need performance and scalability you better look not at the "fast" mutexes, but at the techniques like privatization, partitioning, immutability, distribution, amortization, copy-on-write, partial-copy-on-write, lock-free reader pattern, etc.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Quoting - Dmitriy Vyukov
But the problem is that no mutex is really fast and scalable, centralized shared structures are just non scalable. Period.
Not absolutely. Mutex implementations can be fast (i.e. have lowoverhead) and can bescalable (i.e. do not increase the overhead as the number of competing threads increase). But programs that use centralized shared structures protected by a mutex are non-scalable.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
>Mutex implementations can be fast (i.e. have lowoverhead) and can bescalable (i.e. do not increase the overhead as the number of competing threads increase).
When your mutex are partitioned such that no more than m threads access the mutex, regardless of the number of threads on the system then yes the mutex is scalable.
When the number of threads sharing the mutex increase with the number of threads in the system and when contention is low then it can be scalable. i.e. the CAS or other method to gain the lock has almost certainty of success.
When the number of threads sharing the mutex increase with the number of threads in the system and when contention is high then it may not necessarily be scalable. i.e. the CAS or other method to gain the lock has uncertainty of success.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
"When your mutex are partitioned such that no more than m threads access the mutex, regardless of the number of threads on the system then yes the mutex is scalable."
Isn't any implementation of any data structure "scalable" in such situation? IMVHO, the term 'scalable' is basically senseless in such context. So I propose to define 'scalability' as performance as a function of a number of contending threads. So mutex can be non-scalable in such situation, but just nobody cares.
"When the number of threads sharing the mutex increase with the number of threads in the system and when contention is low then it can be scalable. i.e. the CAS or other method to gain the lock has almost certainty of success."
Ok, I see, according to your definition the most scalable data structure is the one that is not used in the program (is has the lowest contention) :)
I see your point that from an application point of view extreme scalability of data structure is just not always required. But in such situations people are not saying about performance and scalability... because it is not required. But since people are already talking about "very high performance critical section", I assume that they need indeed fast and scalable solutions.
Isn't any implementation of any data structure "scalable" in such situation? IMVHO, the term 'scalable' is basically senseless in such context. So I propose to define 'scalability' as performance as a function of a number of contending threads. So mutex can be non-scalable in such situation, but just nobody cares.
"When the number of threads sharing the mutex increase with the number of threads in the system and when contention is low then it can be scalable. i.e. the CAS or other method to gain the lock has almost certainty of success."
Ok, I see, according to your definition the most scalable data structure is the one that is not used in the program (is has the lowest contention) :)
I see your point that from an application point of view extreme scalability of data structure is just not always required. But in such situations people are not saying about performance and scalability... because it is not required. But since people are already talking about "very high performance critical section", I assume that they need indeed fast and scalable solutions.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
>>Ok, I see, according to your definition the most scalable data structure is the one that is not used in the program (is has the lowest contention) :)
That is the critical point.
There are two distinct levels of scalability
1) The outward appearance of the application as a whole (how well it runs on 16 cores vs 4 cores)
2) The inward effect on a given critical section
Division of labor (task) is an effective way to reduce, and in some cases, eliminate contention for a mutex.
Jim
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Quoting - jimdempseyatthecove
There are two distinct levels of scalability
1) The outward appearance of the application as a whole (how well it runs on 16 cores vs 4 cores)
2) The inward effect on a given critical section
Division of labor (task) is an effective way to reduce, and in some cases, eliminate contention for a mutex.
1) The outward appearance of the application as a whole (how well it runs on 16 cores vs 4 cores)
2) The inward effect on a given critical section
Division of labor (task) is an effective way to reduce, and in some cases, eliminate contention for a mutex.
Agree, partitioning, privatization and distribution (along with immutability, amortization, copy-on-write, partial-copy-on-write, lock-free reader pattern, etc.) is the only way to scalability. Sometimes they are just inherent to the application (think of the parallel quick sort - work is inherently distributed -> scalable), and sometimes they are not (thus we have to introduce them artificially - think of the memory allocator - we introduce per thread caches, "efficient" mutex are of no help here).
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Quoting - jimdempseyatthecove
When your mutex are partitioned such that no more than m threads access the mutex, regardless of the number of threads on the system then yes the mutex is scalable.
Nope.
Consider following test which mimics such situation. In test 1, 2 threads access var1. In test 2, 2 threads access var1 and concurrently other 2 threads access var2.
[cpp]#include#include #include #include struct data_t { char pad1 [128]; long var1; char pad2 [128]; long var2; char pad3 [128]; }; data_t g_data; size_t iter_count = 100000000; unsigned __stdcall thread_func(void* var) { for (size_t i = 0; i != iter_count; ++i) { _InterlockedIncrement((long*)var); for (size_t p = 0; p != 3; ++p) _mm_pause(); } return 0; } int main() { for (int test = 0; test != 3; ++test) { { HANDLE threads [2]; threads[0] = (HANDLE)_beginthreadex(0, 0, thread_func, &g_data.var1, CREATE_SUSPENDED, 0); threads[1] = (HANDLE)_beginthreadex(0, 0, thread_func, &g_data.var1, CREATE_SUSPENDED, 0); SetThreadAffinityMask(threads[0], 1 << 0); SetThreadAffinityMask(threads[1], 1 << 2); unsigned __int64 t1 = __rdtsc(); ResumeThread(threads[0]); ResumeThread(threads[1]); WaitForMultipleObjects(2, threads, 1, INFINITE); unsigned __int64 t2 = __rdtsc(); CloseHandle(threads[0]); CloseHandle(threads[1]); std::cout << "test 1: " << (t2 - t1) / iter_count << " cycles/oper" << std::endl; } { HANDLE threads [4]; threads[0] = (HANDLE)_beginthreadex(0, 0, thread_func, &g_data.var1, CREATE_SUSPENDED, 0); threads[1] = (HANDLE)_beginthreadex(0, 0, thread_func, &g_data.var1, CREATE_SUSPENDED, 0); threads[2] = (HANDLE)_beginthreadex(0, 0, thread_func2, &g_data.var2, CREATE_SUSPENDED, 0); threads[3] = (HANDLE)_beginthreadex(0, 0, thread_func2, &g_data.var2, CREATE_SUSPENDED, 0); SetThreadAffinityMask(threads[0], 1 << 0); SetThreadAffinityMask(threads[1], 1 << 2); SetThreadAffinityMask(threads[2], 1 << 1); SetThreadAffinityMask(threads[3], 1 << 3); unsigned __int64 t1 = __rdtsc(); ResumeThread(threads[0]); ResumeThread(threads[1]); ResumeThread(threads[2]); ResumeThread(threads[3]); WaitForMultipleObjects(4, threads, 1, INFINITE); unsigned __int64 t2 = __rdtsc(); CloseHandle(threads[0]); CloseHandle(threads[1]); CloseHandle(threads[2]); CloseHandle(threads[3]); std::cout << "test 2: " << (t2 - t1) / iter_count << " cycles/oper" << std::endl; } } }[/cpp]
I've run the code on quad-core Q6600. According to you, run-times of both tests must be equal. Here is results I've got:
test 1: 193 cycles/oper
test 2: 276 cycles/oper
test 1: 192 cycles/oper
test 2: 275 cycles/oper
test 1: 193 cycles/oper
test 2: 277 cycles/oper
Oops! What's happening?!
Even is variables (mutexes) are different, they implicitly share cache-coherence inter-connects. While we have few threads they are scalable, but when the number of threads increases they just saturate inter-connects and this brings the whole system to its knees.
When I've modified the test so that threads 0 and 1 access var1, thread 2 access var2, and thread 3 access var3, I've got following results:
test 1: 191 cycles/oper
test 2: 192 cycles/oper
test 1: 192 cycles/oper
test 2: 184 cycles/oper
test 1: 192 cycles/oper
test 2: 192 cycles/oper
As expected, now system is perfectly scalable, because threads 2 and 3 do not load cache-coherence inter-connects.
If I would have bigger and more distributed system, I would be able to construct more interesting tests (for example, some threads execute interlocked operations, and others access memory).
Some native (and naive) systems today tend to over-use atomic reference counting, this may have exactly the same results. Even if number of objects is large and processing is distributed along objects (contention on each object is small), they may saturate cache-coherence inter-connects and the system will become non-scalable.
That's why I prefer cache-coherence spare techniques, hot-paths of all operations must not produce any load on cache-coherence interconnects, I believe that it's the only way to many-core scalability.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Quoting - Dmitriy Vyukov
When your mutex are partitioned such that no more than m threads access the mutex, regardless of the number of threads on the system then yes the mutex is scalable.
Nope.
Nope.
One important case: if m == 1 then yes, it's scalable.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Dmitriy,
Thanks for taking the time and interest in writing this test.
There are three flaws with your test:
1) very minor flaw -remove "2" off of "thread_func2" (missing function)
2) Medium flaw - The code as written is sensitive to data placement
3) Major flaw - assumption that _mm_pause is equal for all threads.
I did some experimentation on my system (Q6600 Windows XP x64)
[cpp]---------- original struct struct data_t { char pad1 [128]; long var1; char pad2 [128]; long var2; }; x32 &var1 = 004191F0 &var2 = 00419274 test 1: 280 cycles/oper test 2: 304 cycles/oper test 1: 279 cycles/oper test 2: 314 cycles/oper test 1: 279 cycles/oper test 2: 314 cycles/oper x64 &var1 = 0000000140009230 &var2 = 00000001400092B4 test 1: 308 cycles/oper test 2: 315 cycles/oper test 1: 298 cycles/oper test 2: 316 cycles/oper test 1: 307 cycles/oper test 2: 316 cycles/oper ----------- edit to align struct --------- __declspec( align(128) ) struct data_t { long var1; char pad1 [128-sizeof(long)]; long var2; char pad2 [128-sizeof(long)]; }; x64 &var1 = 0000000140009200 &var2 = 0000000140009280 test 1: 308 cycles/oper test 2: 315 cycles/oper test 1: 308 cycles/oper test 2: 308 cycles/oper test 1: 308 cycles/oper test 2: 315 cycles/oper ------- edit to replace _mm_pause with computation ------- for (size_t i = 0; i != iter_count; ++i) { _InterlockedIncrement((long*)var); for (size_t p = 0; p != 3; ++p) if(sqrt((double)(i+p)) == 2.1) printf("boinkn"); } x64 &var1 = 0000000140009200 &var2 = 0000000140009280 test 1: 419 cycles/oper test 2: 421 cycles/oper test 1: 422 cycles/oper test 2: 420 cycles/oper test 1: 418 cycles/oper test 2: 420 cycles/oper ------------- restore struct to original ------------ ------------- (keep sqrt computation ---------------- struct data_t { char pad1 [128]; long var1; char pad2 [128]; long var2; }; x32 &var1 = 004191F0 &var2 = 00419274 test 1: 541 cycles/oper test 2: 537 cycles/oper test 1: 535 cycles/oper test 2: 537 cycles/oper test 1: 541 cycles/oper test 2: 537 cycles/oper x64 &var1 = 0000000140009230 &var2 = 00000001400092B4 test 1: 419 cycles/oper test 2: 422 cycles/oper test 1: 420 cycles/oper test 2: 419 cycles/oper test 1: 418 cycles/oper ?? x32 longer (x32 using FPU) [/cpp]
as you can see, using your original struct and _mm_pause I too experienced the performance difference on x32 code.
Running x64 code on same struct and _mm_pause was inconsistant with x32 run (closer to same run timews)
Tweakingalignment could produce same run times (x64)
Then reverting back to original struct (without __declspec and difference in padd)
** But replacing _mm_pause with small computation
Now x32 has equal run times for both test, and x64 has equal runtimes for both tests.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Now I did some more testing while reducing the computation time of the dummy compute section (simply removed the sqrt). Now the times for test1 and test2 are skewed again.
It looks as though the results are inconclusive.
[cpp]Continued --------- removed sqrt ----------- if(((double)(i+p)) == 2.1) printf("boinkn"); struct data_t { char pad1 [128]; long var1; char pad2 [128]; long var2; }; x32 &var1 = 004071B8 &var2 = 0040723C test 1: 189 cycles/oper test 2: 293 cycles/oper test 1: 187 cycles/oper test 2: 297 cycles/oper test 1: 189 cycles/oper test 2: 297 cycles/oper x64 &var1 = 0000000140009230 &var2 = 00000001400092B4 test 1: 200 cycles/oper test 2: 239 cycles/oper test 1: 200 cycles/oper test 2: 230 cycles/oper test 1: 200 cycles/oper test 2: 238 cycles/oper ------------------------------- __declspec( align(128) ) struct data_t { long var1; char pad1 [128-sizeof(long)]; long var2; char pad2 [128-sizeof(long)]; }; if(((double)(i+p)) == 2.1) printf("boinkn"); x32 &var1 = 00407180 &var2 = 00407200 test 1: 189 cycles/oper test 2: 294 cycles/oper test 1: 189 cycles/oper test 2: 297 cycles/oper test 1: 189 cycles/oper test 2: 297 cycles/oper x64 &var1 = 0000000140009200 &var2 = 0000000140009280 test 1: 200 cycles/oper test 2: 238 cycles/oper test 1: 198 cycles/oper test 2: 237 cycles/oper test 1: 200 cycles/oper test 2: 239 cycles/oper [/cpp]
I think the true answer lies somewhere deeper into the internals of the cache sub-system (on this processor) to think we can get a definitive answer from running this simple test.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
It would appear that as you increase the probability of collision for mutex (in this case _InterlockedIncrement) that you do tend to see a skew from test1 to test2, although there are some placement cases that illustrate contrary to this assumption.
The original premis of my post was that it is advised to reduce the probability of collision through partitioning means or other programming means.
It is of additional interest that in this case we are not testing a mutex. Instead we are testing something less, an attomic add.
BTW MS VS 2005 VC++ x32 uses a LOCK XADD, x64 uses LOCK ADD
I cannot imaging that that would have the effect on the difference of the skew ratios between x32 and x64. It is likely the slight difference in the loop overhead. And this seems to have a drastic difference (in some cases ~40% difference).
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Quoting - jimdempseyatthecove
It would appear that as you increase the probability of collision for mutex (in this case _InterlockedIncrement) that you do tend to see a skew from test1 to test2, although there are some placement cases that illustrate contrary to this assumption.
The original premis of my post was that it is advised to reduce the probability of collision through partitioning means or other programming means.
It is of additional interest that in this case we are not testing a mutex. Instead we are testing something less, an attomic add.
BTW MS VS 2005 VC++ x32 uses a LOCK XADD, x64 uses LOCK ADD
I cannot imaging that that would have the effect on the difference of the skew ratios between x32 and x64. It is likely the slight difference in the loop overhead. And this seems to have a drastic difference (in some cases ~40% difference).
Jim Dempsey
Now when we go back and change the affinities
[cpp]Change thread affinities Test 1 SetThreadAffinityMask(threads[0], 1 << 0); SetThreadAffinityMask(threads[1], 1 << 1); Test 2 SetThreadAffinityMask(threads[0], 1 << 0); SetThreadAffinityMask(threads[1], 1 << 1); SetThreadAffinityMask(threads[2], 1 << 2); SetThreadAffinityMask(threads[3], 1 << 3); No stall loop for (i = 0; i != iter_count; ++i) { _InterlockedIncrement((long*)var); } &var1 = 0041D000 &var2 = 0041D080 test 1: 86 cycles/oper test 2: 87 cycles/oper test 1: 86 cycles/oper test 2: 87 cycles/oper test 1: 86 cycles/oper test 2: 87 cycles/oper ------------------------------ Slight stall for (i = 0; i != iter_count; ++i) { _InterlockedIncrement((long*)var); if(((double)(i)) == 2.1) printf("boinkn"); } &var1 = 0041D000 &var2 = 0041D080 test 1: 89 cycles/oper test 2: 89 cycles/oper test 1: 89 cycles/oper test 2: 89 cycles/oper test 1: 89 cycles/oper test 2: 90 cycles/oper ----------------- Longer stall &var1 = 0041D000 &var2 = 0041D080 test 1: 91 cycles/oper test 2: 91 cycles/oper test 1: 91 cycles/oper test 2: 91 cycles/oper test 1: 91 cycles/oper test 2: 91 cycles/oper [/cpp]
We get a completely different picture
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
So when the affinities are set such that the mutex lies within the same cache of the threads sharing the mutex then we see very good scaling of the access via LOCK xxx.
This is an additional benefit of having the capability of affinity based scheduling (as is available in QuickThread).
Jim Dempsey
Reply
Topic Options
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page