Intel® ISA Extensions

Is Haswell's new transactional memory 'TSX' actually slower than locking?

Elmar
Beginner

Dear all,

I just got my hands on a Haswell system and tried the new TSX extensions, hoping to boost the performance of my multi-threaded app.

But what I found was rather shocking. The numbers below are execution times in microseconds:

A) 29122 - App running with a single thread and without any locking

B) 42762 - Same as A), but with an XBEGIN/XEND pair (with nothing in between) added at the critical sections. So even though I don't run any real transaction yet, the code takes 46% longer to execute. That's much more than I had expected.

C) 50410 - Like B), but now the XBEGIN/XEND is placed around the critical section, and a lock is acquired if the transaction fails (i.e. the way it's meant to be used, still running a single thread). Locking is done with a pause/lock cmpxchg spinloop.

D) 47591 - Dropping the XBEGIN/XEND and just using the old-fashioned pause/lock cmpxchg spinloop to protect each critical section. So TSX C) is slower than acquiring a lock in a single thread.

E) 10697 - Like C), using TSX but running 8 threads

F) 9935 - Like D), using old-fashioned locking and running 8 threads

Summary: no matter whether I use 1 or 8 threads, TSX takes ~7% longer than normal locking. Of course I am not talking about naive locking with one single global lock, but about performance-tuned fine-grained locking.

Performance with XACQUIRE/XRELEASE prefixes was even worse, BTW.

So the question: have any of you managed to improve performance with TSX over old-fashioned (but fine-grained) locking?

Greetings,

Elmar

Andreas_K_Intel
Employee

If you have very small critical sections, transactions can be slightly slower than locking. With larger critical sections this effect usually disappears. TSX tends to win per thread when you use larger critical sections ("lock coarsening") and thus fewer lock acquisitions.

In general, very tight critical sections can perform poorly even without lock elision when there is any contention, as the communication overhead starts to dominate. For more details see

http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf

SergeyKostrov
Valued Contributor II
>>...If you have very small critical sections transactions can be slightly slower than locking. On larger critical sections
>>this effect usually disappears...

How do you define small critical sections or larger critical sections? As a number of code lines or something else?
Andreas_K_Intel
Employee

Sergey Kostrov wrote:

>>...If you have very small critical sections transactions can be slightly slower than locking. On larger critical sections
>>this effect usually disappears...

How do you define small critical sections or larger critical sections? As a number of code lines or something else?

Instructions, time, memory accesses.

For contended conventional locks we usually recommend critical sections of at least 200 ns.

SergeyKostrov
Valued Contributor II

Andreas, you're not being completely specific: you have not said whether 200 ns refers to small or large critical sections.

jimdempseyatthecove
Honored Contributor III

A good candidate for TSX/HLE would be inserting or deleting a non-head/non-tail node of a linked list where formerly you had a single mutex locking the entire list. Note that each node would need an in-use counter to avoid a race condition between finding/acquiring a node and deleting it. With TSX/HLE, non-adjacent (non-head/tail) inserts and deletes will not interfere with each other, provided you do not also update a node count in the list header inside the same critical section (if you did, every insert/delete would conflict on that counter).

Jim Dempsey
