Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Overhead of HLE acquire and release



(also posted as a comment to a blog entry re tsx-tools by Andy)

I've just started playing around with the new TSX feature set.

I wrote a quick test that loops over a lock; xchgl acquire and a movl release, with and without the HLE (xacquire/xrelease) prefixes.
To my surprise, the version with HLE prefixes seems to be ~50% slower?
Is the test invalid/irrelevant for some reason?
Am I doing something wrong or is this expected?



The test was run on a MacBook Air with an i7-4650U 1.7 GHz (Haswell) CPU

tsx-tools reports:
Rolfs-MacBook-Air:tsx-tools ran$ ./has-tsx
RTM: Yes
HLE: Yes
Rolfs-MacBook-Air:tsx-tools ran$

The code enclosed below was compiled with:
Rolfs-MacBook-Air:ran ran$ clang -O4 -o tt tt.c -lc

Rolfs-MacBook-Air:ran ran$ time ./tt 1 100000000

real 0m1.616s
user 0m1.612s
sys 0m0.004s
Rolfs-MacBook-Air:ran ran$ time ./tt 2 100000000

real 0m1.063s
user 0m1.061s
sys 0m0.002s
Rolfs-MacBook-Air:ran ran$

Source code for tt.c is attached.


The two paragraphs at the beginning of section 12.5 unfortunately provide very little guidance as to when HLE could be useful from a performance perspective.

The overhead is "typically" amortized and hidden ... "certain sequences" may appear to exacerbate ... if the critical section is "very small and appear in tight loops" ... "realistic applications" do not "normally" ... The overhead is amortized in "larger" critical sections but will be exposed in "very small" critical sections.

I have been able to find a number of situations in real code where HLE doesn't seem to be applicable, as measured on the i7-4650U. I have yet to find a single situation where I can demonstrate a clear performance advantage for HLE. I suspect that some of the tests I have performed might yield different results on a 4-core CPU or a multi-socket machine.

Maybe my results so far are due to too little real parallelism?

Are there any relevant benchmarks that can be shared with the community?

Jim Dempsey's comment re preemption is obviously still valid. It may be that you would like to take some overhead in some situations to get protection against the preemption problem.

In the meantime, I will continue to look for elision opportunities in the code bases that I have access to.



Hi Rolf,

Same surprise for me: when I use TSX instead of normal locking, my app runs ~7% slower.

I posted some details here:




Elmar, thx for the update.

I'm still trying to get my head around how TSX should be used. So far I've been unable to find a case that demonstrates a substantial improvement over the traditional locking schemes.

I'm also curious as to why there is so little specific info available re TSX performance, latency and intended use cases.

I would assume that Intel will release more info later.




>>I'm still trying to get my head around how TSX should be used.

It may be beneficial to list situations where TSX should not be used, such as situations where a single locked instruction suffices (lock; xadd [loc], val) and where the result is NOT used as a simple mutex. This also includes CAS and DCAS.
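To illustrate that point with compiler builtins (my example, not code from the thread): an atomic counter or a CAS already compiles to a single locked instruction on x86 (lock xadd, lock cmpxchg), so wrapping either in an elided lock could only add overhead:

```c
static long counter = 0;

/* Single atomic read-modify-write: one `lock xadd` on x86.
   Returns the value the counter held before the increment. */
long fetch_add_one(void)
{
    return __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);
}

/* Compare-and-swap: one `lock cmpxchg` on x86.
   Returns nonzero if *p was `expected` and has been set to `desired`. */
int cas(long *p, long expected, long desired)
{
    return __atomic_compare_exchange_n(p, &expected, desired, 0,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}
```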

Rolf, Elmar, I think it would be beneficial to set up a representative test in which the protected transactions do a fair amount of computation (memory access) without pressing the capacity limits of the TSX/HLE buffering system. Keep the runtime between protected regions short, and run with several competing threads. The test is to see, when multiple threads enter the same protected region and modify/read conflicting cache lines, whether:
a) one thread is winner (presumably first to exit)
b) all threads are losers

Hopefully the answer is a.

Jim Dempsey