(also posted as a comment to a blog entry re tsx-tools by Andy)
I've just started playing around with the new TSX feature set.
I wrote a quick test with a loop over lock;xchgl and movl with and without HLE prefixes.
To my surprise, the version with HLE prefixes seems to be ~50% slower?
Is the test invalid/irrelevant for some reason?
Am I doing something wrong or is this expected?
The test was run on a MacBook Air with an i7-4650U 1.7 GHz (Haswell) CPU
Rolfs-MacBook-Air:tsx-tools ran$ ./has-tsx
The code enclosed below was compiled with:
Rolfs-MacBook-Air:ran ran$ clang -O4 -o tt tt.c -lc
Rolfs-MacBook-Air:ran ran$ time ./tt 1 100000000
Rolfs-MacBook-Air:ran ran$ time ./tt 2 100000000
Source code for tt.c is attached.
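The attachment is not reproduced here. For readers who want to try something similar, the following is a minimal sketch of the kind of loop described above, not the original tt.c. The function names (`lock_plain`, `lock_hle`, `run_loop`) and the `use_hle` switch are hypothetical. It assumes GCC/clang inline assembly on x86 and emits the HLE hints as raw prefix bytes (0xF2 for XACQUIRE, 0xF3 for XRELEASE), which CPUs without HLE ignore on these instructions.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
/* HLE hints are plain prefix bytes: 0xF2 = XACQUIRE, 0xF3 = XRELEASE. */
#define XACQUIRE ".byte 0xf2; "
#define XRELEASE ".byte 0xf3; "

/* Atomically swap 1 into *l; returns the previous value (0 = lock obtained). */
static inline uint32_t lock_plain(volatile uint32_t *l) {
    uint32_t v = 1;
    __asm__ __volatile__("lock; xchgl %0, %1" : "+r"(v), "+m"(*l) : : "memory");
    return v;
}
static inline void unlock_plain(volatile uint32_t *l) {
    __asm__ __volatile__("movl $0, %0" : "=m"(*l) : : "memory");
}
/* Same instructions, with the HLE prefix bytes in front. */
static inline uint32_t lock_hle(volatile uint32_t *l) {
    uint32_t v = 1;
    __asm__ __volatile__(XACQUIRE "lock; xchgl %0, %1"
                         : "+r"(v), "+m"(*l) : : "memory");
    return v;
}
static inline void unlock_hle(volatile uint32_t *l) {
    __asm__ __volatile__(XRELEASE "movl $0, %0" : "=m"(*l) : : "memory");
}
#else
/* Fallback so the sketch compiles on other targets; no HLE hint exists here. */
static inline uint32_t lock_plain(volatile uint32_t *l) {
    return __atomic_exchange_n(l, 1u, __ATOMIC_ACQUIRE);
}
static inline void unlock_plain(volatile uint32_t *l) {
    __atomic_store_n(l, 0u, __ATOMIC_RELEASE);
}
#define lock_hle lock_plain
#define unlock_hle unlock_plain
#endif

/* Take and release an uncontended lock `iters` times in a tight loop.
   Returns how many acquisitions saw the lock free (should equal iters). */
static uint64_t run_loop(int use_hle, uint64_t iters) {
    volatile uint32_t lock = 0;
    uint64_t acquired = 0;
    for (uint64_t i = 0; i < iters; i++) {
        uint32_t old = use_hle ? lock_hle(&lock) : lock_plain(&lock);
        if (old == 0)
            acquired++;
        if (use_hle)
            unlock_hle(&lock);
        else
            unlock_plain(&lock);
    }
    assert(lock == 0);
    return acquired;
}
```

Calling `run_loop(0, n)` and `run_loop(1, n)` from a small `main` and timing each with `time`, as in the runs above, gives the plain and HLE-prefixed variants. Note that the critical section here is empty, which is the "very small critical section in a tight loop" case.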
The two paragraphs at the beginning of section 12.5 unfortunately provide very little guidance as to when HLE could be useful from a performance perspective.
The overhead is "typically" amortized and hidden ... "certain sequences" may appear to exacerbate ... if the critical section is "very small and appear in tight loops" ... "realistic applications" do not "normally" ... The overhead is amortized in "larger" critical sections but will be exposed in "very small" critical sections.
I have been able to find a number of situations in real code where HLE doesn't seem to be applicable, as measured on the i7-4650. I have yet to find a single situation where I can demonstrate a clear performance advantage of HLE. I suspect that some of the tests that I have performed may yield different results on a 4-core CPU or on a multi-socket machine.
Maybe my results so far are due to too little real parallelism?
Are there any relevant benchmarks that can be shared with the community?
Jim Dempsey's comment re preemption is obviously still valid. It may be that you would like to take some overhead in some situations to get protection against the preemption problem.
In the meantime, I will continue to look for elision opportunities in the code bases that I have access to.
Elmar, thx for the update.
I'm still trying to get my head around how TSX should be used. So far I've been unable to find a case that demonstrates a substantial improvement over the traditional locking schemes.
I'm also curious as to why there is so little specific info available re TSX performance, latency and intended use cases.
I would assume that Intel will release more info later.
>>I'm still trying to get my head around how TSX should be used.
It may be beneficial to list situations where TSX should not be used, such as situations where a single locked instruction suffices (lock; xadd [loc],val) and the result is NOT used as a simple mutex. This also includes CAS and DCAS.
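As an illustration of that case, here is a minimal sketch using C11 atomics, where the whole update is one atomic read-modify-write and there is no critical section to elide. The names `counter`, `add_and_fetch_old` and `try_claim` are hypothetical. On x86, compilers typically turn these into a single lock-prefixed instruction (an xadd or cmpxchg), though that depends on the compiler and target.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical shared counter updated by many threads. */
static atomic_long counter;

/* Single atomic add; returns the value held before the add.
   The result is used as data, not as a mutex. */
static long add_and_fetch_old(long val) {
    return atomic_fetch_add_explicit(&counter, val, memory_order_relaxed);
}

/* Single CAS: set *owner to `me` only if it is currently 0 (unowned).
   Returns 1 on success, 0 if some other value was already there. */
static int try_claim(atomic_int *owner, int me) {
    int expected = 0;
    assert(me != 0); /* 0 is reserved to mean "unowned" in this sketch */
    return atomic_compare_exchange_strong(owner, &expected, me);
}
```

For example, `add_and_fetch_old(5)` on a fresh counter returns 0 and leaves the counter at 5.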
Rolf, Elmar, I think it would be beneficial to set up a representative test in which the protected transactions do a fair amount of computation (memory access) without pressing the limits of the TSX/HLE buffering system. Keep the run time between protected regions short. Run with several competing threads. The test is to see what happens when multiple threads enter the same protected region and modify/read conflicting cache lines:
a) one thread is winner (presumably first to exit)
b) all threads are losers
Hopefully the answer is a.
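A possible starting point for such a test is sketched below, assuming POSIX threads and C11 atomics. It is only a baseline harness with an ordinary test-and-set spinlock: every name in it (`run_contended`, `SLOTS`, `MAX_THREADS`, and so on) is hypothetical, and the HLE-prefixed acquire and release would still have to be substituted into `lock_acquire` and `lock_release` to compare the two variants. All threads write the same array inside the protected region, so they always conflict, and the time between protected regions is essentially zero.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAX_THREADS 64
#define SLOTS 8 /* words touched per critical section (arbitrary choice) */

static atomic_uint lock_word;  /* 0 = free, 1 = held */
static uint64_t shared[SLOTS]; /* data every thread modifies under the lock */

/* Plain test-and-test-and-set spinlock; an HLE variant would replace these. */
static void lock_acquire(void) {
    while (atomic_exchange_explicit(&lock_word, 1u, memory_order_acquire)) {
        while (atomic_load_explicit(&lock_word, memory_order_relaxed))
            ; /* spin on a plain read until the lock looks free */
    }
}
static void lock_release(void) {
    atomic_store_explicit(&lock_word, 0u, memory_order_release);
}

/* Each worker enters the protected region `iters` times back to back. */
static void *worker(void *arg) {
    uint64_t iters = *(const uint64_t *)arg;
    for (uint64_t i = 0; i < iters; i++) {
        lock_acquire();
        for (int s = 0; s < SLOTS; s++)
            shared[s]++;
        lock_release();
    }
    return NULL;
}

/* Run `nthreads` competing workers; returns the final value of shared[0],
   which should be nthreads * iters if mutual exclusion held. */
static uint64_t run_contended(int nthreads, uint64_t iters) {
    pthread_t tid[MAX_THREADS];
    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    atomic_store(&lock_word, 0u);
    for (int s = 0; s < SLOTS; s++)
        shared[s] = 0;
    for (int t = 0; t < nthreads; t++) {
        int rc = pthread_create(&tid[t], NULL, worker, &iters);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
    for (int s = 1; s < SLOTS; s++)
        assert(shared[s] == shared[0]);
    return shared[0];
}
```

Timing `run_contended` for 1, 2, 4, ... threads with the plain and the elided lock, and comparing total throughput, would be one way to tell outcome a) from outcome b). The harness itself only checks correctness of the final counts.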