Intel® ISA Extensions

Studying Intel TSX Performance: strange results

Alexander_K_7
Beginner

Dear all,

I've done a study of Intel TSX performance - its abort cases and a comparison with a spin lock. The study, with references to the source code, is available at http://natsys-lab.blogspot.ru/2013/11/studying-intel-tsx-performance.html .

I see some performance gain for TSX in comparison with the spin lock. However, I still have a few questions:

1. I see a huge jump in transactional aborts when the transaction working set reaches 256 cache lines (Figure 1). That is 16KB, only a quarter of the L1d cache. The workload is single threaded, running strictly on one CPU core, and I didn't use HyperThreading. I know that the cache is 8-way set associative, but I used static memory allocations (which are contiguous in virtual memory and likely contiguous physically), so it's unlikely that I get so many cache-line collisions that only 16KB of cache is effectively available to the transaction. How can the spike be explained? Also, Figure 3 (dependency of execution time on transaction size) shows a strange fluctuation around 256 cache lines: why does the execution time first rise quickly, then drop, and then increase smoothly again? There are no other such fluctuations on the graph.

2. The test with overlapping data has shown very confusing results: TSX doesn't show a performance increase for transactions with little data contention compared with transactions which modify the same data set on different CPUs. Probably I'm wrong with my fallback code and the TSX transaction always finds the lock acquired. But I took special steps (retrying the transaction on _XABORT_RETRY, _XABORT_CONFLICT and _XABORT_EXPLICIT with my abort code events; see the sketch below these questions) to retry the transaction more times instead of falling back to the spin lock. In this scenario I'd expect a smaller number of aborts for transactions with no data overlapping compared with transactions under tight contention, but the abort rates are the same. Why?

3. And finally, I experimented with the fallback condition a lot, and the results in the post are the best I obtained. How can I reduce the number of transaction aborts (I'm especially interested in conflict aborts for transactions with no or very small data overlap) and increase overall transactional concurrency?
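
For reference, here is a simplified sketch of the pattern I mean (the real code in tsx.cc differs in details, and the names here are just illustrative):

#include <immintrin.h>   // _xbegin()/_xend()/_xabort(); compile with -mrtm
#include <atomic>

static std::atomic<int> fallback_lock{0};          // 0 = free, 1 = taken

static void spin_lock()   { while (fallback_lock.exchange(1, std::memory_order_acquire)) _mm_pause(); }
static void spin_unlock() { fallback_lock.store(0, std::memory_order_release); }

// Run critical_section transactionally, retrying on transient aborts and
// falling back to the spin lock after max_retries failed attempts.
template <class CriticalSection>
void run_elided(CriticalSection critical_section, int max_retries = 64)
{
    for (int attempt = 0; attempt < max_retries; ++attempt) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            if (fallback_lock.load(std::memory_order_relaxed))
                _xabort(0xff);               // someone holds the lock: abort explicitly
            critical_section();
            _xend();
            return;
        }
        if (status & _XABORT_CAPACITY)
            break;                           // working set too big, retrying won't help
        if (!(status & (_XABORT_RETRY | _XABORT_CONFLICT | _XABORT_EXPLICIT)))
            break;                           // any other abort: take the lock right away
    }
    spin_lock();
    critical_section();
    spin_unlock();
}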

Many thanks in advance,

Alexander.

9 Replies
jimdempseyatthecove
Honored Contributor III

Caution - speculation....

The aborts, when running a single hardware thread, could be caused by a lack of internal resources (buffer space to hold the transactions) or potentially by something unexpected, such as an interrupt interfering with the thread's execution within the protected region. Running VTune or another high-frequency counter may be responsible for the interruption (the Intel PCM tool?).

Jim Dempsey

Alexander_K_7
Beginner

Hi Jim,

thank you for the reply.

Actually, all the benchmarks were performed without any profiling tools; I used PCM only in a separate run to understand the benchmark results. There are also cases with a very small buffer size but a longer transaction (just more loops). E.g., in Figure 4 the transactions updated only 2 cache lines, but did many updates. Certainly, when we increase the transaction time we also increase the probability of some bad events (e.g. preemption of the current thread).

However, I still don't understand why the transaction size is limited to only a quarter of the L1d cache, and why there is no difference in performance results between overlapping and non-overlapping data updates (in two threads/CPUs).

Andreas_K_Intel
Employee

1. Capacity aborts: You are essentially trying to reverse engineer a detailed model of the cache behavior. The cache behavior of modern high-performance CPUs is complicated. Even if you had such a model (which would necessarily be very simplified and approximate), it's unlikely it would really apply to predicting the transaction success of specific applications, which are usually far more complex than micro benchmarks.

A better approach is to just run a real workload with lock elision and see what happens, then tune it to minimize aborts if needed. This way the real CPU can serve as your model. The point of TSX is not to run micro benchmarks, but to accelerate real applications.

2/3.

The optimization manual describes a range of techniques to improve the lock fallback; see section 12.3.5 in

http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf

In some cases it can also help to add randomized backoffs.

-Andi

Alexander_K_7
Beginner

Hi Andreas,

thank you for your reply. Of course I read the optimization guide while experimenting with TSX, but it gives very generic suggestions which don't help much even for optimizing the microbenchmarks I made...

> In some cases it can also help to add randomized backoffs.

Good point. Also, a combination with fine-grained spin locks as the fallback for transactions would probably give better results; something like the sketch below.
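
This is just an untested idea (not in tsx.cc; names are illustrative): stripe the fallback lock so that transactions touching different cache lines also fall back to different spin locks.

#include <atomic>
#include <cstdint>

// One spin lock per 64-byte stripe, each on its own cache line, so that
// fallbacks for unrelated addresses don't serialize on a single lock.
struct alignas(64) Stripe { std::atomic<int> lock{0}; };
static Stripe stripes[64];

static inline std::atomic<int> &stripe_lock(const void *addr)
{
    return stripes[(reinterpret_cast<std::uintptr_t>(addr) >> 6) % 64].lock;
}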

Andreas_K_Intel
Employee

For RTM the most basic tuning technique is to retry a few times on conflicts and to wait for the lock to become free again before retrying. An additional randomized backoff can also improve things in some cases (but has the disadvantage of introducing magic numbers). This avoids the "Lemming effect" on retries, that is, different threads continuously aborting each other.
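
Roughly along these lines (a sketch only; the constants are the magic numbers mentioned above, and fallback_lock refers to the fallback spin lock from the earlier sketch):

#include <immintrin.h>
#include <algorithm>
#include <atomic>
#include <random>

extern std::atomic<int> fallback_lock;       // the fallback spin lock (0 = free)

// Before retrying after a conflict or an explicit "lock is busy" abort,
// spin outside the transaction until the lock is free again...
static void wait_for_lock_free()
{
    while (fallback_lock.load(std::memory_order_relaxed))
        _mm_pause();
}

// ...and optionally pause for a randomized, exponentially growing interval so
// that the aborted threads don't all retry at the same instant (the "Lemming
// effect").
static void randomized_backoff(int attempt)
{
    static thread_local std::minstd_rand rng{std::random_device{}()};
    int limit = std::uniform_int_distribution<int>(0, 64 << std::min(attempt, 10))(rng);
    for (int i = 0; i < limit; ++i)
        _mm_pause();
}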


Alexander_K_7
Beginner

I've investigated further and here are the results (I've updated the blog post, so the details are available there).

1. There was a mistake in my conclusions: the single-threaded TSX workload is actually limited by the full L1d cache size (32KB on my machine);

2. I tried commenting out the fallback to the spin lock in the overlapping test, so the TSX transactions work entirely on non-overlapping data (the fallback spin lock itself can also be seen as a cache line shared between the CPUs). Unfortunately, I still don't see any dependence of TSX execution time on whether the transactions work with fully separate cache lines or on the same memory area. This is very confusing...

3. When a transaction aborts because the spin lock is acquired, I tried waiting until the spin lock is released; the wait is done outside the transaction. There is a point, at 64 cache lines, where TSX used to become slower than the spin lock; at this point TSX now performs much better (about 30% faster), but it is still slower than the fully spinlock'ed implementation;

4. Probably my randomized fallbacks are not perfect (I would be happy if you could review the code; here it is just in case: https://github.com/krizhanovsky/NatSys-Lab/blob/master/tsx.cc), but it seems they only slow the program down. I fall back to the spin lock after a transaction has aborted somewhere between 0 and 63 times. If I increase the abort counts at which a transaction falls back to the spin lock, performance gets even worse.

Probably there are ways to optimize TSX performance in this particular case. However, TSX performance on non-overlapping data is very strange. Why does it perform almost the same as on shared data? Why is it slower than the spin lock?

jimdempseyatthecove
Honored Contributor III

Alexander,

I think you have to view TSX from the perspective of one of its principal design goals:

To remove the mutex overhead of an uncontested transaction.

This may represent the vast majority of the situations where a mutex is required.

In situations of highly contested access, TSX, a mutex, and even a spin lock are all likely unsuitable. IOW, you will likely need to architect a completely different solution that does not rely on threads interacting through cache/RAM transactions.

Obviously, one would like to have a full understanding of the options available, TSX being one of them, so your queries are valuable. Be mindful that TSX is in its first generation. Future improved generations are sure to follow.

Briefly looking at your code, the only comment I have is that your test system is likely not representative of the systems targeted by TSX. Your system has 2 cores, 4 threads. Though you don't have a server with 8 processors, 8 cores per processor and 2 HT per core, I suggest you expand your test to use all 4 hardware threads. The O/S does not context switch between the HTs in a core, so you can run 4 threads concurrently with little context interference (system clock ticks, etc.).

Jim Dempsey

Andreas_K_Intel
Employee

>To remove the mutex overhead of an uncontested transaction.

The important thing to keep in mind: "uncontested" for lock elision has a different meaning than for conventional locks. With lock elision, contention means contended access to the same 64-byte region (cache line) inside the lock. With conventional locks, it means any use of the lock.
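
As a toy illustration of the difference (a sketch with made-up names, using an RTM retry pattern like the ones earlier in the thread): two threads take the same elided lock, but each increments a counter on its own cache line, so their transactions touch disjoint 64-byte regions and both can commit without ever contending; with a conventional lock, one thread would always wait for the other.

#include <immintrin.h>
#include <atomic>

struct alignas(64) PaddedCounter { long value = 0; };   // one cache line per counter
static PaddedCounter counters[2];
static std::atomic<int> elided_lock{0};                  // fallback lock, normally stays free

void worker(int thread_id)
{
    for (int i = 0; i < 100000; ++i) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            if (elided_lock.load(std::memory_order_relaxed))
                _xabort(0xff);                  // lock really taken: give up the transaction
            ++counters[thread_id].value;        // disjoint cache lines: no conflict aborts
            _xend();
        }
        // Retry/fallback path omitted for brevity; see the sketches earlier in the thread.
    }
}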

jimdempseyatthecove
Honored Contributor III

Andreas,

Lock elision is more involved than what you state, and conventional locks have side effects you haven't mentioned. The design goal of "lock elision" is stated in its name: (under the right circumstances) lock removal. The implementation is cache-line granular, since that is the granularity of the cache coherency system. The overhead of a lock, for example a LOCK; XADD, can be hundreds of times larger than the execution time of the code inside the protected region (e.g. accumulation into a protected/shared variable). When access to the protected area is uncontested, having TSX/RTM remove the lock reclaims the overhead of the lock. Under contested circumstances, TSX/RTM may be more costly than a conventional lock. Therefore, individual circumstances may dictate the choice of lock. BTW, highly contested protected regions should be reworked to avoid the bottleneck.

Jim Dempsey
