
TSX results - please explain

hasyed
Beginner

I am using Roman Dementiev's code as a base and modifying it to determine if TSX operations are behaving according to expectations.

https://software.intel.com/en-us/blogs/2012/11/06/exploring-intel-transactional-synchronization-extensions-with-intel-software

I am counting the number of times that xbegin() returns successfully, the number of times it aborts, and the number of times the fallback lock is taken.

If I increase the number of threads launched, the number of fallback locks increases proportionally, and so does the number of aborts. Originally, two threads were launched; I added the following code to launch two more:


HANDLE thread3 = (HANDLE) _beginthreadex(NULL, 0, &thread_worker, (void *)3, 0, NULL);
HANDLE thread4 = (HANDLE) _beginthreadex(NULL, 0, &thread_worker, (void *)4, 0, NULL);
..
..
WaitForSingleObject( thread3, INFINITE );
WaitForSingleObject( thread4, INFINITE );

--- --- ---

I was hoping to see a similar effect when I increase the size of the Accounts array.
I changed the line below and experimented with sizes of 10K, 100K and 500K:

Accounts.resize(Accounts.size() + 1000, 0);

Surprisingly, the number of successful xbegin() calls, the number of aborts, and the number of fallback locks don't change.

I was expecting that as the array size increases, there would be an increase in successful xbegin() operations and a decrease in aborts and fallback locks. Increasing the array size lowers the chance of different threads accessing the same element, so one should expect more successful xbegin() returns.
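(As a rough sanity check on that intuition — my own back-of-the-envelope numbers, not something from the code: if each of T threads touches one uniformly random element of an N-element array per transaction, and we ignore cache-line granularity, the chance that a transaction conflicts with another thread is about 1 - (1 - 1/N)^(T-1), which is roughly (T-1)/N for large N. For T = 2 that is about 0.01% at N = 10K and 0.0002% at N = 500K, so the abort rate should drop sharply as N grows.)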

I am hoping to get an explanation of why I am seeing this behavior.

 

Roman_D_Intel
Employee

Hi hasyed,

Are you running it in the SDE emulator or natively on a real Intel TSX-capable processor?

Thanks,

Roman

andysem
New Contributor III

"I was expecting that as the array size increases, there would be an increase in successful xbegin() operations and a decrease in aborts and fallback locks. Increasing the array size lowers the chance of different threads accessing the same element, so one should expect more successful xbegin() returns."

I think this expectation is not quite correct. AFAIU, on its first exit _xbegin() returns a "started" status if the transaction can be started at all; it does not account for other transactions in flight on other cores, including their working sets. What matters is whether the transaction completes successfully (i.e. whether _xend() is reached and the transaction commits). For that to happen there are multiple conditions, one of which is the working set size: if it exceeds the hardware limit, the transaction is always aborted. Also, depending on the memory access pattern, a larger working set can mean a higher probability of conflicts between the transaction and other memory accesses, including other transactions. And working on larger data typically takes longer, which increases the probability of preemption and, consequently, of a transaction abort. So I'd say one should generally expect the rate of successful transactions to drop as the working set gets bigger.
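To make the distinction concrete, here is a minimal sketch using the MSVC RTM intrinsics (my own illustration, not code from the blog). The attempt counter is incremented before _xbegin(), so it survives aborts; the commit counter is incremented after _xend(), outside the transaction. Note that a counter incremented inside the transaction would be rolled back on abort, so it would effectively count commits, not starts:

#include <immintrin.h>

// Counters are per-thread and passed by reference: no sharing, no extra conflicts.
void transact_once(long long & attempts, long long & commits, long long & aborts)
{
   ++attempts;                    // before _xbegin(): survives an abort
   unsigned status = _xbegin();
   if (status == _XBEGIN_STARTED)
   {
      // ... transactional reads/writes go here ...
      _xend();                    // reaching the next line means the transaction committed
      ++commits;                  // after _xend(): outside the transaction
   }
   else
   {
      ++aborts;                   // abort path: _xbegin() returned the abort status flags
   }
}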

 

Roman_D_Intel
Employee

The small number of loops (10000) in the test was chosen to suit the speed of the emulator. The worker threads probably do not overlap much on real hardware because they are too short-lived. Could you increase the number of loops significantly (10x, 100x, ...) and see if the array size makes a difference then? It did on my box.

Thanks,

Roman

Roman_D_Intel
Employee

Did you add your own success/abort counting to the code, or did you use CPU performance events to count? The latter is recommended.
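On Haswell the events to look at are RTM_RETIRED.START, RTM_RETIRED.COMMIT and RTM_RETIRED.ABORTED: they count started, committed and aborted transactions in hardware, so the measurement does not disturb the transactions the way in-code counters can. (On Linux, recent perf versions expose aliases for these, e.g. perf stat -e tx-start,tx-commit,tx-abort ./your_app, where your_app is a placeholder for your binary.)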

hasyed
Beginner

I didn't change much of the code; I just added counters:

TransactionScope(SimpleSpinLock & fallBackLock_, int max_retries = 3) : fallBackLock(fallBackLock_)
{
   int nretries = 0;

   while (1)
   {
      ++nretries;
      unsigned status = _xbegin();

      if (status == _XBEGIN_STARTED)
      {
         if (!fallBackLock.isLocked())
         {
            InterlockedIncrement64(&nXbegin);
            return; // successfully started transaction
         }

         // started transaction but someone executes the transaction section
         // non-speculatively (acquired the fall-back lock)
         _xabort(0xff); // abort with code 0xff
      }

      // abort handler

      InterlockedIncrement64(&naborted); // do abort statistics

      //std::cout << "DEBUG: Transaction aborted "<< nretries <<" time(s) with the status "<< status << std::endl;

      // handle _xabort(0xff) from above
      if ((status & _XABORT_EXPLICIT) && _XABORT_CODE(status) == 0xff && !(status & _XABORT_NESTED))
      {
         while (fallBackLock.isLocked())
            _mm_pause(); // wait until lock is free
      }
      else if (!(status & _XABORT_RETRY))
         break; // take the fall-back lock if the retry abort flag is not set

      if (nretries >= max_retries)
         break; // too many retries, take the fall-back lock
   }

   InterlockedIncrement64(&nFallbackLock);
   fallBackLock.lock();
}

Roman_D_Intel
Employee

Is nXbegin shared between threads? I guess so. The increment of a shared variable should cause a lot of conflicts/aborts, so changing the array size becomes irrelevant.

Roman_D_Intel
Employee

BTW: SDE can show you the code lines (with call stacks) that kill the transactions because of conflicts/contention. SDE options for this: -hsw -hle_enabled 1 -rtm-mode full -tsx_stats 1 -tsx_stats_call_stack 1

Sample output:

# STACK INFORMATION FOR CONTENTION ABORT KILLER IP: 0x0000000000400ddf
#-------------------------------------------------------------------------------------------------------------
#               IP                       FUNCTION NAME     FILE NAME                           LINE    COLUMN
0x00007fe4cf526520                        start_thread                                            0        0
0x00000000004015d6                              worker     /root/222222222/111111_tsx.c          56        0
0x0000000000400d78             function1_name              /root/222222222/111111111_tsx.h      148        0
0x0000000000400ddf             function2_name              /root/222222222/111111111_tsx.h      159        0
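A full invocation would look something like this (your_tsx_app.exe is just a placeholder for your binary; everything before the -- goes to SDE, everything after it is your program):

sde -hsw -hle_enabled 1 -rtm-mode full -tsx_stats 1 -tsx_stats_call_stack 1 -- your_tsx_app.exe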

hasyed
Beginner
Yes, nXbegin is shared between threads. "nXbegin" is a global variable, just like "naborted" in your code:

LONGLONG naborted = 0;
LONGLONG nFallbackLock = 0;
LONGLONG nXbegin = 0;
hasyed
Beginner
Roman,

Increasing the shared memory size did not proportionally increase the success rate of xbegin(). As you indicated, this may be due to using a shared counter and incrementing it in the threads. I changed the code to increment different counters in different threads, but the results don't seem to change, i.e. increasing the size of the Accounts array does not increase the success rate of xbegin().

LONGLONG naborted1 = 0;
LONGLONG naborted2 = 0;
LONGLONG nFallbackLock1 = 0;
LONGLONG nFallbackLock2 = 0;
LONGLONG nXbegin1 = 0;
LONGLONG nXbegin2 = 0;

TransactionScope(SimpleSpinLock & fallBackLock_, int threadIndex, int max_retries = 3) : fallBackLock(fallBackLock_)
{
   int nretries = 0;

   while (1)
   {
      ++nretries;
      unsigned status = _xbegin();

      if (status == _XBEGIN_STARTED)
      {
         if (!fallBackLock.isLocked())
         {
            if (threadIndex == 1)
               InterlockedIncrement64(&nXbegin1);
            else
               InterlockedIncrement64(&nXbegin2);
            return; // successfully started transaction
         }
...
...

"1" is passed as the thread_worker argument when creating the 1st thread and "2" when creating the 2nd:

HANDLE thread1 = (HANDLE) _beginthreadex(NULL, 0, &thread_worker, (void *)1, 0, NULL);
hasyed
Beginner

andysem,

"So I'd say one should generally expect reduction of successful transactions rate as the working set gets bigger."

My results indicate that the number of successful transactions remains the same even after varying the size of shared memory greatly (between 1K and 500K DWORDS). One could attribute that to incrementing the shared counter, but as I indicated in the previous post, incrementing different counters did not produce different results.

Roman_D_Intel
Employee

Conflict detection in the Haswell microarchitecture is based on the cache coherence protocol, which works at the granularity of a cache line (64 bytes). So from the processor's point of view your statistics counters are still shared: they all sit in the same 64-byte cache line. This is known as "false sharing". To avoid it, you can add padding, e.g. char pad[64 - sizeof(LONGLONG)], between the items. You can also move the counter increment past the xend instruction (after the transaction has successfully committed, i.e. outside of the transaction). Conflict debugging is best done with SDE...
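For example (a sketch; PaddedCounter is my naming, not something from the sample code):

// Give each counter its own 64-byte cache line so that increments from
// different threads can never touch the same line (no false sharing).
__declspec(align(64)) struct PaddedCounter
{
   LONGLONG value;
   char pad[64 - sizeof(LONGLONG)];
};

PaddedCounter nXbegin1 = { 0 };   // updated only by thread 1
PaddedCounter nXbegin2 = { 0 };   // updated only by thread 2

// usage: InterlockedIncrement64(&nXbegin1.value);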

hasyed
Beginner

Roman,

I made the counters local variables of the thread, as you had initially suggested, and now the results are much more in line with my expectations.

I do get unexpected results every now and then, where most of the transactions fail or most of them succeed, but if I average it out, the number of successful transactions seems to be proportional to the size of the shared memory.

Snippets of my changes:
--- --- ---

...
...

TransactionScope(SimpleSpinLock & fallBackLock_, LONGLONG *pnXbegin, LONGLONG *pnFallbackLock, int max_retries = 3) : fallBackLock(fallBackLock_)

...
...

unsigned __stdcall thread_worker(void * arg)
{
   int thread_nr = (int) arg;
   std::tr1::minstd_rand myRand(thread_nr);
   long int loops = 100000;
   int index;

   LONGLONG nFallbackLock = 0;
   LONGLONG nXbegin = 0;

   while(--loops)
   {
      {
         TransactionScope guard(globalFallBackLock, &nXbegin, &nFallbackLock); // thread_nr no longer needed: counters are locals now

...
...
      }
   }

   return 0;
}

Thanks for all your feedback.
