
TSX results - please explain

hasyed
Beginner

I am using Roman Dementiev's code as a base and modifying it to determine if TSX operations are behaving according to expectations.

https://software.intel.com/en-us/blogs/2012/11/06/exploring-intel-transactional-synchronization-extensions-with-intel-software

I am counting the number of times that xbegin() returns successfully, the number of times it aborts, and the number of times the fallback lock is taken.

If I increase the number of threads launched, the number of fallback locks increases proportionally, and so does the number of aborts. Originally, two threads were launched; I added the following code to launch two more:


HANDLE thread3 = (HANDLE) _beginthreadex(NULL, 0, &thread_worker, (void *)3, 0, NULL);
HANDLE thread4 = (HANDLE) _beginthreadex(NULL, 0, &thread_worker, (void *)4, 0, NULL);
..
..
WaitForSingleObject( thread3, INFINITE );
WaitForSingleObject( thread4, INFINITE );

--- --- ---

I was hoping to see a similar effect when I increase the size of the Accounts array.
I changed the line below and experimented with sizes of 10K, 100K and 500K:

Accounts.resize(Accounts.size() + 1000, 0);

Surprisingly, the number of successful xbegin() calls, the number of aborts, and the number of fallback locks don't change.

I was expecting that as the array size increases, there would be an increase in successful xbegin() operations and a decrease in aborts and fallback locks. Increasing the array size lowers the chance of different threads accessing the same element, so one should expect more successful xbegin() returns.
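(As a rough sanity check on that intuition — my own back-of-the-envelope numbers, not something from the code: if each of T threads touches one uniformly random element of an N-element array per transaction, and we ignore cache-line granularity, the chance that a transaction conflicts with another thread is about 1 - (1 - 1/N)^(T-1), which is roughly (T-1)/N for large N. For T = 2 that is about 0.01% at N = 10K and 0.0002% at N = 500K, so the abort rate should drop sharply as N grows.)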

I am hoping to get an explanation of why I am seeing this behavior.

 

Roman_D_Intel
Employee

Hi hasyed,

Are you running it in the SDE emulator or natively on a real Intel TSX-capable processor?

Thanks,

Roman

andysem
New Contributor III

"I was expecting that as the array size increases, there would be an increase in successful xbegin() operations and a decrease in aborts and fallback locks. Increasing the array size lowers the chance of different threads accessing the same element, so one should expect more successful xbegin() returns."

I think this expectation is not quite correct. AFAIU, on its first exit _xbegin() returns a "started" status if the transaction can be started at all; it does not account for other transactions in flight on other cores, including their working sets. What matters is whether the transaction completes successfully (i.e. whether _xend() is reached and the transaction commits). For that to happen there are multiple conditions, one of which is the working set size: if it exceeds the hardware limit, the transaction is always aborted. Also, depending on the memory access pattern, a larger working set can mean a higher probability of conflicts between the transaction and other memory accesses, including other transactions. And working on larger data typically takes longer, which increases the probability of preemption and, consequently, of a transaction abort. So I'd say one should generally expect the rate of successful transactions to drop as the working set gets bigger.
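To make the distinction concrete, here is a minimal sketch using the MSVC RTM intrinsics (my own illustration, not code from the blog). The attempt counter is incremented before _xbegin(), so it survives aborts; the commit counter is incremented after _xend(), outside the transaction. Note that a counter incremented inside the transaction would be rolled back on abort, so it would effectively count commits, not starts:

#include <immintrin.h>

// Counters are per-thread and passed by reference: no sharing, no extra conflicts.
void transact_once(long long & attempts, long long & commits, long long & aborts)
{
   ++attempts;                    // before _xbegin(): survives an abort
   unsigned status = _xbegin();
   if (status == _XBEGIN_STARTED)
   {
      // ... transactional reads/writes go here ...
      _xend();                    // reaching the next line means the transaction committed
      ++commits;                  // after _xend(): outside the transaction
   }
   else
   {
      ++aborts;                   // abort path: _xbegin() returned the abort status flags
   }
}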

 

Roman_D_Intel
Employee

The small number of loops (10000) in the test was chosen to suit the speed of the emulator. The worker threads probably do not overlap much on real hardware because they are too short-lived. Could you increase the number of loops significantly (10x, 100x, ...) and see if the array size makes a difference then? It did on my box.

Thanks,

Roman

Roman_D_Intel
Employee

Did you add your own success/abort counting to the code, or did you use CPU performance events to count? The latter is recommended.
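On Haswell the events to look at are RTM_RETIRED.START, RTM_RETIRED.COMMIT and RTM_RETIRED.ABORTED: they count started, committed and aborted transactions in hardware, so the measurement does not disturb the transactions the way in-code counters can. (On Linux, recent perf versions expose aliases for these, e.g. perf stat -e tx-start,tx-commit,tx-abort ./your_app, where your_app is a placeholder for your binary.)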

hasyed
Beginner

I didn't change much of the code; I just added counters:

TransactionScope(SimpleSpinLock & fallBackLock_, int max_retries = 3) : fallBackLock(fallBackLock_)
{
   int nretries = 0;

   while (1)
   {
      ++nretries;
      unsigned status = _xbegin();

      if (status == _XBEGIN_STARTED)
      {
         if (!fallBackLock.isLocked())
         {
            InterlockedIncrement64(&nXbegin);
            return; // successfully started transaction
         }

         // started transaction but someone executes the transaction section
         // non-speculatively (acquired the fall-back lock)
         _xabort(0xff); // abort with code 0xff
      }

      // abort handler

      InterlockedIncrement64(&naborted); // do abort statistics

      //std::cout << "DEBUG: Transaction aborted "<< nretries <<" time(s) with the status "<< status << std::endl;

      // handle _xabort(0xff) from above
      if ((status & _XABORT_EXPLICIT) && _XABORT_CODE(status) == 0xff && !(status & _XABORT_NESTED))
      {
         while (fallBackLock.isLocked())
            _mm_pause(); // wait until lock is free
      }
      else if (!(status & _XABORT_RETRY))
         break; // take the fall-back lock if the retry abort flag is not set

      if (nretries >= max_retries)
         break; // too many retries, take the fall-back lock
   }

   InterlockedIncrement64(&nFallbackLock);
   fallBackLock.lock();
}

Roman_D_Intel
Employee

Is nXbegin shared between threads? I guess so. The increment of a shared variable should cause a lot of conflicts/aborts, so changing the array size becomes irrelevant.

Roman_D_Intel
Employee

BTW: SDE can show you the code lines (with call stacks) that kill the transactions because of conflicts/contention. SDE options for this: -hsw -hle_enabled 1 -rtm-mode full -tsx_stats 1 -tsx_stats_call_stack 1

Sample output:

# STACK INFORMATION FOR CONTENTION ABORT KILLER IP: 0x0000000000400ddf
#-------------------------------------------------------------------------------------------------------------
#               IP                       FUNCTION NAME     FILE NAME                           LINE    COLUMN
0x00007fe4cf526520                        start_thread                                            0        0
0x00000000004015d6                              worker     /root/222222222/111111_tsx.c          56        0
0x0000000000400d78             function1_name              /root/222222222/111111111_tsx.h      148        0
0x0000000000400ddf             function2_name              /root/222222222/111111111_tsx.h      159        0
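A full invocation would look something like this (your_tsx_app.exe is just a placeholder for your binary; everything before the -- goes to SDE, everything after it is your program):

sde -hsw -hle_enabled 1 -rtm-mode full -tsx_stats 1 -tsx_stats_call_stack 1 -- your_tsx_app.exe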

hasyed
Beginner
Yes, nXbegin is shared between threads. "nXbegin" is a global variable, just like "naborted" in your code:

LONGLONG naborted = 0;
LONGLONG nFallbackLock = 0;
LONGLONG nXbegin = 0;
hasyed
Beginner
Roman,

Increasing the shared memory size did not proportionally increase the success rate of xbegin(). As you indicated, this may be due to using a shared counter and incrementing it in the threads. I changed the code to increment different counters in different threads, but the results don't seem to change, i.e. increasing the size of the Accounts array does not increase the success rate of xbegin().

LONGLONG naborted1 = 0;
LONGLONG naborted2 = 0;
LONGLONG nFallbackLock1 = 0;
LONGLONG nFallbackLock2 = 0;
LONGLONG nXbegin1 = 0;
LONGLONG nXbegin2 = 0;

TransactionScope(SimpleSpinLock & fallBackLock_, int threadIndex, int max_retries = 3) : fallBackLock(fallBackLock_)
{
   int nretries = 0;

   while (1)
   {
      ++nretries;
      unsigned status = _xbegin();

      if (status == _XBEGIN_STARTED)
      {
         if (!fallBackLock.isLocked())
         {
            if (threadIndex == 1)
               InterlockedIncrement64(&nXbegin1);
            else
               InterlockedIncrement64(&nXbegin2);
            return; // successfully started transaction
         }
...
...

"1" is passed as the thread_worker argument when creating the 1st thread and "2" when creating the 2nd:

HANDLE thread1 = (HANDLE) _beginthreadex(NULL, 0, &thread_worker, (void *)1, 0, NULL);
hasyed
Beginner

andysem,

"So I'd say one should generally expect reduction of successful transactions rate as the working set gets bigger."

My results indicate that the number of successful transactions remains the same even after varying the size of shared memory greatly (between 1K and 500K DWORDS). One could attribute that to incrementing the shared counter, but as I indicated in the previous post, incrementing different counters did not produce different results.

Roman_D_Intel
Employee

Conflict detection in the Haswell microarchitecture is based on the cache coherence protocol, which works at the granularity of a cache line (64 bytes). So from the processor's point of view your statistics counters are still shared: they all sit in the same 64-byte cache line. This is known as "false sharing". To avoid it, you can add padding, e.g. char pad[64 - sizeof(LONGLONG)], between the items. You can also move the counter increment past the xend instruction (after the transaction has successfully committed, i.e. outside of the transaction). Conflict debugging is best done with SDE...
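For example (a sketch; PaddedCounter is my naming, not something from the sample code):

// Give each counter its own 64-byte cache line so that increments from
// different threads can never touch the same line (no false sharing).
__declspec(align(64)) struct PaddedCounter
{
   LONGLONG value;
   char pad[64 - sizeof(LONGLONG)];
};

PaddedCounter nXbegin1 = { 0 };   // updated only by thread 1
PaddedCounter nXbegin2 = { 0 };   // updated only by thread 2

// usage: InterlockedIncrement64(&nXbegin1.value);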

hasyed
Beginner

Roman,

I made the counters local variables of the thread, as you had initially suggested, and now the results are much more in line with my expectations.

I do get unexpected results every now and then, where most of the transactions fail or most of them succeed, but if I average it out, the number of successful transactions seems to be proportional to the size of the shared memory.

Snippets of my changes:
--- --- ---

...
...

TransactionScope(SimpleSpinLock & fallBackLock_, LONGLONG *pnXbegin, LONGLONG *pnFallbackLock, int max_retries = 3) : fallBackLock(fallBackLock_)

...
...

unsigned __stdcall thread_worker(void * arg)
{
   int thread_nr = (int) arg;
   std::tr1::minstd_rand myRand(thread_nr);
   long int loops = 100000;
   int index;

   LONGLONG nFallbackLock = 0;
   LONGLONG nXbegin = 0;

   while(--loops)
   {
      {
         TransactionScope guard(globalFallBackLock, &nXbegin, &nFallbackLock); // thread_nr no longer needed: counters are locals now

...
...
      }
   }

   return 0;
}

Thanks for all your feedback.
