I am using Roman Dementiev's code as a base and modifying it to determine whether TSX operations behave according to expectations.
I am counting the number of times xbegin() succeeds, the number of times it aborts, and the number of times the fallback lock is taken.
If I increase the number of threads launched, the number of fallback locks increases proportionally, and so does the number of aborts. Originally two threads were launched; I added the following code.
HANDLE thread3 = (HANDLE) _beginthreadex(NULL, 0, &thread_worker, (void *)3, 0, NULL);
HANDLE thread4 = (HANDLE) _beginthreadex(NULL, 0, &thread_worker, (void *)4, 0, NULL);
..
..
WaitForSingleObject( thread3, INFINITE );
WaitForSingleObject( thread4, INFINITE );
--- --- ---
I was hoping to see similar results when I increase the size of the account array.
I changed the code below and experimented with sizes of 10K, 100K and 500K:
Accounts.resize(Accounts.size() + 1000, 0);
Surprisingly, the number of successful xbegin() calls, the number of aborts and the number of fallback locks don't change.
I was expecting that as the account size increases, there would be an increase in successful xbegin() operations and a decrease in aborts and fallback locks. By increasing the array size, the chance of different threads accessing the same element of the array decreases, so one should expect an increase in successful xbegin() returns.
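As a rough sanity check of that expectation (assuming, purely for illustration, that each thread writes one uniformly random element per transaction): the probability that two threads pick the same element of an N-element array is 1/N, so growing the array from 10K to 500K elements should cut element-level conflicts by roughly 50x.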
I am hoping to get an explanation of why I am seeing this behavior.
- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing
--- --- ---
Hi hasyed,
Are you running it in the SDE emulator or natively on a real Intel TSX-capable processor?
Thanks,
Roman
--- --- ---
"I was expecting that as the account size increases, there would be an increase in successful xbegin() operations and a decrease in aborts and fallback locks. By increasing the array size, the chance of different threads accessing the same element of the array decreases, so one should expect an increase in successful xbegin() returns."
I think this expectation is not quite correct. As far as I understand, on its first exit xbegin() returns a positive result if the transaction can be started at all; it does not account for other transactions in flight on other cores, including their working sets. What matters is whether the transaction completes successfully (i.e. whether xend() is reached and results in a commit). For that to happen there are multiple conditions, one of which is the working-set size: if it exceeds the hardware limit, the transaction is always aborted. Also, depending on the memory access pattern, a larger working set may mean a higher probability of conflicts between the transaction and other memory accesses, including other transactions. And working with larger data typically takes longer, which increases the probability of preemption and, consequently, of a transaction abort. So I'd say one should generally expect a reduction in the successful-transaction rate as the working set gets bigger.
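A minimal sketch of the distinction (the counter names here are illustrative, not from the actual code):
unsigned status = _xbegin();
if (status == _XBEGIN_STARTED)
{
    // ... transactional work ...
    _xend();        // the transaction only succeeds if control reaches this commit point
    ++nCommitted;   // counted outside the transaction, after the commit
}
else
{
    // an abort rolled everything back and resumed execution here,
    // with the abort reason encoded in status
    ++nAborted;
}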
--- --- ---
The small number of loops (10000) in the test was chosen to fit the speed of the emulator. The worker threads probably do not overlap much on real hardware because they are too short-lived. Could you increase the number of loops significantly (10x, 100x, ...) and see if the array size makes a difference then? It did on my box.
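For example (assuming the loop count is the loops variable in thread_worker, as in the code posted later in this thread):
long int loops = 10000 * 100;   // 100x the emulator-sized count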
Thanks,
Roman
--- --- ---
Did you add your own success/abort counting to the code, or did you use CPU performance events to count? The latter is recommended.
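For instance, on Haswell the RTM_RETIRED.START, RTM_RETIRED.COMMIT and RTM_RETIRED.ABORTED events count these directly, with no instrumentation perturbing the transactions. On a Linux box with a TSX-aware kernel this can look like the following (./tsx_test stands in for your binary; your build is Windows, where VTune can collect the same events):
perf stat -e tx-start,tx-commit,tx-abort ./tsx_test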
--- --- ---
I didn't change much of the code; I just added counters.
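// Retry loop in the TransactionScope constructor; nXbegin, naborted and
// nFallbackLock are the counters added on top of Roman's original code.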
while (1)
{
++nretries;
unsigned status = _xbegin();
if (status == _XBEGIN_STARTED)
{
if (!fallBackLock.isLocked())
{
InterlockedIncrement64(&nXbegin);
return; // successfully started transaction
}
// started transaction but someone executes the transaction section
// non-speculatively (acquired the fall-back lock)
_xabort(0xff); // abort with code 0xff
}
// abort handler
InterlockedIncrement64(&naborted); // do abort statistics
//std::cout << "DEBUG: Transaction aborted "<< nretries <<" time(s) with the status "<< status << std::endl;
// handle _xabort(0xff) from above
if((status & _XABORT_EXPLICIT) && _XABORT_CODE(status)==0xff && !(status & _XABORT_NESTED))
{
while(fallBackLock.isLocked())
_mm_pause(); // wait until lock is free
}
else if(!(status & _XABORT_RETRY))
break; // take the fall-back lock if the retry abort flag is not set
if(nretries >= max_retries)
break; // too many retries, take the fall-back lock
}
InterlockedIncrement64(&nFallbackLock);
fallBackLock.lock();
}
--- --- ---
Is nXbegin shared between threads? I guess so. The increment of a shared variable should cause a lot of conflicts/aborts, so changing the array size becomes irrelevant.
--- --- ---
BTW: SDE can show you the code lines (with call stacks) that kill the transactions because of conflict/contention. SDE options for this: -hsw -hle_enabled 1 -rtm-mode full -tsx_stats 1 -tsx_stats_call_stack 1
Sample output:
# STACK INFORMATION FOR CONTENTION ABORT KILLER IP: 0x0000000000400ddf
#-------------------------------------------------------------------------------------------------------------
# IP FUNCTION NAME FILE NAME LINE COLUMN
0x00007fe4cf526520 start_thread 0 0
0x00000000004015d6 worker /root/222222222/111111_tsx.c 56 0
0x0000000000400d78 function1_name /root/222222222/111111111_tsx.h 148 0
0x0000000000400ddf function2_name /root/222222222/111111111_tsx.h 159 0
--- --- ---
andysem,
"So I'd say one should generally expect a reduction in the successful-transaction rate as the working set gets bigger."
My results indicate that the number of successful transactions remains the same even after varying the size of shared memory greatly (between 1K and 500K DWORDs). One could attribute that to incrementing the shared counter, but as I indicated in the previous post, incrementing different counters did not produce different results.
--- --- ---
Conflict detection in the Haswell microarchitecture is based on the cache coherency protocol, with a granularity of one cache line (64 bytes). So to the processor your statistics counters are still shared (they all sit in the same 64-byte cache line). This is known as "false sharing". To avoid false sharing you can add padding, e.g. a char[64 - sizeof(LONGLONG)] array, between the items. You can also move the counter increment after the xend instruction (after the transaction has successfully committed, i.e. outside of the transaction). Conflict debugging is best done with SDE...
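A sketch of the padding idea (illustrative names, assuming the LONGLONG counters from the code above):
#include <windows.h>

struct __declspec(align(64)) PaddedCounter
{
    LONGLONG value;
    char pad[64 - sizeof(LONGLONG)];   // fill the rest of the 64-byte cache line
};

// Each counter now occupies its own cache line, so incrementing one
// no longer invalidates the line holding the others.
PaddedCounter nXbegin, naborted, nFallbackLock;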
--- --- ---
Roman,
I made the counters local variables of the thread, as you had initially suggested, and now the results are much more in line with my expectations.
I do get unexpected results every now and then, where most of the transactions fail or most of them succeed, but if I average it out, the number of successful transactions seems to be proportional to the size of shared memory.
Snippets of my changes:
--- --- ---
...
...
TransactionScope(SimpleSpinLock & fallBackLock_, LONGLONG *pnXbegin, LONGLONG *pnFallbackLock, int max_retries = 3) : fallBackLock(fallBackLock_)
...
...
unsigned __stdcall thread_worker(void * arg)
{
int thread_nr = (int)(intptr_t) arg;  // cast via intptr_t so the pointer-to-int narrowing is explicit on 64-bit builds
std::tr1::minstd_rand myRand(thread_nr);
long int loops = 100000;
int index;
LONGLONG nFallbackLock = 0;
LONGLONG nXbegin = 0;
while(--loops)
{
{
TransactionScope guard(globalFallBackLock, &nXbegin, &nFallbackLock, thread_nr);
...
...
}
}
Thanks for all your feedback.