TSX vs. LOCK-instruction

Montero__Bonita · ‎09-26-2019

I've developed a LRU-cache that ist mostly lock-free, i.e. fetches that result in a cache hit can update the LRU-links in parallel.
A fetch that results in a LRU-hit to be spilled to the top of the LRU-links needs mostly at least three 64-bit CMPXCHGs on a 64-bit-system when there's no collision with other threads which want to update this three values.
So what I'm thinking about is if it might me faster to replace these CMPXCHGs with RTM transactional memory. So has anyone here some performance-metrics if that could be advantageous?

Montero__Bonita · ‎09-27-2019

Can anyone here compile and run this code to check the performance of a simlple TSX-update on a size_t-variable against an atomic LOCK XADD or LOCK CMPXCHG? I only have a Ryzen and an older Xeon without TSX.

#if defined(_MSC_VER)
	#include <Windows.h>
	#include <intrin.h>
#elif defined(__unix__)
	#include <sys/sysinfo.h>
	#include <sched.h>
	#include <pthread.h>
	#include <immintrin.h>
#endif
#include <iostream>
#include <thread>
#include <cstddef>
#include <atomic>
#include <functional>
#include <chrono>
#include <vector>
#include <cstdlib>
#include <cmath>
#include <array>

bool hasTSX();

using namespace std;
using namespace chrono;

inline
size_t fetchAdd( size_t volatile &v, size_t a )
{
#if defined(_MSC_VER)
	#if defined(_M_X64)
	return (size_t)_InterlockedExchangeAdd64( &(__int64 &)v, (__int64)a );
	#elif defined(_M_IX86)
	return (size_t)_InterlockedExchangeAdd( &(long &)v, (long)a );
	#else
		#error unsupported architecture
	#endif
#elif defined(__GNUC__) || defined(__clang__)
	return __sync_fetch_and_add( &v, a );
#else
		#error unsupported architecture
#endif
}

inline
bool rtmFetchAdd( size_t volatile &v, size_t a )
{
	if( _xbegin() == _XBEGIN_STARTED )
	{
		v += a;
		_xend();
		return true;
	}
	else
		return false;
}

inline
size_t compareExchange( size_t volatile &v, size_t c, size_t x )
{
#if defined(_MSC_VER)
	#if defined(_M_X64)
	return (size_t)_InterlockedCompareExchange64( &(__int64 &)v, (__int64)x, (__int64)c );
	#elif defined(_M_IX86)
	return (size_t)_InterlockedCompareExchange( &(long &)v, (long)x, (long)c );
	#else
		#error unsupported architecture
	#endif
#elif defined(__GNUC__) || defined(__clang__)
	return __sync_val_compare_and_swap( &v, c, x );
#else
		#error unsupported architecture
#endif
}

int main( int argc, char **argv )
{
	if( argc < 2 )
		return -1;
	double nsPerClockCycle = 1.0 / (atof( argv[1] ) * 1.0e9);

	auto thrXadd = []( uint8_t volatile &run, size_t adds, size_t volatile &atm, atomic<size_t> &misses )
	{
		while( !run );
		for( size_t i = adds; i; --i )
			fetchAdd( atm, 1 );
	};
	auto thrXchg = []( uint8_t volatile &run, size_t adds, size_t volatile &atm, atomic<size_t> &misses )
	{
		while( !run );
		size_t missed = 0;
		for( size_t i = adds, cmp = atm; i; --i )
		{
			for( size_t res; ; )
				if( (res = compareExchange( atm, cmp, cmp + 1 )) == cmp )
				{
					cmp = cmp + 1;
					break;
				}
				else
					cmp = res,
					++missed;
		}
		misses.fetch_add( missed );
	};
	auto rtmAdd = []( uint8_t volatile &run, size_t adds, size_t volatile &atm, atomic<size_t> &misses )
	{
		while( !run );
		size_t missed = 0;
		for( size_t i = adds; i; --i )
			while( !rtmFetchAdd( atm, 1 ) )
				++missed;
		misses.fetch_add( missed );
	};
	using threadfunc = void (*)( uint8_t volatile &, size_t, size_t volatile &, atomic<size_t> & );
	array<threadfunc, 3>   atf;
	array<char const *, 3> threadDescr;
	size_t                 nTests;
	size_t const           ADDS = 10'000'000;
	unsigned               nProcessors = thread::hardware_concurrency();

	atf[0]         = thrXadd;
	atf[1]         = thrXchg;
	atf[2]         = rtmAdd;
	threadDescr[0] = "xadd-thread";
	threadDescr[1] = "cmpxchge-thread";
	threadDescr[2] = "rtm-thread";
	nTests         = hasTSX() ? atf.size() : atf.size() - 1;

	for( size_t m = 0; m != nTests; ++m )
	{
		cout << threadDescr << ":" << endl;
		for( unsigned nThreads = 1; nThreads <= nProcessors; ++nThreads )
		{
			atomic<size_t> misses( 0 );
			uint8_t        run = false;
			size_t         atm;
			vector<thread> threads;
			for( unsigned i = 0; i != nThreads; ++i )
			{
				threads.emplace_back( atf, ref( run ), ADDS, ref( atm ), ref( misses ) );
#if defined(_MSC_VER)
				SetThreadAffinityMask( threads.native_handle(), (DWORD_PTR)1 << i );
#elif defined(__unix__)
				cpu_set_t cpuset;
				CPU_ZERO(&cpuset);
				CPU_SET(i, &cpuset);
				pthread_setaffinity_np( threads.native_handle(), sizeof cpuset, &cpuset );
#endif
			}
			time_point<high_resolution_clock> start = high_resolution_clock::now();
			run = true;
			for( unsigned i = 0; i != nThreads; ++i )
				threads.join();
			uint64_t ns = (uint64_t)duration_cast<nanoseconds>( high_resolution_clock::now() - start ).count();;

			double nsPerAdd = (double)ns / nThreads / ADDS / 1.0e9;
			cout << "threads: " << nThreads << " cycles: " << nsPerAdd / nsPerClockCycle << " misses-ratio: " << (int)(100.0 * (size_t)misses / nThreads / ADDS) << "%" << endl;
		}
		cout << endl;
	}
}

bool hasTSX()
{
#if defined(_MSC_VER)
	int regs[4];
	__cpuidex( regs, 7, 0 );
	return regs[1] & (1 << 11);
#else
	return true;
#endif
}

The code tests for TSX-capabilities only on Windows, so for the third test the code might crash on Linux-systems.
You need to run it with the base-clock of your processor as a parameter, i.e. "4.0" if your CPU has 4.0GHz.
With gcc or clang, the code has to be compiled with -mrtm (to enable RTM-support).

jimdempseyatthecove · ‎09-29-2019

Your test code is not representative of how you describe your application code.

... needs mostly at least three 64-bit CMPXCHGs on a 64-bit-system when there's no collision with other threads which want to update this three values.
So what I'm thinking about is if it might me faster to replace these CMPXCHGs with RTM transactional memory

Your test code is measuring the RTM overhead for each of the three equivalent CMPXCHGs (rtmFetchAdd) as opposed to a correct (my assumption based on description) method of placing the RTM region around all three of the three equivalent CMPXCHGs (3x v+=a).

You should be using something like:

if(_xbegin() == _XBEGIN_STARTED)
{
    diddleOnce();
    diddleTwice();
    diddleThrice();
   _xend();
   return true; // all three operations completed successfully
} else {
return false; // Collision
}

Jim Dempsey

Montero__Bonita · ‎09-29-2019

jimdempseyatthecove (Blackbelt) wrote:
Your test code is not representative of how you describe your application code.
... needs mostly at least three 64-bit CMPXCHGs on a 64-bit-system when there's no collision with other threads which want to update this three values.

Two of these act on a shared reader, multiple writer mutex with special ordering for readers and writers and prioritization of writers. And one is used for some lock-free purpose. So all CMPXCHGs have to be their own transaction. I simply want to know whether a transaction on a single value in memory could be faster than LOCK CMPXCHG.

Montero__Bonita · ‎09-29-2019

I got some horrible results from someone with a dual-core RTM-enabled CPU:

~ >>> clang -O3 -mrtm test.cpp -o test2 -lstdc++ -lpthread                                                                                                                                                                                  
~ >>> ./test2 3.0                                                                                                                                                                                                                           
xadd-thread:
threads: 1 cycles: 21.2706 misses-ratio: 0%
threads: 2 cycles: 47.6577 misses-ratio: 0%
threads: 3 cycles: 52.6984 misses-ratio: 0%
threads: 4 cycles: 68.9667 misses-ratio: 0%

cmpxchge-thread:
threads: 1 cycles: 19.0154 misses-ratio: 0%
threads: 2 cycles: 68.8116 misses-ratio: 37%
threads: 3 cycles: 76.2158 misses-ratio: 89%
threads: 4 cycles: 137.463 misses-ratio: 162%

rtm-thread:
threads: 1 cycles: 169.928 misses-ratio: 0%
threads: 2 cycles: 11993 misses-ratio: 7012%
threads: 3 cycles: 6693.75 misses-ratio: 5003%
threads: 4 cycles: 7583.44 misses-ratio: 7252%

First, either the time setting up or closing a transaction is very expensive so that the overall transaction has a high cost.
Second, the rate of collisions is disgusting, even on this dual-core-CPU. Imagine having about 70 retries incrementing just a single size_t!
So if RTM has such high costs it seems only good for cases in which they prevent a kernel-wait, which would induce even higher costs.

jimdempseyatthecove · ‎09-30-2019

Looks bad for single CMPXCHG. However you still need to examine your complete sequence of CMPXCHG's to successfully pass through your Lock-Free (-Wait-Free?) code.

It is not correct to assume the sum of three CMPXCHG's is the true cost of success (and cost of failure). This said, this RTM test does not look favorable.

Jim Dempsey