<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Your test code is not in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/TSX-vs-LOCK-instruction/m-p/1169418#M6620</link>
    <description>&lt;P&gt;Your test code is not representative of how you describe your application code.&lt;/P&gt;&lt;P&gt;&lt;EM&gt;...&amp;nbsp;needs mostly at least three 64-bit CMPXCHGs on a 64-bit-system when there's no collision with other threads which want to update this three values.&lt;BR /&gt;So what I'm thinking about is if it might me faster to replace these CMPXCHGs with RTM transactional memory&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;Your test code is measuring the RTM overhead for &lt;EM&gt;&lt;STRONG&gt;each &lt;/STRONG&gt;&lt;/EM&gt;of the three equivalent CMPXCHGs (rtmFetchAdd) as opposed to a correct (my assumption based on description) method of placing the RTM region around all three of the three equivalent CMPXCHGs (3x v+=a).&lt;/P&gt;&lt;P&gt;You should be using something like:&lt;/P&gt;&lt;P&gt;if(_xbegin() == _XBEGIN_STARTED)&lt;BR /&gt;{&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; diddleOnce();&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; diddleTwice();&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; diddleThrice();&lt;BR /&gt;&amp;nbsp;&amp;nbsp; _xend();&lt;BR /&gt;&amp;nbsp;&amp;nbsp; return true; // all three operations completed successfully&lt;BR /&gt;} else {&lt;BR /&gt;&amp;nbsp; return false; // Collision&lt;BR /&gt;}&lt;/P&gt;&lt;P&gt;Jim Dempsey&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Sun, 29 Sep 2019 12:56:04 GMT</pubDate>
    <dc:creator>jimdempseyatthecove</dc:creator>
    <dc:date>2019-09-29T12:56:04Z</dc:date>
    <item>
      <title>TSX vs. LOCK-instruction</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/TSX-vs-LOCK-instruction/m-p/1169416#M6618</link>
      <description>&lt;P&gt;I've developed a LRU-cache that ist mostly lock-free, i.e. fetches that result in a cache hit can update the LRU-links in parallel.&lt;BR /&gt;A fetch that results in a LRU-hit to be spilled to the top of the LRU-links needs mostly at least three 64-bit CMPXCHGs on a 64-bit-system when there's no collision with other threads which want to update this three values.&lt;BR /&gt;So what I'm thinking about is if it might me faster to replace these CMPXCHGs with RTM transactional memory. So has anyone here some performance-metrics if that could be advantageous?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 26 Sep 2019 09:38:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/TSX-vs-LOCK-instruction/m-p/1169416#M6618</guid>
      <dc:creator>Montero__Bonita</dc:creator>
      <dc:date>2019-09-26T09:38:45Z</dc:date>
    </item>
    <item>
      <title>Can anyone here compile and</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/TSX-vs-LOCK-instruction/m-p/1169417#M6619</link>
      <description>&lt;P&gt;Can anyone here compile and run this code to check the performance of a simlple TSX-update on a size_t-variable against an atomic LOCK XADD or LOCK CMPXCHG? I only have a Ryzen and an older Xeon without TSX.&lt;/P&gt;
&lt;PRE class="brush:cpp; class-name:dark;"&gt;#if defined(_MSC_VER)
	#include &amp;lt;Windows.h&amp;gt;
	#include &amp;lt;intrin.h&amp;gt;
#elif defined(__unix__)
	#include &amp;lt;sys/sysinfo.h&amp;gt;
	#include &amp;lt;sched.h&amp;gt;
	#include &amp;lt;pthread.h&amp;gt;
	#include &amp;lt;immintrin.h&amp;gt;
#endif
#include &amp;lt;iostream&amp;gt;
#include &amp;lt;thread&amp;gt;
#include &amp;lt;cstddef&amp;gt;
#include &amp;lt;atomic&amp;gt;
#include &amp;lt;functional&amp;gt;
#include &amp;lt;chrono&amp;gt;
#include &amp;lt;vector&amp;gt;
#include &amp;lt;cstdlib&amp;gt;
#include &amp;lt;cmath&amp;gt;
#include &amp;lt;array&amp;gt;

bool hasTSX();

using namespace std;
using namespace chrono;

inline
size_t fetchAdd( size_t volatile &amp;amp;v, size_t a )
{
#if defined(_MSC_VER)
	#if defined(_M_X64)
	return (size_t)_InterlockedExchangeAdd64( &amp;amp;(__int64 &amp;amp;)v, (__int64)a );
	#elif defined(_M_IX86)
	return (size_t)_InterlockedExchangeAdd( &amp;amp;(long &amp;amp;)v, (long)a );
	#else
		#error unsupported architecture
	#endif
#elif defined(__GNUC__) || defined(__clang__)
	return __sync_fetch_and_add( &amp;amp;v, a );
#else
		#error unsupported architecture
#endif
}

inline
bool rtmFetchAdd( size_t volatile &amp;amp;v, size_t a )
{
	if( _xbegin() == _XBEGIN_STARTED )
	{
		v += a;
		_xend();
		return true;
	}
	else
		return false;
}

inline
size_t compareExchange( size_t volatile &amp;amp;v, size_t c, size_t x )
{
#if defined(_MSC_VER)
	#if defined(_M_X64)
	return (size_t)_InterlockedCompareExchange64( &amp;amp;(__int64 &amp;amp;)v, (__int64)x, (__int64)c );
	#elif defined(_M_IX86)
	return (size_t)_InterlockedCompareExchange( &amp;amp;(long &amp;amp;)v, (long)x, (long)c );
	#else
		#error unsupported architecture
	#endif
#elif defined(__GNUC__) || defined(__clang__)
	return __sync_val_compare_and_swap( &amp;amp;v, c, x );
#else
		#error unsupported architecture
#endif
}

int main( int argc, char **argv )
{
	if( argc &amp;lt; 2 )
		return -1;
	double nsPerClockCycle = 1.0 / (atof( argv[1] ) * 1.0e9);

	auto thrXadd = []( uint8_t volatile &amp;amp;run, size_t adds, size_t volatile &amp;amp;atm, atomic&amp;lt;size_t&amp;gt; &amp;amp;misses )
	{
		while( !run );
		for( size_t i = adds; i; --i )
			fetchAdd( atm, 1 );
	};
	auto thrXchg = []( uint8_t volatile &amp;amp;run, size_t adds, size_t volatile &amp;amp;atm, atomic&amp;lt;size_t&amp;gt; &amp;amp;misses )
	{
		while( !run );
		size_t missed = 0;
		for( size_t i = adds, cmp = atm; i; --i )
		{
			for( size_t res; ; )
				if( (res = compareExchange( atm, cmp, cmp + 1 )) == cmp )
				{
					cmp = cmp + 1;
					break;
				}
				else
					cmp = res,
					++missed;
		}
		misses.fetch_add( missed );
	};
	auto rtmAdd = []( uint8_t volatile &amp;amp;run, size_t adds, size_t volatile &amp;amp;atm, atomic&amp;lt;size_t&amp;gt; &amp;amp;misses )
	{
		while( !run );
		size_t missed = 0;
		for( size_t i = adds; i; --i )
			while( !rtmFetchAdd( atm, 1 ) )
				++missed;
		misses.fetch_add( missed );
	};
	using threadfunc = void (*)( uint8_t volatile &amp;amp;, size_t, size_t volatile &amp;amp;, atomic&amp;lt;size_t&amp;gt; &amp;amp; );
	array&amp;lt;threadfunc, 3&amp;gt;   atf;
	array&amp;lt;char const *, 3&amp;gt; threadDescr;
	size_t                 nTests;
	size_t const           ADDS = 10'000'000;
	unsigned               nProcessors = thread::hardware_concurrency();

	atf[0]         = thrXadd;
	atf[1]         = thrXchg;
	atf[2]         = rtmAdd;
	threadDescr[0] = "xadd-thread";
	threadDescr[1] = "cmpxchge-thread";
	threadDescr[2] = "rtm-thread";
	nTests         = hasTSX() ? atf.size() : atf.size() - 1;

	for( size_t m = 0; m != nTests; ++m )
	{
		cout &amp;lt;&amp;lt; threadDescr&lt;M&gt; &amp;lt;&amp;lt; ":" &amp;lt;&amp;lt; endl;
		for( unsigned nThreads = 1; nThreads &amp;lt;= nProcessors; ++nThreads )
		{
			atomic&amp;lt;size_t&amp;gt; misses( 0 );
			uint8_t        run = false;
			size_t         atm;
			vector&amp;lt;thread&amp;gt; threads;
			for( unsigned i = 0; i != nThreads; ++i )
			{
				threads.emplace_back( atf&lt;M&gt;, ref( run ), ADDS, ref( atm ), ref( misses ) );
#if defined(_MSC_VER)
				SetThreadAffinityMask( threads&lt;I&gt;.native_handle(), (DWORD_PTR)1 &amp;lt;&amp;lt; i );
#elif defined(__unix__)
				cpu_set_t cpuset;
				CPU_ZERO(&amp;amp;cpuset);
				CPU_SET(i, &amp;amp;cpuset);
				pthread_setaffinity_np( threads&lt;I&gt;.native_handle(), sizeof cpuset, &amp;amp;cpuset );
#endif
			}
			time_point&amp;lt;high_resolution_clock&amp;gt; start = high_resolution_clock::now();
			run = true;
			for( unsigned i = 0; i != nThreads; ++i )
				threads&lt;I&gt;.join();
			uint64_t ns = (uint64_t)duration_cast&amp;lt;nanoseconds&amp;gt;( high_resolution_clock::now() - start ).count();;

			double nsPerAdd = (double)ns / nThreads / ADDS / 1.0e9;
			cout &amp;lt;&amp;lt; "threads: " &amp;lt;&amp;lt; nThreads &amp;lt;&amp;lt; " cycles: " &amp;lt;&amp;lt; nsPerAdd / nsPerClockCycle &amp;lt;&amp;lt; " misses-ratio: " &amp;lt;&amp;lt; (int)(100.0 * (size_t)misses / nThreads / ADDS) &amp;lt;&amp;lt; "%" &amp;lt;&amp;lt; endl;
		}
		cout &amp;lt;&amp;lt; endl;
	}
}

bool hasTSX()
{
#if defined(_MSC_VER)
	int regs[4];
	__cpuidex( regs, 7, 0 );
	return regs[1] &amp;amp; (1 &amp;lt;&amp;lt; 11);
#else
	return true;
#endif
}&lt;/I&gt;&lt;/I&gt;&lt;/I&gt;&lt;/M&gt;&lt;/M&gt;&lt;/PRE&gt;

&lt;P&gt;The code tests for TSX-capabilities only on Windows, so for the third test the code might crash on Linux-systems.&lt;BR /&gt;You need to run it with the base-clock of your processor as a parameter, i.e. "4.0" if your CPU has 4.0GHz.&lt;BR /&gt;With gcc or clang, the code has to be compiled with &lt;STRONG&gt;-mrtm&lt;/STRONG&gt; (to enable RTM-support).&lt;/P&gt;</description>
      <pubDate>Fri, 27 Sep 2019 14:05:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/TSX-vs-LOCK-instruction/m-p/1169417#M6619</guid>
      <dc:creator>Montero__Bonita</dc:creator>
      <dc:date>2019-09-27T14:05:00Z</dc:date>
    </item>
    <item>
      <title>Your test code is not</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/TSX-vs-LOCK-instruction/m-p/1169418#M6620</link>
      <description>&lt;P&gt;Your test code is not representative of how you describe your application code.&lt;/P&gt;&lt;P&gt;&lt;EM&gt;...&amp;nbsp;needs mostly at least three 64-bit CMPXCHGs on a 64-bit-system when there's no collision with other threads which want to update this three values.&lt;BR /&gt;So what I'm thinking about is if it might me faster to replace these CMPXCHGs with RTM transactional memory&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;Your test code is measuring the RTM overhead for &lt;EM&gt;&lt;STRONG&gt;each &lt;/STRONG&gt;&lt;/EM&gt;of the three equivalent CMPXCHGs (rtmFetchAdd) as opposed to a correct (my assumption based on description) method of placing the RTM region around all three of the three equivalent CMPXCHGs (3x v+=a).&lt;/P&gt;&lt;P&gt;You should be using something like:&lt;/P&gt;&lt;P&gt;if(_xbegin() == _XBEGIN_STARTED)&lt;BR /&gt;{&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; diddleOnce();&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; diddleTwice();&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; diddleThrice();&lt;BR /&gt;&amp;nbsp;&amp;nbsp; _xend();&lt;BR /&gt;&amp;nbsp;&amp;nbsp; return true; // all three operations completed successfully&lt;BR /&gt;} else {&lt;BR /&gt;&amp;nbsp; return false; // Collision&lt;BR /&gt;}&lt;/P&gt;&lt;P&gt;Jim Dempsey&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 29 Sep 2019 12:56:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/TSX-vs-LOCK-instruction/m-p/1169418#M6620</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2019-09-29T12:56:04Z</dc:date>
    </item>
    <item>
      <title>[quote=jimdempseyatthecove</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/TSX-vs-LOCK-instruction/m-p/1169419#M6621</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;jimdempseyatthecove (Blackbelt) wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Your test code is not representative of how you describe your application code.&lt;/P&gt;&lt;P&gt;&lt;EM&gt;...&amp;nbsp;needs mostly at least three 64-bit CMPXCHGs on a 64-bit-system when there's no collision with other threads which want to update this three values.&lt;/EM&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Two of these act on a shared reader, multiple writer mutex with special ordering for readers and writers and prioritization of writers. And one is used for some lock-free purpose. So all CMPXCHGs have to be their own transaction. I simply want to know whether a transaction on a single value in memory could be faster than LOCK CMPXCHG.&lt;/P&gt;</description>
      <pubDate>Sun, 29 Sep 2019 13:09:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/TSX-vs-LOCK-instruction/m-p/1169419#M6621</guid>
      <dc:creator>Montero__Bonita</dc:creator>
      <dc:date>2019-09-29T13:09:00Z</dc:date>
    </item>
    <item>
      <title>I got some horrible results</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/TSX-vs-LOCK-instruction/m-p/1169420#M6622</link>
      <description>&lt;P&gt;I got some horrible results from someone with a dual-core RTM-enabled CPU:&lt;/P&gt;
&lt;PRE class="brush:; class-name:dark;"&gt;~ &amp;gt;&amp;gt;&amp;gt; clang -O3 -mrtm test.cpp -o test2 -lstdc++ -lpthread                                                                                                                                                                                  
~ &amp;gt;&amp;gt;&amp;gt; ./test2 3.0                                                                                                                                                                                                                           
xadd-thread:
threads: 1 cycles: 21.2706 misses-ratio: 0%
threads: 2 cycles: 47.6577 misses-ratio: 0%
threads: 3 cycles: 52.6984 misses-ratio: 0%
threads: 4 cycles: 68.9667 misses-ratio: 0%

cmpxchge-thread:
threads: 1 cycles: 19.0154 misses-ratio: 0%
threads: 2 cycles: 68.8116 misses-ratio: 37%
threads: 3 cycles: 76.2158 misses-ratio: 89%
threads: 4 cycles: 137.463 misses-ratio: 162%

rtm-thread:
threads: 1 cycles: 169.928 misses-ratio: 0%
threads: 2 cycles: 11993 misses-ratio: 7012%
threads: 3 cycles: 6693.75 misses-ratio: 5003%
threads: 4 cycles: 7583.44 misses-ratio: 7252%&lt;/PRE&gt;

&lt;P&gt;First, either the time setting up or closing a transaction is very expensive so that the overall transaction has a high cost.&lt;BR /&gt;Second, the rate of collisions is disgusting, even on this dual-core-CPU. Imagine having about 70 retries incrementing just a single size_t!&lt;BR /&gt;So if RTM has such high costs it seems only good for cases in which they prevent a kernel-wait, which would induce even higher costs.&lt;BR /&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 30 Sep 2019 06:21:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/TSX-vs-LOCK-instruction/m-p/1169420#M6622</guid>
      <dc:creator>Montero__Bonita</dc:creator>
      <dc:date>2019-09-30T06:21:00Z</dc:date>
    </item>
    <item>
      <title>Looks bad for single CMPXCHG.</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/TSX-vs-LOCK-instruction/m-p/1169421#M6623</link>
      <description>&lt;P&gt;Looks bad for single CMPXCHG. However you still need to examine your complete sequence of CMPXCHG's to successfully pass through your Lock-Free (-Wait-Free?) code.&lt;/P&gt;&lt;P&gt;It is not correct to assume the sum of three CMPXCHG's is the true cost of success&amp;nbsp;(and cost of failure). This said, this RTM test does not look favorable.&lt;/P&gt;&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Mon, 30 Sep 2019 13:08:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/TSX-vs-LOCK-instruction/m-p/1169421#M6623</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2019-09-30T13:08:16Z</dc:date>
    </item>
  </channel>
</rss>

