Jeffrey_B_ · ‎08-30-2015

Hello,

I would like to understand how I can successfully build tbb 4.4 when I compile on a non-rtm supported cpu.

For example, using gcc 4.9.3 I can successfully compile tbb 4.4 on a machine with the following CPU:

Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz

However, when I use the same exact compiler binary on a machine with the following CPU:

Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz

I get the following errors:

g++ -o x86_rtm_rw_mutex.o -c -MMD -DTBB_USE_DEBUG -DDO_ITT_NOTIFY -g -O0 -DUSE_PTHREAD -m64 -mrtm -fPIC -D__TBB_BUILD=1 -Wall -Wno-parentheses -Wno-non-virtual-dtor -I../../src -I../../src/rml/include -I../../include ../../src/tbb/x86_rtm_rw_mutex.cpp
/tmp/cc49cXFH.s: Assembler messages:
/tmp/cc49cXFH.s:564: Error: no such instruction: `xtest'
/tmp/cc49cXFH.s:590: Error: no such instruction: `xabort $255'
/tmp/cc49cXFH.s:598: Error: no such instruction: `xabort $255'
/tmp/cc49cXFH.s:604: Error: no such instruction: `xend'
/tmp/cc49cXFH.s:765: Error: no such instruction: `xbegin .L56'
/tmp/cc49cXFH.s:923: Error: no such instruction: `xbegin .L71'
/tmp/cc49cXFH.s:1143: Error: no such instruction: `xabort $255'
make[1]: *** [x86_rtm_rw_mutex.o] Error 1

If I remove -mrtm then I am able to build.

So my question is: why does this happen, and how is it possible to build tbb with -mrtm and then run on a box that does not support these instructions? Wouldn't I get SIGILL if it attempted to execute the code using the rtm instructions?

Thanks.

/JMB

RafSchietekat · ‎08-30-2015

That's funny, and I'm curious about the answer... But isn't that 4th-generation "Haswell" i7 processor affected by HSW136, meaning you might want to not use TSX?

Vladimir_P_1234567890 · ‎08-31-2015

hello all,

RTM is enabled for the official binaries and for builds with gcc 4.8+.

@JMB: please check whether you have the same binutils (the 'as' version in particular) on both machines. The assembler must support RTM.

@Raf: There is a runtime dispatcher (the cpu_has_speculation() routine in tbb_misc.cpp), so if a processor supports HLE, speculative locks will use HLE; if the processor does not support HLE, the speculative lock implementation will call the regular lock implementation. I assume that HLE should be disabled for processors that are affected by HSW136.

--Vladimir
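For readers who want to see what such a runtime check can look like, here is a minimal sketch. It is not the code from tbb_misc.cpp; the names TsxSupport, decode_leaf7_ebx and query_tsx_support are made up for illustration. It assumes GCC or Clang on x86 and relies on the documented CPUID layout: leaf 7, sub-leaf 0 reports HLE in EBX bit 4 and RTM in EBX bit 11.

```cpp
#include <cstdint>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>  // GCC/Clang helpers for the CPUID instruction
#define SKETCH_HAVE_CPUID 1
#endif

// Hypothetical result type: TSX feature flags from CPUID leaf 7, sub-leaf 0.
struct TsxSupport {
    bool hle;  // EBX bit 4
    bool rtm;  // EBX bit 11
};

// Pure decoding step, kept separate so it can be checked without real hardware.
inline TsxSupport decode_leaf7_ebx(std::uint32_t ebx) {
    return TsxSupport{ (ebx & (1u << 4)) != 0, (ebx & (1u << 11)) != 0 };
}

// Hypothetical helper: asks the running CPU; reports "no support" when
// CPUID leaf 7 is unavailable or the target is not x86.
inline TsxSupport query_tsx_support() {
#ifdef SKETCH_HAVE_CPUID
    if (__get_cpuid_max(0, nullptr) >= 7) {
        unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
        __cpuid_count(7, 0, eax, ebx, ecx, edx);
        return decode_leaf7_ebx(ebx);
    }
#endif
    return TsxSupport{ false, false };
}
```

A lock implementation could call query_tsx_support() once at startup and take the speculative path only when the relevant flag is set, falling back to a regular lock otherwise. Note that the flag only says the instructions exist; it does not by itself tell you whether an erratum such as HSW136 applies.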

Jeffrey_B_ · ‎08-31-2015

Raf,

I am not quite sure what you think is funny. Can you explain?

Here is the basic issue that concerns me. Let's say that I build the library on a Haswell box for a binary that might be used on various CPUs (including SandyBridge and IvyBridge). When the binary runs it will fail with SIGILL when any tbb code executes that uses rtm instructions. OTOH, if I compile with the -mrtm flag removed then my build should work on any CPU (Haswell, SandyBridge, IvyBridge).

I think that type of default compilation behaviour is bizarre for a general purpose library. There is no mention of this issue in the build instructions, which makes me think I might be missing something. Why does the library choose to use -mrtm just because the compiler supports the option? That seems bizarre to me.

Jeffrey_B_ · ‎08-31-2015

Vlad,

Ah, that makes sense on both fronts.

When we built gcc 4.9.3 we could not decide if we should also build binutils, but I now have the answer.

Thanks for the info on runtime tbb speculative execution code that determines what code can run so that I don't need to worry about SIGILL being generated. Perhaps if I read the code more closely I could have figured that out so I appreciate your response.

Please ignore my response to Raf above as I responded before I saw your response.

Best,

/JMB

RafSchietekat · ‎08-31-2015

#3P3 That sounds like part of the explanation, I just assumed that this was tightly integrated into the compiler, with all the talk about what the optimiser is able to do with inline assembler code.

#3P4 Don't you mean RTM instead of HLE (which is backward compatible by itself)? But it does sound like the other part of the explanation.

#4P2 You say "bizarre", I say "funny".

Jeffrey_B_ · ‎08-31-2015

Raf,

I did a bit of reading of the code and realized that neither -mrtm nor the correct binutils is required. Crafty programmers did the following:

When -mrtm is not present, __RTM__ is undefined and thus __TBB_TSX_INTRINSICS_PRESENT is not defined. So, the code falls back to this thing inside include/tbb/machine/gcc_itsx.h:

inline static void __TBB_machine_end_transaction()
{
    __asm__ volatile (".byte 0x0F; .byte 0x01; .byte 0xD5" :::"memory");   // XEND
}

which is just good old-fashioned hard-coded asm and thus has no reliance on binutils.

So I get both: I get to compile without -mrtm, and if I run on a processor with RTM support then I still get that support.

Vladimir_P_1234567890 · ‎09-01-2015

Raf Schietekat wrote:

#3P4 Don't you mean RTM instead of HLE (which is backward compatible by itself)? But it does sound like the other part of the explanation.

Indeed :). But for the processor flag check I guess we could check whether HLE is enabled.

Ilia_M_ · ‎10-09-2015

Hi,

I have a problem. I'm planning to use the TBB concurrent set in my project. But before, I tried to benchmark it. Here is my code:

#include <cstdint>   // uint64_t
#include <cstdlib>   // rand
#include <ctime>
#include <iostream>
#include <unordered_set>
#include <tbb/mutex.h>
#include <tbb/parallel_for.h>
#include <tbb/task_scheduler_init.h>
#include <tbb/concurrent_unordered_set.h>

#include "SpookyV2.h"

struct Hash
{
	uint64_t operator()(uint64_t op) const
	{
		return SpookyHash::Hash64(&op, sizeof(op), 0);
	}
};

const size_t T_SIZE = 1 << 28;
#define STD
#ifdef STD
	typedef std::unordered_multiset<uint64_t, Hash> HashSet;	
#else
	typedef tbb::concurrent_unordered_multiset<uint64_t, Hash> HashSet;
#endif
HashSet set(T_SIZE);

struct Body
{
	void operator()(const tbb::blocked_range<size_t>& range) const
	{
		for (size_t i = range.begin(); i < range.end(); i++)
		{
			uint64_t value = (uint64_t(rand()) << 31) | rand();
			set.insert(value);
		}
	}
};

int main(int argc, char * argv[])
{
	tbb::task_scheduler_init init(1);
	tbb::parallel_for(tbb::blocked_range<size_t>(0, T_SIZE), Body());
	std::cout << double(clock()) / CLOCKS_PER_SEC << std::endl;
	return 0;
}

On my machine with an i7-3770, the concurrent hash set with several threads runs slightly faster than the single-threaded unordered set. However, on our server it runs N times slower, where N is the number of threads, which is very frustrating. I used the binary distribution to compile my code on both machines. I tried to compile TBB from source, but got the same error as the thread starter. Can it be the culprit, i.e. do those instructions really affect the performance in such a dramatic way? And how can I know whether a processor is affected by HSW136?
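One thing that may be worth ruling out before blaming the instructions: std::clock() reports processor time used by the program, and on POSIX systems that is summed over all threads, so a run with N busy threads can report roughly N times the elapsed time. Whether that explains the numbers above is only a guess. The sketch below measures both values side by side; Timing and time_call are hypothetical names, not part of TBB.

```cpp
#include <chrono>
#include <ctime>
#include <thread>

// Hypothetical result type: elapsed wall-clock time vs. processor time.
struct Timing {
    double wall_seconds;
    double cpu_seconds;
};

// Runs fn once and records both clocks around the call.
template <typename Fn>
Timing time_call(Fn&& fn) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();
    fn();
    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();
    return Timing{
        std::chrono::duration<double>(wall_end - wall_start).count(),
        static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC
    };
}
```

In the benchmark above, the parallel_for call could be wrapped in a lambda and passed to time_call; comparing wall_seconds across thread counts says more about scaling than comparing cpu_seconds.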

Trouble building TBB 4.4. because of -mrtm