Code optimization and member variables

mrussel · ‎05-15-2008

I had difficulty with my first TBB program. I basically copied the parallel_reduce example from the TBB tutorial:

class SumFoo {
    float* my_a;
public:
    float sum;
    void operator()( const blocked_range<size_t>& r ) {
        float *a = my_a;
        for( size_t i=r.begin(); i!=r.end(); ++i )
            sum += Foo(a[i]);
    }
    // splitting constructor
    SumFoo( SumFoo& x, split ) : my_a(x.my_a), sum(0) {}
    // merge the partial sum of another body
    void join( const SumFoo& y ) {sum+=y.sum;}
    SumFoo(float a[] ) :
        my_a(a), sum(0)
    {}
};

My test just summed the contents of an array of doubles. The parallel version was twice as slow as the serial test because the variable sum was being stored to and read from memory on each iteration, whereas in the serial case it was optimized into a register variable. I'm using Visual Studio 2005.

My understanding was that the body object is only accessed by one thread (except for the splitting constructor). Is there a memory barrier being added by TBB?

Alexey-Kukanov · ‎05-15-2008

Wow, it is our own Tutorial that promotes this "favorite" array summation example, whose bad performance I analyzed in my blog... Seems we should fix the document.

Your understanding is right: the body object is updated by just one thread. But for parallel_reduce, it is passed by reference to many task objects, and of course there are memory barriers to process these tasks correctly. I believe referencing the same body in several tasks is enough to prevent compilers from using a register for the sum; or perhaps just making it a class member is enough.

mrussel · ‎05-29-2008

I see how the function can be improved, but I don't understand why the Visual C++ compiler produces the slower code.

It can't unroll the loop because the condition is i != end instead of i < end, but why can't it use a register for the sum?

I've posted a sample program to Microsoft to see if anyone can answer: http://forums.microsoft.com/msdn/ShowPost.aspx?siteid=1&postid=3417908 I've been able to take TBB out of the picture. It appears that just allocating the object on the heap is enough to cause this behavior.

robert-reed · ‎05-30-2008

Putting the object on the heap places it in a no-man's-land in a multi-threaded environment. The reason the compiler, whether Visual C++ or Intel's, has to emit more protective code is that it can't be sure its register is the only register holding the sum. Other processors out there could each imagine that their own register owns the value, and chaos would ensue. The compiler has to assume that the pointers are essentially volatile and reread them each time they are used. But the compiler can emit the good code if you make the sum local, as Alexey demonstrates at the end of his blog post.
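For readers without the blog post at hand, here is a minimal sketch of the "make the sum local" idea. It is not Alexey's code: IndexRange is a hypothetical stand-in for a TBB blocked range so the example compiles without TBB, and SumLocal and sum_in_two_chunks are names made up for illustration. The body reads the member once, accumulates in a local, and writes the member back once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a TBB blocked range, so the sketch
// builds without the TBB headers.
struct IndexRange {
    std::size_t first, last;
    std::size_t begin() const { return first; }
    std::size_t end() const { return last; }
};

class SumLocal {
    const float* my_a;
public:
    float sum;
    explicit SumLocal(const float* a) : my_a(a), sum(0) {}

    void operator()(const IndexRange& r) {
        const float* a = my_a;
        float local = sum;   // read the member once
        for (std::size_t i = r.begin(); i != r.end(); ++i)
            local += a[i];   // a local is easy for the compiler to keep in a register
        sum = local;         // write the member once
    }

    void join(const SumLocal& y) { sum += y.sum; }
};

// Hypothetical serial driver that mimics a split followed by a join.
float sum_in_two_chunks(const std::vector<float>& v) {
    SumLocal left(v.data()), right(v.data());
    const std::size_t mid = v.size() / 2;
    left(IndexRange{0, mid});
    right(IndexRange{mid, v.size()});
    left.join(right);
    return left.sum;
}
```

Whether a given compiler really keeps local in a register is best confirmed in the generated assembly; the sketch only shows the source-level change.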

RafSchietekat · ‎05-30-2008

I would think that a compiler is not in the business of making the results of races less severe by treating all variables as atomics, so I very much doubt that this is the correct explanation. Isn't the difference whether the compiler can (sufficiently easily) prove that no side effect of any code called between uses of the variable can independently change it? I think that having the variable in a register is the more amazing outcome: you may be annoyed that the code is slow, use your intelligence to point at the culprit, and then focus your attention on tracking down why the variable could just as well be in a register (assuming you make no mistakes), but a compiler has a more limited scope before it has to deliver the goods (otherwise you will buy the competitor's faster compiler instead, so to speak), so you should help the compiler help you. One criterion a compiler might use is that something on the stack will not be changed unless a pointer to it is somehow passed to the called code (relatively easy to check), whereas anything on the heap might have a persistent reference to it stored somewhere (unless proved otherwise). Disclaimer: I cannot see the sample program behind the link, so I'm combining things from the original post with whatever information was provided here about that sample program.

How does i!=end vs. i

robert-reed · ‎05-30-2008

It has less to do with treating all variables as atomics than with not knowing whether aliasing is going on during pointer dereferences, and so not being assured of private access, which is a requirement for promoting a variable to a register. But I admit to being blithe in my previous response, looking only at Alexey's example rather than at the code posted at MSDN. I've since rectified that and see two calls to the same summing class, one using a this pointer and the other using a this reference (ptc->add(...) vs tc.add(...)). The compiler can evidently assume the latter is alias-free, because it is able to inline the add call and treat tc.sum as a variable that can be promoted to a register, whereas ptc->sum is held at arm's length and not promoted to a register by this compiler. It depends on how much the compiler knows about the pointer. I wonder whether in this case adding a -noalias (/Qnoalias maybe?) switch to the compile would be enough to allow the promotion in the former case as well as the latter.
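The two call shapes mentioned above can be sketched as follows. This is a reconstruction under assumptions, not the sample posted at MSDN: the element type int, the zero-initializing constructor, and the helper names sum_via_stack_object and sum_via_heap_object are all made up for illustration, and the iterators are passed by value so that temporaries such as v.begin() can be used directly.

```cpp
#include <cassert>
#include <vector>

class test_class {
public:
    int sum;
    test_class() : sum(0) {}
    void add(std::vector<int>::const_iterator start,
             std::vector<int>::const_iterator end) {
        for (std::vector<int>::const_iterator i = start; i != end; ++i)
            sum += *i;   // sum is reached through this
    }
};

// tc.add(...): an automatic object whose address never leaves the function.
int sum_via_stack_object(const std::vector<int>& v) {
    test_class tc;
    tc.add(v.begin(), v.end());
    return tc.sum;
}

// ptc->add(...): the same class, reached through a heap pointer.
int sum_via_heap_object(const std::vector<int>& v) {
    test_class* ptc = new test_class;
    ptc->add(v.begin(), v.end());
    const int result = ptc->sum;
    delete ptc;
    return result;
}
```

Both functions return the same value. Any difference is in the generated code only, so comparing the two inner loops in a disassembly listing is the way to see whether a particular compiler promotes sum to a register in one case and not the other.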

jsanga · ‎05-30-2008

Is there a document somewhere that gives reasons or guidelines for this? I'm a lurker in the C++ moderated newsgroup and have perused the C++ standard. Whenever I ask a question about threads, the "official" response is that C++ knows nothing about threads (let alone multiple cores). If this is the official response, why wouldn't a conforming compiler always use a register if possible, and only assume aliasing if the variable is declared volatile?

RafSchietekat · ‎05-31-2008

That is indeed what a conforming compiler may do, also with multithreading. To be safe for multithreading, a compiler basically must respect the intentions of a user who is avoiding all data races (specified in a way no mere mortal can understand) to "memory locations" (only consecutive bit fields are not individually accessible), e.g., no speculative writes, only write back something bigger than a memory location if everything in it was just written to. (Did I forget anything?) See "Additions to atomic" for a few pointers.

I'm actually still wondering myself how we know we can use our existing compilers for multithreading without the benefit of an existing specification as thorough as what I first saw for Java.

robert-reed · ‎06-02-2008

I'd have to look for a document (nothing comes to mind immediately) but the issues I raised in my last post here have nothing to do with the compiler's knowledge or lack of knowledge regarding multi-threading. They are merely the rules the compiler must follow when dealing with multiple compilation units that are linked together later and may introduce functions performing side effects on pointers passed to them. The designers of the compiler try to make it as smart as is safe to do given the conditions. Maybe something like the Intel Architecture Optimization manual might offer some advice (I haven't looked through it recently but it would be the first place I would look).

mrussel · ‎06-04-2008

Wouldn't the side effects be limited to the loop itself? Without multithreading issues, the compiler just has to focus on optimizing the loop which contains no side effects, unless dereferencing the iterator i introduces one.

class test_class
{
public:
    int sum;
    void add(vector<int>::const_iterator &start, vector<int>::const_iterator &end)
    {
        for (vector<int>::const_iterator i=start; i != end; i++)
            sum += *i;
    }
};

Unless there's an issue with side effects that I'm not understanding, this must be a multithreading consideration of the compiler: it assumes that another thread could be accessing the same test_class instance.

Alexey-Kukanov · ‎06-04-2008

I think that for a dynamically allocated object, the compiler has to create and call an out-of-line version of the method, which conservatively assumes that the result of dereferencing the iterator could overlap with this->sum. In the case of a local object, however, the compiler has enough information to inline the method and establish that sum does not overlap with the processed array, so it can be kept in a register.
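To make the overlap concern concrete, here is a small single-threaded sketch. It uses raw pointers and hypothetical free functions instead of the class from the thread, purely so that the accumulator can legally point into the array being summed. When that happens, accumulating through the pointer and accumulating in a local give different answers, which is why a compiler that cannot rule out the overlap may not substitute one loop for the other.

```cpp
#include <cassert>

// Accumulates through the pointer, rereading *sum on every iteration.
void add_through_pointer(int* sum, const int* begin, const int* end) {
    for (const int* p = begin; p != end; ++p)
        *sum += *p;
}

// Accumulates in a local and stores the result once at the end.
void add_through_local(int* sum, const int* begin, const int* end) {
    int local = *sum;
    for (const int* p = begin; p != end; ++p)
        local += *p;
    *sum = local;
}

// The accumulator is the last element of the range being summed.
int aliased_pointer_result() {
    int a[3] = {1, 2, 3};
    add_through_pointer(&a[2], a, a + 3);
    return a[2];   // 3+1 = 4, 4+2 = 6, then 6+6 = 12
}

int aliased_local_result() {
    int a[3] = {1, 2, 3};
    add_through_local(&a[2], a, a + 3);
    return a[2];   // 3+1+2+3 = 9, because a[2] is still 3 when it is read
}
```

With no overlap the two functions agree, and the local form is the one that is easy to keep in a register. No threads are involved here.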

mrussel · ‎06-04-2008

I'm currently evaluating VTune. Is there an event I could monitor to detect loops with this problem?

robert-reed · ‎06-06-2008

I suppose you could look at innermost cache hits/misses to detect activity not going to a register, but your life would be complicated by a variety of issues, including the range and trip count of each loop (code optimization throws the loop bounds into question). (You might also go to whatif.intel.com and check out PTU, which will run on your VTune analyzer license and can separate the cache access events by basic block.) Then you'd want to count the number of cache accesses per loop trip and compare that to the number of accesses you'd expect with register storage of the intermediates (if you're doing a simple sum, one cache access per array element would be reasonable). With big trip counts it'll be easier to discern than with smaller ones. You'll also have to be aware of where the compiler unrolls a loop for performance, which will change the expected ratios. It sounds like a lot of work.