TBB uses overloaded operator new to allocate tasks. Does it allocate on cache lines? Can I make it do so?
Thank you,
renorm.
The overloaded operator new eventually uses an internal function called NFS_Allocate, the same function that is used by cache_aligned_allocator. So allocated tasks are aligned on cache line size.
However, there is another aspect. The space for a task is divided between the task object itself (i.e. an instance of a class derived from tbb::task) and the task prefix, an instance of an internal class. As its name suggests, the task prefix precedes the task object. So the user-visible task object (i.e. the address returned by the operator new) has a weaker alignment of 16 bytes.
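For anyone who wants to see this for themselves, here is a small standalone sketch (using the old tbb::task API from this era of TBB) that prints the offsets; EmptyTask, the 64-byte cache line size, and the allocation sizes are just illustrative assumptions:
[cpp]
#include <cstdio>
#include <cstddef>
#include "tbb/task.h"
#include "tbb/task_scheduler_init.h"
#include "tbb/cache_aligned_allocator.h"

// A do-nothing task, only used to see what address operator new returns.
struct EmptyTask : public tbb::task {
    tbb::task* execute() { return NULL; }
};

int main() {
    tbb::task_scheduler_init init;

    // Address returned by the overloaded operator new (the user-visible task object).
    tbb::task& t = *new(tbb::task::allocate_root()) EmptyTask;
    std::printf("task offset within an assumed 64-byte line: %u\n",
                (unsigned)((size_t)&t % 64));   // a multiple of 16, not necessarily 0
    tbb::task::spawn_root_and_wait(t);          // runs and frees the task

    // cache_aligned_allocator goes through NFS_Allocate directly.
    tbb::cache_aligned_allocator<int> alloc;
    int* p = alloc.allocate(1);
    std::printf("cache_aligned_allocator offset: %u\n",
                (unsigned)((size_t)p % 64));    // expected to be 0
    alloc.deallocate(p, 1);
    return 0;
}
[/cpp]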
Let's say I want to run parallel_for with the Body defined below. Because tasks are cache aligned, I don't have to copy the variable counter onto the function stack in order to avoid false sharing. Do I understand that right?
[cpp]
struct Body {
    void operator() (const blocked_range<int>&) const {
        // Copy counter into stack. Do I need to do it?
        int counter_copy = counter;
        // Here counter_copy is used very heavily and then copied back
        [...]
        counter = counter_copy;
    }
    int counter;
};
[/cpp]
Thank you,
renorm.
With the Body being a member of a tbb::task subclass, you wouldn't have false sharing of counter, that's correct. Just be sure to initialise it. :-)
Right, false sharing is not an issue. I would, however, check whether the compiler is able to optimize all the heavy work using a register.
And also, since operator() is the only call executed for Body (not counting construction and destruction), how does the class member differ from a local variable in the given example? It seems to me that the counter value is useless after operator() finishes, because right after that the completed task is destroyed and so is the Body object it contains.
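Just to make the comparison concrete, here is a rough sketch of the local-variable alternative being discussed (the loop body is only a placeholder for the heavy work):
[cpp]
struct Body {
    void operator()(const tbb::blocked_range<int>& r) const {
        int counter = 0;                 // local: lives on the stack / in a register,
                                         // so the false-sharing question never arises
        for (int i = r.begin(); i != r.end(); ++i) {
            ++counter;                   // placeholder for the heavy work on counter
        }
        // publish the final value once here if it is needed after operator() returns
    }
};
[/cpp]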
I was wondering what the counter value was for, too. Maybe this is something for parallel_reduce, instead?
Thanks for the replies. The code fragment I posted is deliberately made very simple. There is more to the body I use in my project. Each instance of Body initializes counter from a global atomic variable, and then counter is used very heavily. The call operator can invoke other class methods. The other methods use counter as well, but not heavily. Once all the work is done, counter is added to the global atomic variable. It is good to hear that TBB uses cache-aligned allocation for its tasks. Can we say the same about boost threads?
Regards,
renorm.
That sounds very suspicious: you read an atomic into "counter", modify it, and then add the final value back to the atomic? I see no meaningful result from doing something like that, because the initial value of "counter" depends on the timing of the scheduler, and part of what is added to the atomic is an earlier value of that same atomic. How can you explain that what you are doing is correct?
Could it be because my code isn't very detailed? Here is the explanation of how the atomic variable is used.
Think of the atomic variable as some sort of checking account. We withdraw some amount from the account, use all of it and go back for more, until the account is exhausted or we have nothing else to purchase. If somehow we don't spend all the money in our pocket, we deposit it back into the account. Each task checks the account and, if it is not empty, withdraws some small amount for its work and stores it in the variable counter. Once the work is complete, the task dumps the results into a concurrent vector and goes back for more money.
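In other words, a minimal sketch of that description (the names account and grainSize are illustrative, not code from the project):
[cpp]
tbb::atomic<int> account;                 // the "checking account": remaining work units

// Withdraw up to grainSize units; returns how much was actually obtained.
int withdraw(int grainSize) {
    int before = account.fetch_and_add(-grainSize);
    if (before <= 0) {
        account.fetch_and_add(grainSize); // account was already empty: undo and stop
        return 0;
    }
    return before < grainSize ? before : grainSize; // may get only a partial amount
}

// Deposit back whatever was not spent.
void deposit(int unused) {
    if (unused > 0)
        account.fetch_and_add(unused);
}
[/cpp]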
So the atomic is changed when "counter" is set, that's new information. Seems like a serious scalability issue, though.
Why is that? The atomic is a bank account (sort of). It must be synchronized; the other alternative is to use a mutex. The atomic is changed once before the heavy work starts and once after it. This way I can create only as many threads as there are available physical threads and keep the amount of work between updates of the atomic relatively small. It should be faster than creating many tasks using parallel_for, because the total amount of work isn't known.
If it must be synchronised with the tasks for some business reason, then so be it, but that doesn't make it scalable. If you "keep the amount of work between updates of the atomic relatively small", that means that there are relatively many updates, and they imply contention for access to the atomic. Others should be able to provide benchmark data, but recursive parallelism is a central idea in TBB for a reason.
If this turns out to be problematic, I would perhaps try something with parallel_reduce (distributing cash in the splitting constructor), and if the cash runs out for a particular subrange push that subrange onto a concurrent_vector for a subsequent round, cook until done. Just a random idea, though, not something I've thought through.
(I removed a part that had to do with the required vs. available memory semantics for read-modify-write of the atomic, because I don't know the business requirements anyway.)
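For what it's worth, a very rough sketch of what I mean (all names are made up, the one-unit-of-cash-per-item cost is just a placeholder, and this is the idea rather than tested code):
[cpp]
#include <utility>
#include "tbb/parallel_reduce.h"
#include "tbb/blocked_range.h"
#include "tbb/concurrent_vector.h"

typedef tbb::blocked_range<int> Range;

// Subranges whose cash ran out, to be retried in a subsequent round.
tbb::concurrent_vector<std::pair<int, int> > leftovers;

struct CashBody {
    int cash;                                      // budget available to this body

    CashBody(int initialCash) : cash(initialCash) {}
    CashBody(CashBody& other, tbb::split)          // distribute cash in the splitting constructor
        : cash(other.cash / 2) { other.cash -= cash; }

    void operator()(const Range& r) {
        for (int i = r.begin(); i != r.end(); ++i) {
            if (cash == 0) {                       // ran out: park the rest for the next round
                leftovers.push_back(std::make_pair(i, r.end()));
                return;
            }
            --cash;                                // placeholder: one unit of cash per item
        }
    }
    void join(CashBody& rhs) { cash += rhs.cash; } // pool the unspent cash
};

// One round would look like:
//   CashBody body(initialCash);
//   tbb::parallel_reduce(Range(0, totalCount), body);
//   // then re-run over the ranges collected in leftovers, until done
[/cpp]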
This is a sketch of the pattern I am using in my project.
[cpp]
atomic<int> totalCount;   // total amount of work
int grainSize;            // how much work to do before requesting more
int threads;              // how many threads to use

// each thread gets a worker; Worker is expensive to construct and copy
vector<Worker> workers(threads);
concurrent_vector<Data> concurrent_storage; // store results here

class Body {
public:
    void operator()(const blocked_range<int>& r) const {
        while (true) {
            int count = totalCount.fetch_and_add(-grainSize);
            if (count <= 0)
                return;                          // nothing to do, bail out
            counter = grainSize;
            Worker& worker = workers[r.begin()]; // get a worker
            Data data;
            for (; counter != 0; --counter) {
                // use worker to generate data
            }
            concurrent_storage.push_back(data);
        }
    }
private:
    mutable int counter;  // parallel_for requires a const operator()
};

// run parallel tasks
parallel_for(blocked_range<int>(0, threads, 1), Body(), simple_partitioner());
[/cpp]
What minimum amount of work can be inside the for loop without hurting scalability? With a lightweight Worker, parallel_for is OK to use as long as grainSize is large enough. An unused task would simply return without doing anything.
Hmm, using task-oriented constructs just to manage threads, that doesn't bode well. Apparently you allow more work to be done than totalCount indicates, unless totalCount is a multiple of grainSize. Why not instead run a parallel_for over totalCount, with the expensive Worker objects referenced from thread-local storage?
Hello,
it might be an interesting comparison to run, actually. It seems like you're trying to substitute TBB overhead (task stealing; taking a task out of a local queue; managing parent tasks and continuations) with a centralized atomic-based mechanism. The problem I see with your approach is that the one atomic integer you're using will become highly contended (on larger machines especially) and it'll affect the overall application performance. If grainSize represents some significant amount of work, then the contention will be less noticeable, but there'll be a chance of limiting parallel slackness and running into a load-imbalance problem, hurting the overall performance again. Anyway, I reworked your code to follow the TBB paradigm, and if you can compare the performance of the two and post it here, I'd be interested to learn the results.
[cpp]
const int totalCount = 1000; // total amount of work
const int grainSize  = 10;   // how much work to do before requesting more
//int threads;               // use automatic number of threads

// each thread gets a worker; Worker is expensive to construct and copy.
// enumerable_thread_specific uses cache_aligned_allocator by default
enumerable_thread_specific<Worker> workers;
concurrent_vector<Data> concurrent_storage; // store results here

class Body {
public:
    void operator()(const blocked_range<int>& r) const {
        int count = 0;
        Worker& worker = workers.local(); // get a worker
        Data data;
        for (int it = r.begin(); it < r.end(); ++it) {
            // use worker to generate data
            buffer[count++] = data;
        }
        std::copy(buffer, buffer + count, concurrent_storage.grow_by(count));
    }
private:
    mutable Data buffer[grainSize];
};

// run parallel tasks
parallel_for(blocked_range<int>(0, totalCount, grainSize), Body(), simple_partitioner());
[/cpp]
Please note that I changed the way you access concurrent_storage: pushing back each element from every thread is quite expensive, so the results are buffered and appended in batches instead.
I would make buffer local to operator(), do away with simple_partitioner (that was for the threads-related code), and process buffer-sized batches from blocked_range r until done (fewer tasks that way). I'm not sure how the size of buffer would relate to the grainsize in the parallel_for over totalCount, though.
(Added) Actually, why not keep the results local as well until the end... It's not going to do much for performance to avoid sharing the atomic only to then share access to the results. But I think the main thing would be better structure and maintainability: the idea about performance only arose because of how the information became available to us, and I don't really want to promise anything from this change. But you would get a better chance this way.
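Something along these lines, reusing the names from the previous post (Worker, Data, workers, concurrent_storage, totalCount) and an assumed batch size; just a sketch of the restructuring, not measured:
[cpp]
const int BATCH = 64;                              // assumed buffer size, unrelated to any grainsize

struct Body {
    void operator()(const tbb::blocked_range<int>& r) const {
        Worker& worker = workers.local();          // enumerable_thread_specific, as above
        Data buffer[BATCH];                        // local to the call, so no 'mutable' needed
        int i = r.begin();
        while (i != r.end()) {
            int count = 0;
            for (; i != r.end() && count != BATCH; ++i) {
                Data data;
                // use worker to generate data
                buffer[count++] = data;
            }
            // one access to the shared container per batch instead of per element
            std::copy(buffer, buffer + count, concurrent_storage.grow_by(count));
        }
    }
};

// default partitioner instead of simple_partitioner, so no grainsize to tune:
tbb::parallel_for(tbb::blocked_range<int>(0, totalCount), Body());
[/cpp]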
We may assume that totalCount = n*grainSize. I thought about TLS, but the pattern I use looks simpler (at least to me). As a general rule, I don't place anything into TLS unless there is no other way around it. Btw, what are the performance implications of placing workers into TLS? The Worker object does very heavy-duty calculations and uses plenty of memory.
Anton,
I will try your pattern and see what happens.
I tested it on a Core 2 Duo by gradually decreasing grainSize. As the grainSize became smaller than 100 (each granule is something like 5-10 floating-point operations), the TLS version lost more performance than the pattern with the atomic variable. With a large grain size the TLS version was slightly slower.
And now with my additional suggestions in #15? :-)
It means giving each thread its own entry in a concurrent_vector of vector for the results, with the index registered in TLS. The auto_partitioner instead of the simple_partitioner will avoid too many accesses to TLS, and then it's all local (as far as we've been told) until the post-join merge of the results. No tuning required, i.e., no grainsize involved.
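Concretely, something like this (a sketch only; it assumes TBB 2.2+, where concurrent_vector::push_back returns an iterator, and reuses Worker, Data, workers and totalCount from the earlier posts):
[cpp]
#include <vector>

// One results vector per thread, claimed on first use; the slot index lives in TLS.
tbb::concurrent_vector<std::vector<Data> > per_thread_results;
tbb::enumerable_thread_specific<int> my_slot(-1);

std::vector<Data>& local_results() {
    int& slot = my_slot.local();
    if (slot < 0)                                    // first call on this thread: register a slot
        slot = (int)(per_thread_results.push_back(std::vector<Data>())
                     - per_thread_results.begin());
    return per_thread_results[slot];
}

struct Body {
    void operator()(const tbb::blocked_range<int>& r) const {
        Worker& worker = workers.local();
        std::vector<Data>& out = local_results();    // everything below is thread-local
        for (int i = r.begin(); i != r.end(); ++i) {
            Data data;
            // use worker to generate data
            out.push_back(data);
        }
    }
};

// auto_partitioner (the default) keeps the number of TLS lookups small; no grainsize involved.
tbb::parallel_for(tbb::blocked_range<int>(0, totalCount), Body());
// After the parallel_for joins, merge per_thread_results into the final container on one thread.
[/cpp]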
Hi, so if I understand you correctly, for small grainsizes the TLS version is better, but for larger grainsizes it's a little slower? Could you specify the real 'totalCount' you're using and the amount of work it represents in seconds (assuming the app is executed without TBB in single-thread mode)? And the grainSizes you're using for the experiments. I'm just trying to figure out the task sizes and the number of them. And another thing: how many cores does the machine you're using for the experiments have?
