Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

How are tasks allocated? Cache-aligned or not?

renorm
Beginner
1,581 Views
TBB uses an overloaded operator new to allocate tasks. Does it allocate on cache-line boundaries? Can I make it do so?

Thank you,
renorm.
0 Kudos
41 Replies
renorm
Beginner
649 Views

The actual grain size doesn't matter. What matters is the amount of work each task performs, or the amount of work between successive writes to the atomic variable. I will post a compilable fragment later.
On a 2-core machine the TLS versions slowed down more as the grain size decreased. The TLS version replaces each write to the atomic with a task creation. How does task creation compare to atomic writes (assuming tasks are lightweight)?

@Raf
I do roughly the same thing as you suggested in #15. Each thread buffers all of its data into a local vector and then puts that vector into a concurrent_vector of vectors using the swap method. It avoids moving data around and minimizes access to the concurrent_vector.
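For reference, a minimal sketch of that pattern (Buffer, sink, and flush are placeholder names; push_back returning an iterator assumes TBB 2.2 or later):
[cpp]#include <vector>
#include "tbb/concurrent_vector.h"

typedef std::vector<double> Buffer;
tbb::concurrent_vector<Buffer> sink; // one entry per flushed thread buffer

void flush(Buffer& localBuf) {
    // Only one touch of the shared container per flush, and swap is O(1),
    // so the buffered elements themselves are never copied.
    Buffer& slot = *sink.push_back(Buffer());
    slot.swap(localBuf); // localBuf is left empty, ready to refill
}[/cpp]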
0 Kudos
RafSchietekat
Valued Contributor III
649 Views

"I do roughly the same thing as you suggested in #15."
But only part of it... :-)

"Each threads buffers all data into a vector and puts the vector into concurrent_vector of vector using swap method."
You don't need to swap at all, because concurrent_vector entries stay put right where they are by design.

"It avoids moving data around and minimizes access to the concurent_vector."
A reference or pointer stays valid as well, if you want to avoid evaluating a concurrent_vector index (how expensive is that, anyway?).
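A sketch of the difference (sink is a placeholder; as above, push_back returning an iterator needs TBB 2.2 or later):
[cpp]#include <vector>
#include "tbb/concurrent_vector.h"

tbb::concurrent_vector<std::vector<double> > sink;

void worker() {
    // Append one empty vector and keep a reference to it: concurrent_vector
    // never relocates existing elements, so the reference stays valid even
    // while other threads keep growing the container.
    std::vector<double>& mine = *sink.push_back(std::vector<double>());
    mine.push_back(3.14); // write in place; no swap needed
}[/cpp]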

0 Kudos
renorm
Beginner
649 Views
This thread got me to rethink and redesign some parts of my code, so here is another basic question. Does it make any sense to cache-align a dynamically allocated instance of tbb::atomic? In general, an atomic shouldn't be heavily contended, but what if writes are rare and reads are frequent? Can that hurt scalability?
0 Kudos
RafSchietekat
Valued Contributor III
649 Views
A location in memory can easily be shared among several caches, each with an identical copy. Only when somebody wants to write is it necessary to acquire unique ownership by telling the other caches to invalidate their copy (details vary), which takes a bit of time.

Alignment can be useful to avoid false sharing between unrelated items if at least one of them is written often. If it's mostly reads, or if anybody writing something is going to write the other items also, there's no need to keep them apart.

Well, that's my understanding, feel free to point out any mistakes or omissions.
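For illustration, a rough sketch of keeping a dynamically allocated atomic apart (cache_aligned_allocator pads each allocation to a cache-line multiple):
[cpp]#include "tbb/atomic.h"
#include "tbb/cache_aligned_allocator.h"

void example() {
    tbb::cache_aligned_allocator<tbb::atomic<int> > alloc;
    // The allocation is padded and aligned so a frequently written neighbour
    // cannot share (and keep invalidating) this cache line.
    tbb::atomic<int>* counter = alloc.allocate(1);
    *counter = 0;  // assignment and increment on tbb::atomic are atomic
    ++*counter;
    alloc.deallocate(counter, 1);
}[/cpp]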
0 Kudos
renorm
Beginner
649 Views
That is my understanding too.
How does tbb::cache_aligned_allocator manage its memory pool? Does it keep deallocated memory for quick recycling? I noticed that after cache-aligned containers go out of scope, the process's memory usage doesn't drop. The same thing doesn't happen with the STL allocator.
0 Kudos
RafSchietekat
Valued Contributor III
649 Views
Last time I looked, the scalable memory allocator kept all the memory it had ever used at its high-water mark, and I think the cache-aligned allocator sits on top of that. If you reassign new and delete to use the scalable memory allocator for better performance, you'll see the same thing occur with STL.

(That wasn't very clear without the addition.)
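A minimal sketch of what that reassignment might look like (error handling simplified, array forms omitted; linking in the tbbmalloc_proxy library achieves the same without source changes):
[cpp]#include <new>
#include "tbb/scalable_allocator.h"

// Route the global operators through the scalable allocator.
void* operator new(std::size_t size) throw(std::bad_alloc) {
    if (void* p = scalable_malloc(size)) return p;
    throw std::bad_alloc();
}
void operator delete(void* p) throw() {
    if (p) scalable_free(p);
}[/cpp]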
0 Kudos
renorm
Beginner
649 Views
OK, I redesigned my code. Now the pattern looks something like this.
[cpp]#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/enumerable_thread_specific.h"
#include <memory> // shared_ptr (std::tr1 or Boost on pre-C++11 toolchains)

using namespace tbb;

// Bundle together Worker and expensive mutable variables
struct mutable_state { /* ... */ };

// Body is cheap to copy
struct Body {
    shared_ptr<enumerable_thread_specific<mutable_state> > tlsState;

    void operator()(const blocked_range<size_t>&) const {
        mutable_state& state = tlsState->local();
        // use TLS here...
    }
};

// auto_partitioner is used by default
parallel_for(blocked_range<size_t>(0, TotalWork /*can be a big number*/), Body());[/cpp]
The default copy constructor is OK, and parallel_for can create a large number of copies for better load balancing.
0 Kudos
RafSchietekat
Valued Contributor III
649 Views
The shared_ptr looks weird (why not just a reference or pointer, and how could tlsState.local() compile?).

tbb::flattened2d seems nice to process the results.
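Roughly like this, over the concurrent_vector of vectors (sum_all is just an illustrative name; flattened2d is declared in the enumerable_thread_specific header):
[cpp]#include <vector>
#include "tbb/concurrent_vector.h"
#include "tbb/enumerable_thread_specific.h" // declares flattened2d/flatten2d

typedef tbb::concurrent_vector<std::vector<double> > Sink;

double sum_all(Sink& sink) {
    // flatten2d presents the container of containers as one flat sequence
    double total = 0;
    tbb::flattened2d<Sink> flat = tbb::flatten2d(sink);
    for (tbb::flattened2d<Sink>::iterator i = flat.begin(); i != flat.end(); ++i)
        total += *i;
    return total;
}[/cpp]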

Do tell us what you find out about performance.
0 Kudos
renorm
Beginner
649 Views
You are right, it won't compile; the pointer must be dereferenced. It is fixed now.
auto_partitioner was slightly slower.
shared_ptr is needed because there can be multiple unrelated instances of Body. Without shared_ptr, the shared state would have to be managed manually.
0 Kudos
RafSchietekat
Valued Contributor III
649 Views
"auto_partitioner was slightly slower."
That's difficult to believe.

"shared_ptr is needed because there could be multiple unrelated instances of Body. Without shared_ptr shared state variables must be managed manually."
The instance must be the same across Body instances, sure, but that doesn't mean it has to be dynamically allocated instead of on the stack.
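A sketch of what I mean (reusing the names from your fragment; run is mine):
[cpp]#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/enumerable_thread_specific.h"

struct mutable_state { /* Worker + expensive mutable variables */ };
typedef tbb::enumerable_thread_specific<mutable_state> TLS;

struct Body {
    TLS* tlsState; // non-owning: the TLS object outlives every Body copy
    explicit Body(TLS& tls) : tlsState(&tls) {}
    void operator()(const tbb::blocked_range<size_t>&) const {
        mutable_state& state = tlsState->local();
        // use state here...
    }
};

void run(size_t TotalWork) {
    TLS tls; // on the stack: no shared_ptr, no manual management
    tbb::parallel_for(tbb::blocked_range<size_t>(0, TotalWork), Body(tls));
    // parallel_for only returns when all tasks are done, so this is safe
}[/cpp]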
0 Kudos
renorm
Beginner
649 Views
auto_partitioner was indeed slightly slower. Compiler optimization could be the reason: with auto_partitioner, loop sizes are not known at compile time. I see no other reason.

The difference is about 5% or less in simplified tests. With any realistic load it should all go away.

P.S. I did test scalability on a hyperthreaded 8-core system, and the parallel version ran 9.4 times faster regardless of which parallel pattern I used. The speedup on a dual-core PC is exactly 100%.
0 Kudos
RafSchietekat
Valued Contributor III
649 Views
"auto_partitioner was indeed slightly slower. Compiler optimization could be the reason. With auto_partitioner loop sizes are not known at compile time. I see no other reason."
Hmm, if that were so, you would be able to nest compiler-optimisable loops inside an outer loop, and surely loops don't need to be of known size for the compiler to be able to optimise anything? Maybe some loop limit should be hoisted out of the loop into a constant, maybe something is still being shared... But it's difficult to say much without details.
0 Kudos
renorm
Beginner
649 Views
It could be false sharing or something was optimized out in my toy program. The difference became exactly zero once I switched to a realistic setup.

What is the best way to create TLS objects distinct for each thread?

I want to do the following. Run parallel_for with a dummy body to trigger lazy creation and then iterate through the TLS container. Is there any other way to trigger lazy creation?

0 Kudos
RafSchietekat
Valued Contributor III
649 Views

"What is the best way to create TLS objects distinct for each thread?"
If there are problems with enumerable_thread_specific, I'll have to defer to others.

"I want to do the following. Run parallel_for with a dummy body to trigger lazy creation and then iterate through the TLS container. Is there any other way to trigger lazy creation?"
Isn't that what enumerable_thread_specific::local() does? The reference for flattened2d has an example, I think.
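A sketch of that approach (the names are mine; the exemplar constructor makes each thread's element start at zero):
[cpp]#include "tbb/enumerable_thread_specific.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include <functional>

tbb::enumerable_thread_specific<long> counters(0L); // exemplar: elements start at 0

struct CountBody {
    void operator()(const tbb::blocked_range<int>& r) const {
        counters.local() += r.size(); // local() lazily creates this thread's element
    }
};

long total_work(int n) {
    tbb::parallel_for(tbb::blocked_range<int>(0, n), CountBody());
    return counters.combine(std::plus<long>()); // reduce over all per-thread elements
}[/cpp]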

0 Kudos
Anton_Pegushin
New Contributor II
649 Views
Hi, I'm not sure what you mean by:
"What is the best way to create TLS objects distinct for each thread?"
If this refers to enumerable_thread_specific, then an object of this class should really be viewed as a container of elements, where each element is tied to one particular thread, and there are as many elements as there are threads accessing the enumerable_thread_specific object.
0 Kudos
renorm
Beginner
649 Views
Yes, I am using enumerable_thread_specific. I can't iterate through the container before all threads have called local(). What I want is the same effect as using a distinct exemplar for each thread. One way to do it is to trigger lazy creation of all the thread-local copies and then overwrite the default-constructed elements by iterating through the container.
0 Kudos
ARCH_R_Intel
Employee
649 Views
There is a constructor for enumerable_thread_specific that takes a functor finit, which is used to generate the local copies. That could be used to create a distinct exemplar for each thread, even though initialization is lazy.
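For example (initial_value is just a placeholder):
[cpp]#include "tbb/enumerable_thread_specific.h"

double initial_value() { return 1.0; } // finit: called once per thread, on first local()

tbb::enumerable_thread_specific<double> tls(initial_value);[/cpp]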
0 Kudos
renorm
Beginner
649 Views
I see. Does finit have to be re-entrant? Do I need to synchronize the internals of finit, or are the callbacks from enumerable_thread_specific synchronized?

Another relevant question. Does TLS work with OpenMP and Boost threads?

Thank you,
renorm.
0 Kudos
Anton_Pegushin
New Contributor II
649 Views
Hello, yes, finit needs to be safe to evaluate concurrently, since it can be executed by several threads while their corresponding TLS elements are being initialized.
enumerable_thread_specific works for all threads created in the application; it doesn't matter whether they are OpenMP threads or native threads created with OS calls. Once a thread accesses its element of the enumerable_thread_specific by calling the local() member function, the element for that thread is created.
0 Kudos
renorm
Beginner
628 Views
Hi Anton,
Can finit have mutable static state? finit is passed by value, and all copies will generate the same result unless they use some shared state. To be more specific, is this Finit OK?
[cpp]struct Finit {
    static int count; // initially = 0, shared by all copies of Finit
    int operator()() {
        return count++;
    }
};
[/cpp]


0 Kudos
Anton_Pegushin
New Contributor II
628 Views
Hi, well, this is exactly what we call a non-thread-safe function :). Although the copies of finit used by different threads to evaluate their TLS elements will be different, all of them reference the same value in static memory: count. If you want this to be thread-safe, make it a:
[cpp]static tbb::atomic<int> count;[/cpp]
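Put together, the corrected functor might look like this (note the out-of-class definition the static member also needs):
[cpp]#include "tbb/atomic.h"

struct Finit {
    static tbb::atomic<int> count; // atomic: safe to increment concurrently
    int operator()() {
        return count++; // fetch-and-increment: each thread gets a unique value
    }
};
tbb::atomic<int> Finit::count; // zero-initialized before any thread runs[/cpp]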
0 Kudos