Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

How are tasks allocated? Cache-aligned or not?

renorm
Beginner
1,581 Views
TBB uses an overloaded operator new to allocate tasks. Does it allocate on cache-line boundaries? Can I make it do so?

Thank you,
renorm.
0 Kudos
41 Replies
renorm
Beginner
649 Views

The actual grain size doesn't matter. What matters is the amount of work each task performs, or the amount of work between successive writes to the atomic variable. I will post a compilable fragment later.
On a 2-core machine the TLS versions slowed down more as the grain size decreased. The TLS version replaces each write to the atomic with a task creation. How does task creation compare to atomic writes (assuming tasks are lightweight)?

@Raf
I do roughly the same thing as you suggested in #15. Each thread buffers all of its data into a local vector and then puts that vector into a concurrent_vector of vectors using the swap method. It avoids moving data around and minimizes access to the concurrent_vector.
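For reference, a minimal sketch of that pattern (Buffer, sink, and flush are placeholder names; push_back returning an iterator assumes TBB 2.2 or later):
[cpp]#include <vector>
#include "tbb/concurrent_vector.h"

typedef std::vector<double> Buffer;
tbb::concurrent_vector<Buffer> sink; // one entry per flushed thread buffer

void flush(Buffer& localBuf) {
    // Only one touch of the shared container per flush, and swap is O(1),
    // so the buffered elements themselves are never copied.
    Buffer& slot = *sink.push_back(Buffer());
    slot.swap(localBuf); // localBuf is left empty, ready to refill
}[/cpp]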
0 Kudos
RafSchietekat
Valued Contributor III
649 Views

"I do roughly the same thing as you suggested in #15."
But only part of it... :-)

"Each threads buffers all data into a vector and puts the vector into concurrent_vector of vector using swap method."
You don't need to swap at all, because concurrent_vector entries stay put right where they are by design.

"It avoids moving data around and minimizes access to the concurent_vector."
A reference or pointer stays valid as well, if you want to avoid evaluating a concurrent_vector index (how expensive is that, anyway?).
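A sketch of the difference (sink is a placeholder; as above, push_back returning an iterator needs TBB 2.2 or later):
[cpp]#include <vector>
#include "tbb/concurrent_vector.h"

tbb::concurrent_vector<std::vector<double> > sink;

void worker() {
    // Append one empty vector and keep a reference to it: concurrent_vector
    // never relocates existing elements, so the reference stays valid even
    // while other threads keep growing the container.
    std::vector<double>& mine = *sink.push_back(std::vector<double>());
    mine.push_back(3.14); // write in place; no swap needed
}[/cpp]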

0 Kudos
renorm
Beginner
649 Views
This thread got me to rethink and redesign some parts of my code, so here is another basic question. Does it make any sense to cache-align a dynamically allocated instance of tbb::atomic? In general, an atomic shouldn't be heavily contended, but what if writes are rare and reads are frequent? Can that hurt scalability?
0 Kudos
RafSchietekat
Valued Contributor III
649 Views
A location in memory can easily be shared among several caches, each with an identical copy. Only when somebody wants to write is it necessary to acquire unique ownership by telling the other caches to invalidate their copy (details vary), which takes a bit of time.

Alignment can be useful to avoid false sharing between unrelated items if at least one of them is written often. If it's mostly reads, or if anybody writing something is going to write the other items also, there's no need to keep them apart.

Well, that's my understanding, feel free to point out any mistakes or omissions.
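For illustration, a rough sketch of keeping a dynamically allocated atomic apart (cache_aligned_allocator pads each allocation to a cache-line multiple):
[cpp]#include "tbb/atomic.h"
#include "tbb/cache_aligned_allocator.h"

void example() {
    tbb::cache_aligned_allocator<tbb::atomic<int> > alloc;
    // The allocation is padded and aligned so a frequently written neighbour
    // cannot share (and keep invalidating) this cache line.
    tbb::atomic<int>* counter = alloc.allocate(1);
    *counter = 0;  // assignment and increment on tbb::atomic are atomic
    ++*counter;
    alloc.deallocate(counter, 1);
}[/cpp]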
0 Kudos
renorm
Beginner
649 Views
That is my understanding too.
How does tbb::cache_aligned_allocator manage its memory pool? Does it keep deallocated memory for quick recycling? I noticed that after cache-aligned containers go out of scope, the process's memory usage doesn't drop. The same thing doesn't happen with the STL allocator.
0 Kudos
RafSchietekat
Valued Contributor III
649 Views
Last time I looked, the scalable memory allocator kept all the memory it had ever used at its high-water mark, and I think the cache-aligned allocator sits on top of that. If you reassign new and delete to use the scalable memory allocator for better performance, you'll see the same thing occur with STL.

(That wasn't very clear without the addition.)
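A minimal sketch of what that reassignment might look like (error handling simplified, array forms omitted; linking in the tbbmalloc_proxy library achieves the same without source changes):
[cpp]#include <new>
#include "tbb/scalable_allocator.h"

// Route the global operators through the scalable allocator.
void* operator new(std::size_t size) throw(std::bad_alloc) {
    if (void* p = scalable_malloc(size)) return p;
    throw std::bad_alloc();
}
void operator delete(void* p) throw() {
    if (p) scalable_free(p);
}[/cpp]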
0 Kudos
renorm
Beginner
649 Views
OK, I redesigned my code. Now the pattern looks something like this.
[cpp]#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/enumerable_thread_specific.h"
#include <memory> // shared_ptr (std::tr1 or Boost on pre-C++11 toolchains)

using namespace tbb;

// Bundle together Worker and expensive mutable variables
struct mutable_state { /* ... */ };

// Body is cheap to copy
struct Body {
    shared_ptr<enumerable_thread_specific<mutable_state> > tlsState;

    void operator()(const blocked_range<size_t>&) const {
        mutable_state& state = tlsState->local();
        // use TLS here...
    }
};

// auto_partitioner is used by default
parallel_for(blocked_range<size_t>(0, TotalWork /*can be a big number*/), Body());[/cpp]
The default copy constructor is OK, and parallel_for can create a large number of copies for better load balancing.
0 Kudos
RafSchietekat
Valued Contributor III
649 Views
The shared_ptr looks weird (why not just a reference or pointer, and how could tlsState.local() compile?).

tbb::flattened2d seems nice to process the results.
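Roughly like this, over the concurrent_vector of vectors (sum_all is just an illustrative name; flattened2d is declared in the enumerable_thread_specific header):
[cpp]#include <vector>
#include "tbb/concurrent_vector.h"
#include "tbb/enumerable_thread_specific.h" // declares flattened2d/flatten2d

typedef tbb::concurrent_vector<std::vector<double> > Sink;

double sum_all(Sink& sink) {
    // flatten2d presents the container of containers as one flat sequence
    double total = 0;
    tbb::flattened2d<Sink> flat = tbb::flatten2d(sink);
    for (tbb::flattened2d<Sink>::iterator i = flat.begin(); i != flat.end(); ++i)
        total += *i;
    return total;
}[/cpp]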

Do tell us what you find out about performance.
0 Kudos
renorm
Beginner
649 Views
You are right, it won't compile; the pointer must be dereferenced. It is fixed now.
auto_partitioner was slightly slower.
shared_ptr is needed because there can be multiple unrelated instances of Body. Without shared_ptr, the shared state would have to be managed manually.
0 Kudos
RafSchietekat
Valued Contributor III
649 Views
"auto_partitioner was slightly slower."
That's difficult to believe.

"shared_ptr is needed because there could be multiple unrelated instances of Body. Without shared_ptr shared state variables must be managed manually."
The instance must be the same across Body instances, sure, but that doesn't mean it has to be dynamically allocated instead of on the stack.
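A sketch of what I mean (reusing the names from your fragment; run is mine):
[cpp]#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/enumerable_thread_specific.h"

struct mutable_state { /* Worker + expensive mutable variables */ };
typedef tbb::enumerable_thread_specific<mutable_state> TLS;

struct Body {
    TLS* tlsState; // non-owning: the TLS object outlives every Body copy
    explicit Body(TLS& tls) : tlsState(&tls) {}
    void operator()(const tbb::blocked_range<size_t>&) const {
        mutable_state& state = tlsState->local();
        // use state here...
    }
};

void run(size_t TotalWork) {
    TLS tls; // on the stack: no shared_ptr, no manual management
    tbb::parallel_for(tbb::blocked_range<size_t>(0, TotalWork), Body(tls));
    // parallel_for only returns when all tasks are done, so this is safe
}[/cpp]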
0 Kudos
renorm
Beginner
649 Views
auto_partitioner was indeed slightly slower. Compiler optimization could be the reason: with auto_partitioner, loop sizes are not known at compile time. I see no other reason.

The difference is about 5% or less in simplified tests. With any realistic load it should all go away.

P.S. I did test scalability on a hyperthreaded 8-core system, and the parallel version ran 9.4 times faster regardless of which parallel pattern I used. The speedup on a dual-core PC is exactly 100%.
0 Kudos
RafSchietekat
Valued Contributor III
649 Views
"auto_partitioner was indeed slightly slower. Compiler optimization could be the reason. With auto_partitioner loop sizes are not known at compile time. I see no other reason."
Hmm, if that were so, you would be able to nest compiler-optimisable loops inside an outer loop, and surely loops don't need to be of known size for the compiler to be able to optimise anything? Maybe some loop limit should be hoisted out of the loop into a constant, maybe something is still being shared... But it's difficult to say much without details.
0 Kudos
renorm
Beginner
649 Views
It could be false sharing or something was optimized out in my toy program. The difference became exactly zero once I switched to a realistic setup.

What is the best way to create TLS objects distinct for each thread?

I want to do the following. Run parallel_for with a dummy body to trigger lazy creation and then iterate through the TLS container. Is there any other way to trigger lazy creation?

0 Kudos
RafSchietekat
Valued Contributor III
649 Views

"What is the best way to create TLS objects distinct for each thread?"
If there are problems with enumerable_thread_specific, I'll have to defer to others.

"I want to do the following. Run parallel_for with a dummy body to trigger lazy creation and then iterate through the TLS container. Is there any other way to trigger lazy creation?"
Isn't that what enumerable_thread_specific::local() does? The reference for flattened2d has an example, I think.
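A sketch of that approach (the names are mine; the exemplar constructor makes each thread's element start at zero):
[cpp]#include "tbb/enumerable_thread_specific.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include <functional>

tbb::enumerable_thread_specific<long> counters(0L); // exemplar: elements start at 0

struct CountBody {
    void operator()(const tbb::blocked_range<int>& r) const {
        counters.local() += r.size(); // local() lazily creates this thread's element
    }
};

long total_work(int n) {
    tbb::parallel_for(tbb::blocked_range<int>(0, n), CountBody());
    return counters.combine(std::plus<long>()); // reduce over all per-thread elements
}[/cpp]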

0 Kudos
Anton_Pegushin
New Contributor II
649 Views
Hi, I'm not sure what you mean by:
"What is the best way to create TLS objects distinct for each thread?"
If this refers to enumerable_thread_specific, then an object of this class should really be viewed as a container of elements, where each element is tied to one particular thread, and there are as many elements as there are threads accessing the enumerable_thread_specific object.
0 Kudos
renorm
Beginner
649 Views
Yes, I am using enumerable_thread_specific. I can't iterate through the container before all threads have called local(). What I want is the same effect as using a distinct exemplar for each thread. One way to do it is to trigger lazy creation of all the thread-local copies and then overwrite the default-constructed elements by iterating through the container.
0 Kudos
ARCH_R_Intel
Employee
649 Views
There is a constructor for enumerable_thread_specific that takes a functor finit, which is used to generate the local copies. That could be used to create a distinct exemplar for each thread, even though initialization is lazy.
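For example (initial_value is just a placeholder):
[cpp]#include "tbb/enumerable_thread_specific.h"

double initial_value() { return 1.0; } // finit: called once per thread, on first local()

tbb::enumerable_thread_specific<double> tls(initial_value);[/cpp]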
0 Kudos
renorm
Beginner
649 Views
I see. Does finit have to be re-entrant? Do I need to synchronize the internals of finit, or are the callbacks from enumerable_thread_specific synchronized?

Another relevant question. Does TLS work with OpenMP and Boost threads?

Thank you,
renorm.
0 Kudos
Anton_Pegushin
New Contributor II
649 Views
Hello, yes, finit needs to be safe to evaluate concurrently, since it can be executed by several threads while their corresponding TLS elements are being initialized.
enumerable_thread_specific works for all threads created in the application; it doesn't matter whether they are OpenMP threads or native threads created with OS calls. Once a thread accesses its element of the enumerable_thread_specific by calling the local() member function, the element for that thread is created.
0 Kudos
renorm
Beginner
628 Views
Hi Anton,
Can finit have mutable static state? finit is passed by value, and all copies will generate the same result unless they use some shared state. To be more specific, is this Finit OK?
[cpp]struct Finit {
    static int count; // initially = 0, shared by all copies of Finit
    int operator()() {
        return count++;
    }
};
[/cpp]


0 Kudos
Anton_Pegushin
New Contributor II
628 Views
Hi, well, this is exactly what we call a non-thread-safe function :). Although the copies of finit used by different threads to evaluate their TLS elements will be different, all of them reference the same value in static memory: count. If you want this to be thread-safe, make it a:
[cpp]static tbb::atomic<int> count;[/cpp]
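Put together, the corrected functor might look like this (note the out-of-class definition the static member also needs):
[cpp]#include "tbb/atomic.h"

struct Finit {
    static tbb::atomic<int> count; // atomic: safe to increment concurrently
    int operator()() {
        return count++; // fetch-and-increment: each thread gets a unique value
    }
};
tbb::atomic<int> Finit::count; // zero-initialized before any thread runs[/cpp]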
0 Kudos