Thank you,
renorm.
The actual grain size doesn't matter. What matters is the amount of work each task performs, or the amount of work between successive writes to the atomic variable. I will post a compilable fragment sometime later.
On a 2-core machine the TLS versions slowed down more as the grain size decreased. The TLS version replaces each write to the atomic with a task creation. How does task creation compare to atomic writes (assuming tasks are lightweight)?
@Raf
I do roughly the same thing as you suggested in #15. Each thread buffers all of its data into a vector and puts the vector into a concurrent_vector of vectors using the swap method. It avoids moving data around and minimizes access to the concurrent_vector.
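A minimal standard-C++ sketch of the buffering pattern described above — since the original code isn't posted, std::thread and a mutex-protected vector of vectors stand in for the TBB container; the names here are illustrative:

```cpp
#include <mutex>
#include <thread>
#include <vector>

std::vector<std::vector<int>> results;  // shared container of per-thread buffers
std::mutex results_mutex;

void worker(int id, int count) {
    std::vector<int> local;              // thread-private buffer: no sharing here
    local.reserve(count);
    for (int i = 0; i < count; ++i)
        local.push_back(id * count + i);

    // One short critical section per thread: swap the buffer in, which
    // exchanges internal pointers instead of copying elements.
    std::lock_guard<std::mutex> lock(results_mutex);
    results.emplace_back();
    results.back().swap(local);
}

int total_collected(int threads, int per_thread) {
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back(worker, t, per_thread);
    for (auto& th : pool) th.join();

    int total = 0;
    for (const auto& v : results) total += static_cast<int>(v.size());
    return total;
}
```

With `tbb::concurrent_vector` in place of the mutex-protected vector, even the single lock per thread goes away; the swap trick carries over unchanged.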
"I do roughly the same thing as you suggested in #15."
But only part of it... :-)
"Each threads buffers all data into a vector and puts the vector into concurrent_vector of vector using swap method."
You don't need to swap at all, because concurrent_vector entries stay put right where they are by design.
"It avoids moving data around and minimizes access to the concurent_vector."
A reference or pointer stays valid as well, if you want to avoid evaluating a concurrent_vector index (how expensive is that, anyway?).
Alignment can be useful to avoid false sharing between unrelated items if at least one of them is written often. If it's mostly reads, or if anybody writing something is going to write the other items also, there's no need to keep them apart.
Well, that's my understanding, feel free to point out any mistakes or omissions.
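To illustrate the point about keeping often-written items apart, here is a small sketch that pads each counter to its own cache line with `alignas` (assuming 64-byte lines; the counter layout is hypothetical, not from the thread):

```cpp
#include <atomic>
#include <cstddef>
#include <thread>

constexpr std::size_t kCacheLine = 64;

struct alignas(kCacheLine) PaddedCounter {
    std::atomic<long> value{0};
    // alignas rounds the struct size up to a full cache line, so two
    // threads incrementing adjacent counters never contend on one line.
};
static_assert(sizeof(PaddedCounter) == kCacheLine, "padded to a full line");

long sum_counters(long iters) {
    constexpr int kThreads = 4;
    PaddedCounter counters[kThreads];  // stack array: alignment is honored
    std::thread pool[kThreads];
    for (int t = 0; t < kThreads; ++t)
        pool[t] = std::thread([&counters, t, iters] {
            for (long i = 0; i < iters; ++i)
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : pool) th.join();

    long total = 0;
    for (auto& c : counters) total += c.value.load();
    return total;
}
```

As the post says, the padding only pays off when at least one of the items is written often; for read-mostly data it just wastes cache.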
How does tbb::cache_aligned_allocator manage its memory pool? Does it keep deallocated memory for quick recycling? I noticed that after cache-aligned containers go out of scope, the process's memory usage doesn't drop. The same thing doesn't happen with the STL allocator.
(That wasn't very clear without the addition.)
[cpp]// Bundle together Worker and expensive mutable variables
struct mutable_state;

// Body is cheap to copy
struct Body {
    // Default copy constructor is OK and parallel_for can create a
    // large number of copies for better work balancing.
    shared_ptr<enumerable_thread_specific<mutable_state>> tlsState;
    void operator()(const blocked_range<int>&) const {
        mutable_state& state = tlsState->local();
        // use TLS here...
    }
};

// auto_partitioner is used by default
parallel_for(blocked_range<int>(0, TotalWork /* can be a big number */), Body());[/cpp]
tbb::flattened2d seems nice to process the results.
Do tell us what you find out about performance.
auto_partitioner was slightly slower.
shared_ptr is needed because there could be multiple unrelated instances of Body. Without shared_ptr, the shared state variables must be managed manually.
That's difficult to believe.
"shared_ptr is needed because there could be multiple unrelated instances of Body. Without shared_ptr shared state variables must be managed manually."
The instance must be the same across Body instances, sure, but that doesn't mean it has to be dynamically allocated instead of on the stack.
The difference is about 5% or less in simplified tests. With any realistic load it should all go away.
P.S. I did test the scalability on a hyperthreaded 8-core system and the parallel version ran 9.4 times faster regardless of the parallel pattern I used. The speedup on a dual-core PC is exactly 100%.
Hmm, if that were so, you would be able to nest compiler-optimisable loops inside an outer loop, and surely loops don't need to be of known size for the compiler to be able to optimise anything? Maybe some loop limit should be hoisted out of the loop into a constant, maybe something is still being shared... But it's difficult to say much without details.
What is the best way to create TLS objects distinct for each thread?
I want to do the following. Run parallel_for with a dummy body to trigger lazy creation and then iterate through the TLS container. Is there any other way to trigger lazy creation?
"What is the best way to create TLS objects distinct for each thread?"
If there are problems with enumerable_thread_specific, I'll have to defer to others.
"I want to do the following. Run parallel_for with a dummy body to trigger lazy creation and then iterate through the TLS container. Is there any other way to trigger lazy creation?"
Isn't that what enumerable_thread_specific::local() does? The reference for flattened2d has an example, I think.
Another relevant question. Does TLS work with OpenMP and Boost threads?
Thank you,
renorm.
Can finit have mutable static state? finit is passed by value and all copies will generate the same result unless they use some shared state. To be more specific, is this Finit OK?
[cpp]struct Finit {
    static int count; // initially = 0
    int operator()() { return count++; }
};[/cpp]
[cpp]static tbb::atomic<int> count;[/cpp]
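The reason for the atomic, sketched in standard C++ (std::atomic here stands in for tbb::atomic, and the names are illustrative): several threads may run the finit concurrently, so a plain `int count` would race, while an atomic increment hands each caller a distinct value.

```cpp
#include <atomic>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

std::atomic<int> next_id{0};

struct Finit {
    int operator()() const { return next_id++; }  // atomic increment: race-free
};

// Each thread draws one id via the finit; all ids must come out distinct.
bool ids_are_unique(int threads) {
    std::set<int> seen;
    std::mutex m;
    std::vector<std::thread> pool;
    Finit finit;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            int id = finit();
            std::lock_guard<std::mutex> lock(m);
            seen.insert(id);
        });
    for (auto& th : pool) th.join();
    return static_cast<int>(seen.size()) == threads;
}
```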