Hello, I've been writing a parallel tile-based software renderer in my spare time using TBB. Recently I've been profiling and optimizing my code.
I've been using concurrent_vector to store results/intermediate values (screen-space triangles + set-up data) in parallel algorithms, and my profiles often show a significant amount of time spent in concurrent_vector::push_back (yes, I do call concurrent_vector::reserve in advance).
I get much better results using a TLS of std::vectors, which makes sense as there is little or no contention. The downside I see is that I lose the single contiguous block of memory of a single (lock-free) vector, but that isn't a big deal really, since I still have a contiguous block of memory per thread.
I was wondering: is this generally the preferred method of collecting results in parallel algorithms? Is there anything I might be missing here?
1 Reply
With a concurrent_vector, you're still sharing the location of the last element, at read-modify-write cost at least, and there may be false sharing between elements associated with different threads that land on the same cache line. So TLS doesn't seem like a bad idea to me, e.g., enumerable_thread_specific used with flatten2d.
If you can let each thread grow a shared concurrent_vector by a sufficient number of elements at once, to amortise the cost of updating the location of the last element and reduce the number of shared cache lines, you should get comparable performance.
Corrections, feedback, test results most welcome.