enumerable_thread_specific< std::vector<T> > vs concurrent_vector<T>
Hello, I've been writing a parallel tile-based software renderer in my spare time using TBB. Recently I've been profiling and optimizing my code.
I've been using concurrent_vector to store results and intermediate values (screen-space triangles plus set-up data) inside parallel algorithms, and quite often my profiles show a significant amount of time being spent in concurrent_vector::push_back. Yes, I do call concurrent_vector::reserve in advance.
I get much better results using a TLS of std::vectors. That makes sense, as there is little or no contention. The downside I see is that I lose the single contiguous block of memory of a single (lock-free) vector, which isn't really a big deal since I still have contiguous blocks of memory per thread.
I was wondering: is this generally the preferred method of collecting results in parallel algorithms? Is there anything I may be missing here?
With a concurrent_vector, you're still sharing the location of the last element, which costs at least an atomic read-modify-write per push_back, and there may be false sharing between elements written by different threads that land on the same cache line. So TLS doesn't seem like a bad idea to me, e.g., enumerable_thread_specific used with flatten2d.
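To make the per-thread idea concrete, here is a rough sketch of the pattern using only the standard library, so it compiles without TBB. It is not TBB code: std::thread and a plain vector of buffers stand in for enumerable_thread_specific< std::vector<T> >, the nested loop in for_each_result plays the role that flatten2d would play, and Result, collect_per_thread and for_each_result are names made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for one result (e.g. a set-up triangle).
struct Result {
    std::size_t source_index;
};

// Each worker fills a vector that lives on its own stack, so the hot loop
// touches no shared counter and no shared cache line. The buffer is handed
// over once, when the worker is done.
std::vector<std::vector<Result>> collect_per_thread(std::size_t item_count,
                                                    std::size_t thread_count) {
    assert(thread_count > 0);
    std::vector<std::vector<Result>> buffers(thread_count);
    std::vector<std::thread> workers;
    workers.reserve(thread_count);
    for (std::size_t t = 0; t < thread_count; ++t) {
        workers.emplace_back([&buffers, t, item_count, thread_count] {
            std::vector<Result> local;
            // Strided static partition of the input; purely illustrative.
            for (std::size_t i = t; i < item_count; i += thread_count) {
                local.push_back(Result{i});
            }
            buffers[t] = std::move(local);  // one write to a distinct slot
        });
    }
    for (std::thread& w : workers) w.join();
    return buffers;
}

// Visit every result across all per-thread buffers without copying them
// into one contiguous block.
template <typename Visitor>
void for_each_result(const std::vector<std::vector<Result>>& buffers,
                     Visitor visit) {
    for (const auto& buffer : buffers) {
        for (const Result& r : buffer) visit(r);
    }
}
```

The consumer never needs a single contiguous array here; it only needs to iterate over all per-thread blocks, which is the trade-off described in the question.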
If you can let each thread grow a shared concurrent_vector by a sufficient number of elements at once, to amortise the cost of modifying the location of the last element and to reduce the number of shared cache lines, you should get comparable performance.
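A minimal sketch of that batching idea, again with the standard library only so it builds without TBB. SharedOutput is a hypothetical stand-in for a reserved concurrent_vector: a pre-sized array plus an atomic tail index. Each thread collects a small local batch and then claims room for the whole batch with a single atomic add; with TBB, the claiming step would presumably be a call to concurrent_vector::grow_by instead. The names SharedOutput, append_batch and produce_batched, and the use of int as the element type, are assumptions for illustration.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Pre-sized shared output with an atomic tail. Not TBB code.
struct SharedOutput {
    std::vector<int> slots;
    std::atomic<std::size_t> tail{0};

    explicit SharedOutput(std::size_t capacity) : slots(capacity) {}

    // Claim batch.size() consecutive slots with ONE atomic read-modify-write,
    // then fill them without further synchronisation.
    void append_batch(const std::vector<int>& batch) {
        const std::size_t first =
            tail.fetch_add(batch.size(), std::memory_order_relaxed);
        assert(first + batch.size() <= slots.size());
        std::copy(batch.begin(), batch.end(),
                  slots.begin() + static_cast<std::ptrdiff_t>(first));
    }
};

void produce_batched(SharedOutput& out, std::size_t item_count,
                     std::size_t thread_count, std::size_t batch_size) {
    assert(thread_count > 0 && batch_size > 0);
    std::vector<std::thread> workers;
    workers.reserve(thread_count);
    for (std::size_t t = 0; t < thread_count; ++t) {
        workers.emplace_back([&out, t, item_count, thread_count, batch_size] {
            std::vector<int> batch;
            batch.reserve(batch_size);
            for (std::size_t i = t; i < item_count; i += thread_count) {
                batch.push_back(static_cast<int>(i));
                if (batch.size() == batch_size) {
                    out.append_batch(batch);  // shared tail touched once per batch
                    batch.clear();
                }
            }
            if (!batch.empty()) out.append_batch(batch);  // flush the remainder
        });
    }
    for (std::thread& w : workers) w.join();
}
```

With a batch of, say, 64 elements the shared tail is modified 64 times less often than with one push_back per element, and each thread writes to a run of adjacent slots, so fewer cache lines end up shared between threads.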