Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.
2455 Discussions

parallel_reduce performance - filter vector


Hi there,

I've been playing around with some performance measurements removing elements from a vector (as I'm sure everyone reading this is aware there are a number of varying methods each with different performance characteristics. They range from the N squared to N). In my example, I have a bunch of entities and I'd like to remove all the ones that are no longer alive.

I wanted to see if it was possible to speed up one of these algorithms with parallel_reduce, but I've found the performance is always worse when attempting to use tbb::parallel_reduce as opposed to just using a serial algorithm.

I feel like I must be doing something wrong so wanted to ask for expert advice on how best to implement this using tbb.

A serial version of the algorithm might look like this:

void removeDeadEntities_remove() {
            entities_.begin(), entities_.end(),
            [](const Entity& entity) { return !entity.alive_; }),

This is the most efficient serial version I've tested.

Now suppose the number of entities could grow to be very large (say millions) and I'd like to use parallel_reduce to get decent performance in both medium and large cases.

I created a parallel_reduce function object that looks like this:

struct EntityReduce
    std::vector<Entity> alive_;
    EntityReduce(EntityReduce&, tbb::split) {}
    void operator(const tbb::blocked_range<std::vector<Entity>::iterator>& r) {
        std::vector<Entity> alive;
        alive.reserve(r.end() - r.begin());
            r.begin(), r.end(), std::back_inserter(alive),
            [](const Entity& entity){ return !entity.alive_; });
        alive_.insert(alive_.end(), alive.begin(), alive.end());

    void join(const EntityReduce& reduce) {
        alive_.insert(alive_.end(), reduce.alive_.begin(), reduce.alive_.end());

And then I call it like this

    tbb::blocked_range<std::vector<Entity>::iterator>(entities_.begin(), entities_.end()), 

Now this isn't ideal as I'm using O(n) amount of memory here as opposed to the serial version but because of iterator invalidation I can't have two threads operating on the same vector at the same time and calling erase (I've tried a version where I use the erase-remove idiom on a copy of the blocked_range passed in, and then combine it with the member alive_, but this didn't give great results either).

Is there an optimum approach to making an algorithm such as this parallel? Do I need to use something other than parallel_reduce? Is there a way to reduce the amount of copying required to support a parallel version?

I'd be very interested to hear any advice on how best to approach this particular problem.

Thank you very much for your help.



0 Kudos
3 Replies


    Have you thought about the concurrent vector?  You can read about the concurrent vector in the documentation (you can get there from -> Documentation -> latest -> Reference Guide):

In addition, this is an old discussion but might be still relevant:

Specifically quoting Anton M.:


You could use the same technique as std::vector::erase() uses - shifting items ina tail to newly free space. But it is not thread-safe in terms of parallel access to the same items. And it doesn't deallocate the memory.

In order to free a memory, you need copy anyway.


You could also watch the excerpt from the recorded webinar, starting time 3:45:

It is recommended to use the TBB concurrent list if you would like to erase the element.

I will also alert TBB Developer about your post.


0 Kudos

Hi Tom,

If the compiler you are using implements C++17 version of the standard, there should be execution
policies available, which should do all the tricky magic under the hood, at least in theory.

In this case, in your first code snippet simply put necessary execution policy (such as "par" or
"par_unseq") to the call of remove_if.

As we know, remove_if algorithm sort of splits the range of elements into two parts: those which
satisfy the predicate and those which do not.

The problem with executing remove_if in parallel is that each individual subtask running on a
subrange of elements does not know where to move the element, since the other subtasks running in
parallel with each other are competing against the current value of the index where to move their
element(s) satisfying the predicate.

If the order of the elements in the vector is not important then you might want to use the atomic
counter that holds the index on the current pivot element in the container. This solution is good
since it does not use additional memory for intermediate results, though it is still bad since this
atomic counter becomes a shared memory that is heavily used, which worsens with increasing number of
participating threads. However, this solution still maybe considered as good enough solution.

As for the memory use, the solution at the end of your post in fact might use much more then O(n) of
additional memory, since the bodies of parallel_reduce might be copied several times during the

A better implementation of the remove_if is implemented in Parallel STL, which uses O(n) of additional memory. You can find Parallel STL project here. In particular, remove_if with "par" execution policy dispatches to this

Hope this helps,
Aleksei F.

0 Kudos

Hi Mark and Aleksei,

Thank you both very much for getting back to me and for your detailed replies. It's good to know I wasn't missing something obvious. I'll take a look at both concurrent vector and the parallel execution policies with the C++17 std library, thanks for all the links too! I'll be sure to check them out!

Thanks very much for your help,

Kind regards,



0 Kudos