The Fibonacci example is

davidlofstrand · ‎06-23-2014

Hello there,

I'm working on a medical simulator where we have located a bottleneck that should be fairly easy to fix since it's a straight-forward recursive routine.

The located bottleneck is at line 9 in the code section below : if (!n1.m_bound.overlap(n2.m_bound))

Naturally I've already tried a stack based while loop instead of recursion but that actually ended up slowing down the execution so I went back to my original recursive implementation which in essence is the code I've pasted below.

The top-level pseudo-code calls to the collision routine are basically:

collide(object_a_root_index, object_a_root_nodes, object_b_root_index, object_b_root_nodes, resulting_ab_overlaps);

I have a couple simple questions before I begin to parallelize this code.

1) Would you suggest me use the fibonacci example as a starting point to parallelize this code or do you have more suitable example(s) for this problem?

2) In the fibonacci example you mention that tuning the cutOff parameter "requires some experimentation". I assume that you basically tune this parameter by hand and then reprofile the code until it's optimal for a specific target hardware, right?

3) I'm going to assume that you'll recommend using tbb::concurrent_vector instead of using the concurrency::concurrent_vector. Besides portability issues, what are the major differences between your implementation and the one provided in Parallel Patterns Library from Micrsoft?

  void collide(const int i1, const std::vector<BVHNode> &nodes1,
               const int i2, const std::vector<BVHNode> &nodes2,
               std::vector<std::pair<int,int>> &result)
  {
    const BVHNode &n1 = nodes1[i1];
    const BVHNode &n2 = nodes2[i2];

    //++num_overlap_tests;
    if (!n1.m_bound.overlap(n2.m_bound))
      return;

    if (n1.is_leaf_node())
    {
      if (n2.is_leaf_node())
        result.emplace_back(n1.m_primitive_idx, n2.m_primitive_idx);
      else
      {
        if (n2.m_left_idx >= 0)
          collide(i1, nodes1, n2.m_left_idx, nodes2, result);
        if (n2.m_right_idx >= 0)
          collide(i1, nodes1, n2.m_right_idx, nodes2, result);
      }
    }
    else
    {
      if (n1.m_left_idx >= 0)
        collide(n1.m_left_idx, nodes1, i2, nodes2, result);
      if (n1.m_right_idx >= 0)
        collide(n1.m_right_idx, nodes1, i2, nodes2, result);
    }
  }

Thanks in advance,

David Löfstrand

RafSchietekat · ‎06-24-2014

The Fibonacci example is about direct use of tasks (always the underlying machinery in TBB), but you should use a pre-packaged algorithm whenever you can, and a simple parallel_invoke() seems quite suitable here.

The tuning is probably not very dependent on the hardware, unless the sequential base case does something radically different other than just being sequential.

I don't know about Microsoft's concurrent_vector.

BTW, why exactly do you blame line 9, specifically?

jimdempseyatthecove · ‎06-24-2014

From looking at the code, I think the biggest potential misstep you might make is as you parallel recurse that you will pass the same reference to std::vector..."result" and that each thread will then have issues with collisions with result.emplace_back(n1.m_primitive_idx, n2.m_primitive_idx);

Jim Dempsey

davidlofstrand · ‎06-24-2014

Hello and thanks for the feedback!

I'm aiming to run the code I'm writing at roughly 1000Hz with quite a few overlap tests. Currently I reach about 550-600Hz using the sequential code I provided here. Naturally we have a lot more going on besides this; rebuilding the BVH tree, running the dynamics and calculating the collision resolution amongst a few things.

When I completely ignore the collision detection by just doing return at the beginning of the function the framerate is slightly above 2000Hz. This means that the time spent in this routine combined with the added collision resolution can only cost 1ms.

When I profile the code in VTune I can tell that the collision resolution is practically negligble in comparison to the collide() function.

The reason to why I focus on line 9 is a good question... It takes a significant portion of the time according to VTune but there is also a significant part of the time spent in the recursive calls. I think I just wrote down what was on top of the list.

Most of the previous bottlenecks were efficiently removed with trivial paradigms like parallel_for or parallel_reduce and by carefully choosing how to structure the data in order to reduce cache-misses. In general I was thinking that parallelizing the recursion could bring me a big chunk of the 0,6ms of improvement that I need.

The inital parallel implementation which I wrote today worked but didn't improve the speed at all but I suspect that I've done it wrong so I'll try out the parallel_invoke before I continue using tbb::task again.

As a side note I can tell that I dropped out on Microsoft's code since I got a huge amount of compiler warnings because they seem to throw a lot of exceptions which I don't catch. In any case I always try to avoid dabbling with c++ exceptions, especially in time-critical code.

I don't see how I can change the need to store the results in some parallel container in order to use it in the parallel recursion, or am I wrong?

David Löfstrand

RafSchietekat · ‎06-24-2014

I don't know if that's the answer here, but you could use tbb::enumerable_thread_specific with a normal std::vector to avoid fine-grained additions to a shared tbb::concurrent_vector (because you don't need to read from the result until after the computation has completed).

Also, the recursion seems a little single-sided, fully descending the left argument before anything is done with the right argument. Without knowing anything about the geometry, I would hazard a guess that you might have more luck short-circuiting the search if you descend both trees by alternating based on depth: even depth gets to descend into the left argument, odd depth gets to descend into the right argument. That way the bounding volumes on both sides decrease in size as you recurse, which should be clearly beneficial for a hypothetical starting point where both sides fully coincide.

If that still doesn't do it for you, you might start thinking about parallel_for with a custom Range, using auto_partitioner to keep the parallelism in check without having to spend too much effort with the tuning of an appropriate cut-off depth (too shallow means not enough slack, too deep means too much overhead).

(Added) Evaluate and possibly try the second suggestion first, because it applies to both a sequential and a parallel implementation.

jimdempseyatthecove · ‎06-25-2014

Using parallel_invoke to create 2 tasks, one of the tasks can take the original result, and the other task can use a otherResult, where otherResult is defined prior to scope of parallel_invoke. After completion of the parallel_invoke you will then need to append otherResult pairs (if any) to result.

Jim Dempsey

RafSchietekat · ‎06-25-2014

That's a lot of (costly) appending!

jimdempseyatthecove · ‎06-25-2014

Untested code:

void collide(const int i1, const std::vector<BVHNode> &nodes1,
             const int i2, const std::vector<BVHNode> &nodes2,
             std::vector<std::pair<int,int>> &result)
{
  const BVHNode &n1 = nodes1[i1];
  const BVHNode &n2 = nodes2[i2];

  //++num_overlap_tests;
  if (!n1.m_bound.overlap(n2.m_bound))
    return;

  if (n1.is_leaf_node())
  {
    if (n2.is_leaf_node())
      result.emplace_back(n1.m_primitive_idx, n2.m_primitive_idx);
    else
    {
      if ((n2.m_left_idx >= 0) && (n2.m_right_idx >= 0))
      {
        std::vector<std::pair<int,int>> otherResult;
        parallel_invoke(
          [&]() { collide(i1, nodes1, n2.m_left_idx, nodes2, result); },
          [&]() { collide(i1, nodes1, n2.m_right_idx, nodes2, otherResult); });
        // copy
        for(std::vector<std::pair<int,int>>::iterator it = otherResult.begin();
            it != otherResult.end();
            ++it) {
          result.emplace_back(it->first, it->second);
        }
      }
      else
      {
        if (n2.m_left_idx >= 0)
          collide(i1, nodes1, n2.m_left_idx, nodes2, result);
        if (n2.m_right_idx >= 0)
          collide(i1, nodes1, n2.m_right_idx, nodes2, result);
      }
    }
  }
  else
  {
       if ((n1.m_left_idx >= 0) && (n1.m_right_idx >= 0))
      {
        std::vector<std::pair<int,int>> otherResult;
        parallel_invoke(
          [&]() { collide(n1.m_left_idx, nodes1, i2, nodes2, result); },
          [&]() { collide(n1.m_right_idx, nodes1, i2, nodes2, otherResult); });
        // copy
        for(std::vector<std::pair<int,int>>::iterator it = otherResult.begin();
            it != otherResult.end();
            ++it) {
          result.emplace_back(it->first, it->second);
        }
      }
      else
      {
        if (n1.m_left_idx >= 0)
          collide(n1.m_left_idx, nodes1, i2, nodes2, result);
        if (n1.m_right_idx >= 0)
          collide(n1.m_right_idx, nodes1, i2, nodes2, result);
      }
  }
}

Jim Dempsey

jimdempseyatthecove · ‎06-25-2014

Note,

Yes, there is copying involved. As to if the overhead exceeds that of a concurrent vector, one will have to run the tests.

The copying an be reduced by "new"-ing the otherResult's, then in place of copy, pushback to a concurrent vector. (assuming parallel allocator in play) This way there is at most one atomic push per recursion level.

Then at end of tree parsing, tally up the final size requirement.

construct the end "result" vector with the known size

then run a parallel_for on the vector of results references.

Each pick performs an atomic exchange and add of the size of the picked vector to the count of entries in the result record. The former value of the count is the insertion point.

Perform block copy (note making one copy).

delete picked vector.

QED

Jim Dempsey

davidlofstrand · ‎06-26-2014

Hi again,

Thanks both for your suggestions.

The code now roughly runs another 100Hz faster but it didn't reach 1000Hz.

Unfortunately the last suggestion didn't work due to row 20 and 44. I think that the reason to why is that the vector dynamically constructs/destructs which hurts the performance a lot!

I tried out enumerable_thread_specific combined with a trivial left/right selection metric which worked ok. For this test code I added a global for the thread local storage which means that I can't call collide_parallel in parallel but that's fairly easy to work around anyway so I'll just post the code in it's current state.

  typedef std::pair<int, int> CollisionPair;
  typedef tbb::enumerable_thread_specific< std::vector<CollisionPair> > ResultVecVec;
  ResultVecVec gResultVectors;

  void collide(const int rec_level,
               const int i1, const std::vector<BVHNode> &nodes1,
               const int i2, const std::vector<BVHNode> &nodes2,
               std::vector<CollisionPair> &result)
  {
    const BVHNode &n1 = nodes1[i1];
    const BVHNode &n2 = nodes2[i2];

    //++num_overlap_tests;
    if (!n1.m_bound.overlap(n2.m_bound))
      return;

    if (n1.is_leaf_node())
    {
      if (n2.is_leaf_node())
        result.emplace_back(n1.m_primitive_idx, n2.m_primitive_idx);
      else
      {
        int left = (rec_level&1) ? n2.m_left_idx : n2.m_right_idx;
        int right = (rec_level&1) ? n2.m_right_idx : n2.m_left_idx;
        if (left >= 0)
          collide(rec_level + 1, i1, nodes1, left, nodes2, result);
        if (right >= 0)
          collide(rec_level + 1, i1, nodes1, right, nodes2, result);
      }
    }
    else
    {
      int left = (rec_level&1) ? n1.m_left_idx : n1.m_right_idx;
      int right = (rec_level&1) ? n1.m_right_idx : n1.m_left_idx;
      // TODO : add some selection metric between left/right
      if (left >= 0)
        collide(rec_level + 1, left, nodes1, i2, nodes2, result);
      if (right >= 0)
        collide(rec_level + 1, right, nodes1, i2, nodes2, result);
    }
  }

  const int cRecursionThreshold = 4;
  void collide_parallel(const int rec_level,
                        const int i1, const std::vector<BVHNode> &nodes1,
                        const int i2, const std::vector<BVHNode> &nodes2)
  {
    if (rec_level > cRecursionThreshold)
    {
      collide(rec_level, i1, nodes1, i2, nodes2, gResultVectors.local());
      return;
    }
    const BVHNode &n1 = nodes1[i1];
    const BVHNode &n2 = nodes2[i2];
    //++num_overlap_tests;
    if (!n1.m_bound.overlap(n2.m_bound))
      return;
    if (n1.is_leaf_node())
    {
      if (n2.is_leaf_node())
        gResultVectors.local().emplace_back(n1.m_primitive_idx, n2.m_primitive_idx);
      else
      {
        int left = rec_level&1 ? n2.m_left_idx : n2.m_right_idx;
        int right = rec_level&1 ? n2.m_right_idx : n2.m_left_idx;
        if (left >= 0 && right >= 0)
        {
          tbb::parallel_invoke(
            [&]() { collide_parallel(rec_level+1, i1, nodes1, left, nodes2); },
            [&]() { collide_parallel(rec_level+1, i1, nodes1, right, nodes2); });
        }
        else
        {
          if (left >= 0)
            collide_parallel(rec_level+1, i1, nodes1, left, nodes2);
          if (right >= 0)
            collide_parallel(rec_level+1, i1, nodes1, right, nodes2);
        }
      }
    }
    else
    {
      int left = rec_level&1 ? n1.m_left_idx : n1.m_right_idx;
      int right = rec_level&1 ? n1.m_right_idx : n1.m_left_idx;
      if (left >= 0 && right >= 0)
      {
        tbb::parallel_invoke(
          [&]() { collide_parallel(rec_level+1, left, nodes1, i2, nodes2); },
          [&]() { collide_parallel(rec_level+1, right, nodes1, i2, nodes2); });
      }
      else
      {
        if (left >= 0)
          collide_parallel(rec_level+1, left, nodes1, i2, nodes2);
        if (right >= 0)
          collide_parallel(rec_level+1, right, nodes1, i2, nodes2);
      }
    }
  }

  void collide_parallel(const int i1, const std::vector<BVHNode> &nodes1,
                        const int i2, const std::vector<BVHNode> &nodes2,
                        std::vector<CollisionPair> &result)
  {
    for (auto &ref : gResultVectors)
      ref.clear();
    //gResultVectors.clear();
    collide_parallel(0, i1, nodes1, i2, nodes2);
    for (const auto &ref : gResultVectors)
      result.insert(result.end(), ref.begin(), ref.end());
  }

I'm open for suggestions so if you should happen to have more ideas please let me know.

Just one side note here. I choose to loop over all vectors and clear each one specifically instead of clearing the thread local storage variable. The reason to this is that std::vector::clear() will retain the capacity of the internal vectors keeping memory (re)allocation to a minimum once you reached the most complex collision state. I didn't check the tbb implementation in detail to see if row 107, the comment, would work with respect to that as well but if so please let me know.

David Löfstrand

RafSchietekat · ‎06-26-2014

Maybe you can use gResultVectors directly with tbb::flattened2d.

I don't think you quite nailed the alternating descent yet. You're merely reordering the output, if an order is defined at all (it is for the sequential base case, but not for the parallel steps), but you're still fully descending into the left BVH tree before you start descending into the right BVH tree (left and right being the arguments to the functions, not the subnodes of any tree node). :-)

You should get a significant boost from correcting the alternating descent. The tbb::flattened2d could help a little bit, maybe.

Then you can still see whether you want to try the custom Range with a tbb::parallel_for(). That one might even work with parallel_reduce(), because there would be far fewer intermediate vectors to grow, append, and deallocate, since a tbb::parallel_reduce() Body is preferentially reused, but I still wouldn't try that except out of curiosity.

I agree with recycling those thread-local vectors.

(Added) Note that the suggestion to alternate the descent is to allow more compact code. There already is a duplication that can be avoided by changing (argument) orientation instead of using the n1 and n2 references, which are like const pointers (to be distinguished from pointers to const): use real pointers, so that you can swap them. You should separate the decision of which tree to descend into from the actual descent code. If neither side is a leaf node, always choose the left argument to descend into, but in all cases exchange arguments to the recursive call to achieve the effect of alternating without variables and tests, which in that case would have as effect calling recursive_function(right_argument, left_argument.left_subtree) and recursive_function(right_argument, left_argument.right_subtree), except that you wouldn't use orientation-based names because the code would be shared with the case where only one argument is a non-leaf. If result orientation is relevant, pass the current (argument) orientation along in the recursion, and use it to choose between two emplace_back() calls or to swap temporary variables before a common emplace_back(), whichever is more convenient.

jimdempseyatthecove · ‎06-27-2014

Raf has some good points.

On issue with using rec_level is when the tree isn't balanced, with respect to work, you will end up with threads starved for work.

To handle this, you could use a global atomic variable, initialized to 0, and incremented as each thread pops all the way out. Then at the point where you test to recurs or not, you include the atomic variable. Something like

if (rec_level - AtomicFinishedThreads > cRecursionThreshold)

You will have to determine if it is less expensive for the other threads to permit additional threads to call collide_parallel or if is less expensive to use a CAS section crafted to let only one thread take and decrement AtomicFinishedThreads.

Jim Dempsey

davidlofstrand · ‎06-27-2014

Thanks!

I'm going to look more into this problem next week.

RafSchietekat · ‎06-28-2014

Instead of alternating (even) descent, you might also consider preferentially prying open the largest bounding volume or the one needing the largest bounding cube (having the largest largest side), or some other heuristic, depending on the geometries being studied, the goal always being to short-circuit analysis as soon as possible (at the right price!).

That little argument-swapping idea I proposed earlier is just a one-trick pony (for alternating descent), and I'm not even sure that it does anything really useful as soon as an argument/result orientation value is getting passed (I didn't actually try it), but I also don't immediately see that it would work against other strategies if you start with it as the fall-back strategy. Still, this is just about convenience/preference (not about algorithm efficiency), so feel free to use or ignore.

jimdempseyatthecove · ‎06-28-2014

Raf,

I haven't dug around in the TBB source files (or read much of the internal functions). Is there a function that returns the number of threads that are able to work-steal? If so, then in place of recursion level if(tbb::HypotheticalFunctionToGetNumberOfThreadsAvailableToWorkSteal())

While you could get a race condition, it will only occur at the end points (start and end of outer most level).

Jim Dempsey

RafSchietekat · ‎06-28-2014

I would first explore use of auto_partitioner, as suggested earlier, because that is how TBB works out of the box. This is vulnerable in case of highly uneven work distribution, but still worth the first shot, I think.

(2014-06-29 Retracted speculative comments.)

Recursive collision code in tbb