parallel_tree

jsgonzal · ‎12-26-2007

Hi,

I dont know how to say this well, because I'm just starting to think about it, so I apologize if is not clear.

I have a Event-driven particle simulator that schedule the events (collisions between particles) in a CBT. When an event is attended the tree change, but "just a little", because particles collide, change their momentum and there are new collisions to attend. As this is a extended system I can attend an event that if is far enough of the present one without changing the evolution of the system.

The serial algorith is like this.
1.- Find first events.
2.- Make a CBT.
3.- Attend the first one (evolve the time and change momentum).
3.1.- Predict next events for the particles.
3.2- Add those events to the CBT.
3.3.- Search the first event in the tree.
3.4.- Go to 3

Parallel Version:

My idea is to attend in parallel events that are far enough from each other, because the "information" take a while to spread over the system (something like a light-cone in relativity).

So, my question: is there a way to do something like this? is this scheme applicable to other problems?

I'll be glad to extend the discussion in this.
Best,
sebastian

AJ13 · ‎12-31-2007

Hey,

This is an interesting problem. What do you mean by CBT?

It seems to me that you need to be able to divide your computational "universe" into 3D chunks (assuming you are working in 3D), and process chunks that are far from one anonther inwards to chunks that are close to one another. So your algorithm would start out most parallel at the outer edges, and become increasingly serial as you come closer to the center.

I believe that you would have to write a custom work scheduling algorithm to do this, using the TBB scheduler. The parallel_for construct might be useful, but I can't see any method by which you could construct a custom Range to control processing based on distance. Perhaps, instead you could process ranges like an onion, zooming into the center.

This of course doesn't help with the fact that some particles are close together within those "far" blocks of processing.... so that doesn't help you much. Some initial ideas though.

I am also interested in parallel simulation, and have written a discrete-event simulation engine using TBB called YetiSim, see www.yetisim.org. I use a state machine constructed dynamically in memory by the user, to control execution of the simulation. I am confronted by a similar problem to yours, in the sense that entities should be processed in a specific order, depending upon the type of simulation.

ARCH_R_Intel · ‎01-03-2008

Googling "parallel particle simulation" yields a lot of prior work. My impression is that space-based partitions are popular (i.e. exploiting the "light cone" property). Can you explain the tree structure in more detail? For example,if Y is a child of X in the tree, what does that mean in your tree?

jsgonzal · ‎01-03-2008

Hi, thanks for the answer.

CBT stand for Complet Binary Tree. It is used to find the minimum time in log(n) steps.

The problem with a static division of the space (yes, it's 3d) is that as the simulation is asynchronous the different chunks can have different times, and this can make the particles to penetrate into each other (I don't know if I already say that I'm simulating hard spheres).

I think that the way of doing this could be like the last example in the book of TBB, make a tbb::task that attend the collision in parallel if they are far enough.

jsgonzal · ‎01-04-2008

MADadrobiso:
Googling "parallel particle simulation" yields a lot of prior work. My impression is that space-based partitions are popular (i.e. exploiting the "light cone" property). Can you explain the tree structure in more detail? For example,if Y is a child of X in the tree, what does that mean in your tree?

Indeed, there is a parallel MPI version of the code ("Event-driven hard-particle molecular dynamics using bulk synchronous parallelism") but as far as I know, the implementation spend a lot of time waiting, so it's impractical (and our group don't use it). The nice thing about our algorithm is that is really fast compared with time driven algorithms, and if it could be multicored that will look really nice on my thesis :)

If Y is a child of X means that the time for the next event of Y if larger than X's next event. So, the root of the three is the first next event of all the system (and the one to be attended by the serial program).

Attached is a screenshot of my teacher's presentation, it says somthing like:
"Tree
The structure correspond to a Complete Binary Tree.
-The next event correspond to the particle 10 with t= 0.10.
-After attend it, the next event's time for particle 10 is t=0.18."

For more information: Efficient Simulations of Microscopic Fluids: Algorithm and Experiments

P. Cordero, M. Marn, and D. Risso

Chaos, Solitons & Fractals, volume 6, pages 95-104, (1995)

Efficient algorithms for many-body hard particle molecular dynamics
Source Journal of Computational Physics archive
Volume 109 , Issue 2 (December 1993) table of contents
Pages: 306 - 317
Year of Publication: 1993
ISSN:0021-9991

ARCH_R_Intel · ‎01-04-2008

Thanks for the detailed explanation. With real-time graphics, would make a flashy demo. (Aside: This reminds me of the time Martin Gardner wrote about how the devil plays pool and starts with a triangle that's 36 balls on a side, for a total of 666 balls. Nevada is the 36th state and there are 36 combinations of two dice.)

By the way, the CBT is providing functionality that is often known as a "priority queue". Wikipedia has an article with that title which seems to cover the common implementations.

You might try to apply time warp ideas or some other form of speculative execution. Here's a link to one article on time warp. (I've never used time warp, but studied it a little bit back in the late 80's as background for my thesis.) E.g., serially grab P events from the top of the tree, see if they are sufficiently separated ("space-like" in relativity parlance) and process them in parallel.

We're tried coding various concurrent priority queues/heaps in the past, and found them to be slower than the equivalent heap with a big fat lock around it. The overhead of atomic operations, and contention at the root, have been the problems. Perhaps what you need is not a tree, but a "shrub" with multiple roots, and somehow percolate stuff through the shrub so that "space-like" events end up at different roots?

jimdempseyatthecove · ‎01-04-2008

Sebastian,

Your colliding particle (sphere) application sounds like an interesting project. It is kind of the inverse effect of what I am working on. The research work that I am involved with uses a Finite Element simulation of tethers and objects (Space Elevator work). The tethers are composed of beads (mass, position, velocity, acceleration and a few other properties), connected by segments (material properties of the tether elastic area, length, length rate, Youngs Modulus, spring damping constant, and many other attributes). The objects are additional bodies (planets, moons, satellites, spacecraft/equipment attached to the tether).

The simulation has to handle the situation where a tether goes slack. In which case tension of the slack segment goes to zero. The simulation used to have a fixed time integration step size. This proved to be untenable when a segment transitioned from slack to taunt as well as taunt to slack. This transition is quite similar to your collision of spheres in that if your integration time is fixed and you step to far into the collision, the recoil from the collision is akin to a Flubber effect whereby more energy comes out of the collision than when into the collision.

The solution to both problems (my tether and your spheres) is to vary the integration step size such that if a collision were to occur that the step size coincided with the approximate minimal intrusion of the interface within the precision of the calculation. In my case I couldnt go below 1.0E-15 second but this is dependent on other factors in the model. (i.e. yours may be different).

In your simulation it was not disclosed if the interaction was only collision or if the interaction was due to additional properties (heat, electrostatic, gravitational, viscosity, etc). If collision only then it is minimumtime-of-flight. If the other interaction properties are involved then it would be the minimum of time of flight with whatever the desired fidelity of the other interactions was required.

The approach I would like to suggest, for the lack of a better term, is a collection of bean bags type of approach. Beans being particles and bean bag being a collection of beans to work with. The number of bean bags could potentially be dynamic but will likely gravitate towards a particular number for the given problem. The number of beans per bag is not static but will vary from time to time dependent on computational loads. The problem solving portion of the application would tend to have as many threads as there are available cores.

The complete algorithm is a bit too complex to list here but the general idea is to place the beans in a hierarchy dependent on minimum time-of-flight to collision. In the same bean bag, different bags owned by the same thread, bag owned by other thread and finally bag in free queue. The simulation advances through a process collisions, shuffle beans, shuffle bean bags, determine step size.

We could discuss this off thread if you desire

jim_dempsey@ameritech.net

jsgonzal · ‎01-04-2008

Thanks for the hint about priority queue. The documentation in our code is specially bad in that part (it was made by a Computer Scientist lol).

I like the idea of the "shrub" but I think that is complicated, because one should change the shrub every step (because of the distribution of events in space).

Maybe one can take a time interval dt and search the first N events that are disjoint (all the events are space-like "from" each other) in this interval. If the system is big enough and dt is small one can have a lot of this events and attend them in a parallel_for. This need an atomic queue, doesn't it?

How much slower was the concurrent priority queue? I spend just 5% of time in scheduling the queue and 75-80% in attending the events. So I have a lot of time to spend there.

(BTW, the papers in the 80's were really hard to read :) and I think that it's quite difficult implement these ideas of message passing into tbb...)

jsgonzal · ‎01-04-2008

Jim,

In principle, our simulation doesn't have a fixed time step, the particles with their collisions are the ones that set the evolution of the system. The system advance from one collision to other. Because of this, is much faster than a time driven integration (with a potential like leonard jones, for example).

The interactions between the particles are just the instantaneous collisions. Between collisions the particles move free under the effect of gravity, this makes the equation for the time of the collisions just quadratic (more relativity :)).

Best,
s

ARCH_R_Intel · ‎01-07-2008

I don't remember how much slower the concurrent priority queues were. I just remember they were slower than using a plain priority queue with a spin lock around it. So you might implement the atomic priority queue by using a a plain priority queue with a spin lock around it.

If N is relatively small, it might be possible to continuously process N events in flight instead of taking them out in chunks of N. The logic for each thread would be:

take event from priority queue
wait until event is "space-like" with respect to "in flight" events
add event to "in flight" events
process event
update priority queue with consequent events remove event from "in flight" events

Both the priority queue and the "in flight" set would have to be atomic or protected by a lock. There's nothing "off the shelf" in TBB to deal with this kind of scheduling, though I suspect that it can be done in TBB with a little effort (e.g. the logic resembles class tbb::pipeline somewhat, but not closely enough to use tbb::pipeline). It may be simplest to prototype it in pthreads, using condition variables to implement step 2.

If it works out well, then you could consider reworking it TBB-style, which would involve implementing the "wait" as parking a continuation task that would be spawned by the thread that removes the blocking "time-like" event in step 5. That typically achieves much lower overhead than condition variables, but is muchmore work, so I'd prototype using pthreads and condition variables first.