Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Parallel programming and object-oriented programming

Vishal_O_
Beginner

I am starting to learn about parallel programming (e.g. multithreading) and wanted opinions on how to start coding a multithreaded application with object-oriented techniques:

1. Have parallel code call, use, and create objects.

2. Have objects call portions of parallel code.

3. Combine the previous two approaches.

If the last case is the best, then when should you have parallel code call, use, and create objects, and when should you have objects call portions of parallel code?

P.S. The main goal is performance, and I am writing in C++.

jimdempseyatthecove
Honored Contributor III

Vishal,

If your main goal is performance, then you must be willing to add some additional code to help boost it. Fortunately, in C++ you can hide most of this from the rest of the application. One time-consuming procedure is malloc, which is called at the lowest level of new. The reason is that malloc must be serialized, usually by way of a critical section. Some malloc implementations do use wait-free programming techniques, but there is additional overhead nonetheless.

To reduce the number of mallocs, a common practice is to allocate object pools. The size of the pool depends on the numbers and frequencies of allocation of a given type of object. As an example, assume there are no objects of a given type allocated. On the first allocation you examine a list-of-pools header for that object type and find that there are no pools available. In this case you allocate a pool of objects as one contiguous block of memory; in this example, a block of memory equivalent to an array of 100 object-sized memory chunks (add alignment if required). Then, from this pool, obtain an available chunk of memory (you write the code, usually a linked list of free memory nodes). The pool allocation (malloc) occurs less than 1% of the time during the initial building of the list of objects, but as objects are returned and reused, the calls to malloc approach 0%. All of this additional coding is hidden by you writing overloaded new and delete functions (you may also have to add overloaded new[] and delete[] functions as well).
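As a minimal sketch of what I mean (FixedPool and Widget are illustrative names, not from any library; a real version would also deal with alignment, derived-class sizes, and releasing the pools at shutdown):

#include <cstddef>
#include <mutex>
#include <new>

// Sketch of a per-type pool hidden behind overloaded new/delete.
// Each chunk is big enough to hold either the object or a free-list link.
template <std::size_t ChunkSize, std::size_t ChunksPerBlock = 100>
class FixedPool {
    union Node { Node* next; unsigned char storage[ChunkSize]; };
public:
    void* allocate() {
        std::lock_guard<std::mutex> lock(mutex_);  // contended only by news of this type
        if (!free_) grow();                        // one malloc serves ChunksPerBlock objects
        Node* n = free_;
        free_ = n->next;
        return n;
    }
    void deallocate(void* p) {
        if (!p) return;
        std::lock_guard<std::mutex> lock(mutex_);
        Node* n = static_cast<Node*>(p);
        n->next = free_;                           // push the chunk back on the free list
        free_ = n;
    }
private:
    void grow() {
        // One contiguous block of ChunksPerBlock object-sized chunks,
        // threaded onto the free list. Blocks are never returned to malloc here.
        Node* block = static_cast<Node*>(::operator new(ChunksPerBlock * sizeof(Node)));
        for (std::size_t i = 0; i != ChunksPerBlock; ++i) {
            block[i].next = free_;
            free_ = &block[i];
        }
    }
    Node* free_ = nullptr;
    std::mutex mutex_;
};

struct Widget {
    double payload[4];
    static void* operator new(std::size_t);
    static void operator delete(void*);
};

static FixedPool<sizeof(Widget)> widgetPool;       // one pool per object type

void* Widget::operator new(std::size_t) { return widgetPool.allocate(); }
void  Widget::operator delete(void* p)  { widgetPool.deallocate(p); }

With this in place, new Widget touches malloc only once per 100 allocations, and only threads allocating or freeing Widgets compete for this particular lock.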

When you allocate an object in the above manner you will likely code using a critical section (or a wait-free style if you wish). However, you will now only experience contention on the critical section for concurrent new(s) of objects of the same type, whereas using the default new handler causes all objects to compete for the same critical section. Think of having 100 different types of objects with equal probability of allocation/deallocation. Using this method you now have at maximum a 1% of 1% chance (times the number of threads) of an allocation/deallocation collision.

Before you go to the effort of writing an overloaded new operator, it would be to your benefit to check just how much overhead is caused by default new handler usage. Use Thread Profiler, PTU, or some other profiling utility (e.g. CodeAnalyst from AMD's site).

Jim Dempsey

Dmitry_Vyukov
Valued Contributor I
vishal@ams-us.com:

1. Have parallel code call, use, and create objects.

2. Have objects call portions of parallel code.

3. Combine the previous two approaches.


In the general case, I think the last option is the optimum. At such a high level of abstraction, it's nearly always bad to initially limit yourself to only one particular approach.


vishal@ams-us.com:

If the last case is the best, then when should you have parallel code call, use, and create objects, and when should you have objects call portions of parallel code?


It's really difficult to answer such a question without context. The only thing I can say is: it depends.


vishal@ams-us.com:

P.S. The main goal is performance, and I am writing in C++.



Object orientation has nothing to do with performance. One can design a scalable non-OO architecture and a non-scalable OO architecture, or vice versa.

Personally I can recommend Actor-oriented programming (or Agent-oriented programming, or Active objects):
http://en.wikipedia.org/wiki/Actor_model
http://en.wikipedia.org/wiki/Active_Object
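A minimal sketch of the active object idea, just to show the shape of it (illustrative code, not a library API): callers post messages to a queue, and a single private thread executes them, so the object's state never needs locking.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Every public call becomes a message; only the internal worker thread
// ever touches the object's state.
class ActiveObject {
public:
    ActiveObject() : worker_([this] { run(); }) {}
    ~ActiveObject() {
        post([this] { done_ = true; });            // poison message: queued after all real work
        worker_.join();
    }
    void post(std::function<void()> msg) {
        { std::lock_guard<std::mutex> lock(m_); q_.push(std::move(msg)); }
        cv_.notify_one();
    }
private:
    void run() {
        while (!done_) {
            std::function<void()> msg;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return !q_.empty(); });
                msg = std::move(q_.front());
                q_.pop();
            }
            msg();                                  // state is touched only from this thread
        }
    }
    std::queue<std::function<void()>> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
    std::thread worker_;                            // declared last so the queue exists before it starts
};

A concrete actor would wrap its state in such a class and expose methods that do nothing but post(...) a message.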


Dmitry_Vyukov
Valued Contributor I
JimDempseyAtTheCove:

Some malloc implementations do use wait-free programming techniques, but there is additional overhead nonetheless.



Wait-free has nothing to do with performance. Wait-free is solely about forward-progress guarantees. A wait-free algorithm can be faster than a traditional lock-based algorithm (for example, if it uses only one XCHG instruction, or no atomic RMW operations at all), or it can be slower than a lock-based one.

There is a class of synchronization algorithms which targets performance specifically. They are faster than lock-based algorithms by definition, but there is no recognized name for this class of algorithms; I call them 'atomic-free'.
You can see some examples in this newsgroup:
http://groups.google.com/group/lock-free/topics
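To make the distinction concrete, here is a minimal sketch of synchronization with no atomic RMW operations at all: a bounded single-producer/single-consumer queue that uses only plain atomic loads and stores with acquire/release ordering (illustrative code, not one of the designs from that group):

#include <atomic>
#include <cstddef>

// Bounded single-producer/single-consumer queue: no CAS, no XCHG, no locks.
// Holds up to Capacity - 1 elements.
template <typename T, std::size_t Capacity>
class SpscQueue {
public:
    bool push(const T& v) {                         // called only by the producer thread
        std::size_t head = head_.load(std::memory_order_relaxed);
        std::size_t next = (head + 1) % Capacity;
        if (next == tail_.load(std::memory_order_acquire))
            return false;                           // full
        buf_[head] = v;
        head_.store(next, std::memory_order_release);
        return true;
    }
    bool pop(T& out) {                              // called only by the consumer thread
        std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false;                           // empty
        out = buf_[tail];
        tail_.store((tail + 1) % Capacity, std::memory_order_release);
        return true;
    }
private:
    T buf_[Capacity];
    std::atomic<std::size_t> head_{0};              // next slot to write; advanced by the producer
    std::atomic<std::size_t> tail_{0};              // next slot to read; advanced by the consumer
};

On x86 the acquire loads and release stores compile to ordinary loads and stores, so the fast path has no interlocked instructions and no fences.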

Dmitry_Vyukov
Valued Contributor I
JimDempseyAtTheCove:

To reduce the number of mallocs, a common practice is to allocate object pools. The size of the pool depends on the numbers and frequencies of allocation of a given type of object. As an example, assume there are no objects of a given type allocated. On the first allocation you examine a list-of-pools header for that object type and find that there are no pools available. In this case you allocate a pool of objects as one contiguous block of memory; in this example, a block of memory equivalent to an array of 100 object-sized memory chunks (add alignment if required). Then, from this pool, obtain an available chunk of memory (you write the code, usually a linked list of free memory nodes). The pool allocation (malloc) occurs less than 1% of the time during the initial building of the list of objects, but as objects are returned and reused, the calls to malloc approach 0%. All of this additional coding is hidden by you writing overloaded new and delete functions (you may also have to add overloaded new[] and delete[] functions as well).


Some programs use a very small number of different object types (I mean object types which are created/destroyed frequently). The number of types can be as small as 3, or 2, or even 1. This can be some 'message' or 'request' type.

So I usually use the following scheme. Each thread has a *private* memory allocator. All allocation requests go to this allocator. Free requests usually go to this allocator too, but when a thread must free an object which was allocated by a different thread, it uses a message-passing queue to pass the object back to the creator thread.
This scheme scales better than global pools of objects, and it also avoids the caveat mentioned above with a very small number of object types.
What do you think?
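A minimal sketch of the scheme (all names are illustrative, and the remote free list here is just a mutex-protected stand-in for the message-passing queue):

#include <cstddef>
#include <mutex>
#include <new>

class ThreadHeap;

// Each block remembers which heap created it so a foreign free can be routed home.
struct Block {
    ThreadHeap* owner;
    Block* next;                                    // free-list link while the block is free
    unsigned char payload[64];
};

class ThreadHeap {
public:
    void* allocate() {
        if (!private_free_) {
            drain_remote();                         // reclaim blocks other threads sent back
            if (!private_free_) refill();           // last resort: hit the global allocator
        }
        Block* b = private_free_;
        private_free_ = b->next;
        return b->payload;
    }

    static void deallocate(void* p) {
        Block* b = reinterpret_cast<Block*>(
            static_cast<unsigned char*>(p) - offsetof(Block, payload));
        if (b->owner == &local())
            b->owner->push_private(b);              // our own block: no synchronization at all
        else
            b->owner->push_remote(b);               // foreign block: message-pass it to its creator
    }

    static ThreadHeap& local() {
        static thread_local ThreadHeap heap;        // one private allocator per thread
        return heap;
    }

private:
    void push_private(Block* b) { b->next = private_free_; private_free_ = b; }

    void push_remote(Block* b) {                    // stand-in for an SPSC/MPSC queue
        std::lock_guard<std::mutex> lock(remote_mutex_);
        b->next = remote_free_;
        remote_free_ = b;
    }

    void drain_remote() {
        std::lock_guard<std::mutex> lock(remote_mutex_);
        while (remote_free_) {
            Block* b = remote_free_;
            remote_free_ = b->next;
            push_private(b);
        }
    }

    void refill(std::size_t count = 64) {
        for (std::size_t i = 0; i != count; ++i) {  // blocks are never returned to malloc here
            Block* b = static_cast<Block*>(::operator new(sizeof(Block)));
            b->owner = this;
            push_private(b);
        }
    }

    Block* private_free_ = nullptr;                 // touched only by the owning thread
    Block* remote_free_ = nullptr;                  // blocks freed by other threads
    std::mutex remote_mutex_;
};

ThreadHeap::local().allocate() and ThreadHeap::deallocate(p) then replace malloc/free for this object size.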

jimdempseyatthecove
Honored Contributor III

Private pool allocators are useful and can greatly reduce the frequency of critical sections (or whatever serializing technique you use).

Returning objects allocated by other threads (assuming the object size, including any sub-objects it contains, is relatively small) might be better suited to a convoy technique as opposed to messaging on each iteration. Each thread maintains a private free list of object-sized nodes (one list per size). Additionally, each thread would maintain a public free list for each object size and for each thread (including itself). The lists would be linked-list LIFOs. Each thread has unlocked access to its private free lists. Each thread has unlocked access to its list of thread-associated free lists for a given object size (including a count); however, for the thread's own entry in the thread-associated free lists for that object size, it must use a CAS operation to grab the list head, replacing it with null.

a) When a thread returns an object it owns, it has unlocked access to its private list.
b) When a thread returns an object it does not own, it has unlocked access to a thread-numbered list of list heads and returns the object to that list, keeping track of the count. When the count exceeds a threshold, the entire list is appended to the lockable list of the other thread using CAS (note: no possibility of ABA in this case).
c) When a thread wants an object of a given size, it first tries to obtain one from its private list.
d) That failing, it examines the shared list that it owns and, if there are items in the list, uses CAS to obtain the entire list (replacing the head with NULL); the entire list (list head) is then copied to the private list.
e) That failing, malloc is called to get one or several object-sized items.

The point being

a) reduce the number of calls to malloc and free
b) reduce the interlocking technique to a single CAS as opposed to a critical section/spinlock/mutex/other
c) reduce the frequency of CAS operations.

There is some expense in the potential for a small portion of memory being allocated but not used. A little extra code in the outermost loop can correct for this if desired.
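A sketch of just the two interlocked operations in this scheme (Node and ConvoyList are illustrative names); everything else, including the private lists and the convoy counting, is ordinary unlocked list manipulation:

#include <atomic>

struct Node { Node* next; };

// One such list head exists per (owning thread, object size) pair.
struct ConvoyList {
    std::atomic<Node*> head{nullptr};

    // Called by a non-owner thread once its local convoy reaches the threshold:
    // the whole chain first..last is appended with one CAS loop.
    void push_batch(Node* first, Node* last) {
        Node* old = head.load(std::memory_order_relaxed);
        do {
            last->next = old;                       // splice the current head after our tail
        } while (!head.compare_exchange_weak(old, first,
                                             std::memory_order_release,
                                             std::memory_order_relaxed));
    }

    // Called by the owning thread when its private list runs dry: grab the
    // entire public list, replacing the head with null. Since nodes are never
    // popped one at a time from this head, there is no ABA problem.
    Node* take_all() {
        return head.exchange(nullptr, std::memory_order_acquire);
    }
};

An unconditional exchange does the same job here as a CAS that replaces the head with null.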

Jim Dempsey

Dmitry_Vyukov
Valued Contributor I
I think it's a great design! One of the most distributed and scalable designs!
The main drawback I see is that this design is not very reactive, i.e. there can be a great number of objects cached in private stacks. Assume that all N^2 private stacks are almost full.

Here is a possible tweak.
Instead of N^2 private stacks, we can use N^2 shared single-producer/single-consumer lightweight FIFO queues. When a thread returns an object it does not own, it instantly enqueues the object into the appropriate queue. When a thread fails to get an object from its private pool, it polls N shared queues until it finds some object.
This design is reactive, i.e. there are no private caches.


Dmitry_Vyukov
Valued Contributor I
Here is another optimization.
When a thread returns an object it does not own, it can keep the object for its own use. The thread will allocate some memory in the future, so why not just keep this object for that future use? In the end, we don't have to inevitably return all objects to their owners.
But here is a caveat. If a thread only frees objects and doesn't allocate objects at all (this is possible in some producer-consumer scenarios), then such a thread can collect all objects in the system in its private pool. To prevent this we must use some threshold:

void free(obj* o)
{
    if (is_my_object(o) || private_list_size < THRESHOLD)
        put_to_private_list(o);
    else
        send_to_owner(o);
}

Dmitry_Vyukov
Valued Contributor I
JimDempseyAtTheCove:

Returning objects allocated by other threads (assuming the object size, including any sub-objects it contains, is relatively small) might be better suited to a convoy technique as opposed to messaging on each iteration. Each thread maintains a private free list of object-sized nodes (one list per size). Additionally, each thread would maintain a public free list for each object size and for each thread (including itself). The lists would be linked-list LIFOs. Each thread has unlocked access to its private free lists. Each thread has unlocked access to its list of thread-associated free lists for a given object size (including a count); however, for the thread's own entry in the thread-associated free lists for that object size, it must use a CAS operation to grab the list head, replacing it with null.


And here is another caveat. All such designs work poorly with the producer-consumer pattern, when one thread constantly allocates objects and another frees them. For such situations I've designed a specialized allocator for FIFO queues. Overhead per alloc/free is really very low; no atomic RMW operations or heavy memory fences at all.
The design is outlined here:
http://groups.google.com/group/lock-free/browse_frm/thread/2c6f6250762385e1


jimdempseyatthecove
Honored Contributor III

If the thread keeps the object for its own use later, then you run into a problem if one thread always allocates and a different thread always deallocates.

Also, in a NUMA architecture, objects allocated by the owning thread have faster access than objects allocated by another thread.

Setting a "convoy" size to say 128 on relatively small objects might be a fair trade-off against a 99% reduction in conflicts for a single CAS. e.g. 128 byte objects, 16 core system would be a potential for worst case 16x16x128x128 idle objects (4MB). Virtually nothing to worry about. If you place a clean-up in an outer loop you can eliminate stale objects at little expense.

Jim Dempsey

Dmitry_Vyukov
Valued Contributor I
JimDempseyAtTheCove:

If the thread keeps the object for its own use later, then you run into a problem if one thread always allocates and a different thread always deallocates.

Agreed. I outlined it here:

http://softwarecommunity.intel.com/isn/Community/en-US/forums/permalink/30262396/30262389/ShowThread.aspx#30262389


JimDempseyAtTheCove:

Also, in a NUMA architecture objects allocated by owning thread have faster access than objects allocated by other thread.

Good point. I really missed this point.

In a NUMA environment there is a strong motive to move all memory back to the owner... But there can still be some optimizations inside one NUMA node.

Hmmm... But here is an exception to the exception to the rule.

Consider the following situation in a NUMA environment. Thread 1 must free an object allocated by thread 2. OK, we can pass this object back to thread 2. But what if the next action of thread 1 will be to send an object to thread 2? In this situation we can still use some limited "privatization" of foreign objects, and we get 2 actions for the price of 1: we send an object and return the old object to its owner in a single action.

But the API must provide the following function:

object* allocate_object(int receiver_thread_id);
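Just to show the shape of it (pop_cached_block_owned_by and allocate_from_private_heap are hypothetical helpers, not real functions):

struct object;                                     // whatever message type is being passed

object* pop_cached_block_owned_by(int thread_id); // hypothetical: a cached block owned by that thread, or null
object* allocate_from_private_heap();             // hypothetical: the normal private-heap path

object* allocate_object(int receiver_thread_id)
{
    // Prefer a cached block the receiver already owns: it is NUMA-local to the
    // receiver, and returning it to its owner happens for free when the message is sent.
    if (object* o = pop_cached_block_owned_by(receiver_thread_id))
        return o;
    return allocate_from_private_heap();
}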


JimDempseyAtTheCove:

Setting a "convoy" size to say 128 on relatively small objects might be a fair trade-off against a 99% reduction in conflicts for a single CAS. e.g. 128 byte objects, 16 core system would be a potential for worst case 16x16x128x128 idle objects (4MB). Virtually nothing to worry about. If you place a clean-up in an outer loop you can eliminate stale objects at little expense.



I completely agree.

But the problem with decentralized amortized designs (like this) is that there are always some caveats. It's easy (for an expert :) ) to develop a memory allocation scheme for a particular context, but it's nearly impossible to design a one-size-fits-all decentralized amortized memory allocation scheme.
With a centralized lock-based design the situation is the opposite: you get bad performance and non-scalability, but no caveats.

jimdempseyatthecove
Honored Contributor III

In a NUMA system any thread running on any core can return an object allocated on any other core. So delete is not a problem.

Storing it in a private queue may be a problem, and this may add another permutation to the lists of free nodes outlined before.

On a NUMA system you do not want to return objects to a generalized free pool, as you lose control over locality.

Each sized (coarse-grained) object would be returned to a sized pool per NUMA node, similar to what was laid out in the earlier post. The earlier post dealt with cache/thread issues; this deals with NUMA distance issues. Typically the processor numbers within a NUMA node are adjacent logical processor numbers, and depending on if/how you set up thread affinity, adjacent thread numbers generally reside in the same NUMA node. Therefore, instead of adding to the permutation of list heads, it becomes a segregation of the existing permutation. No additional data tables would be required, but some additional coding would be required in the consolidation routine of the overloaded delete and delete[].

>>But it's nearly impossible to design a one-size-fits-all decentralized amortized memory allocation scheme.

Quite true. But even with a convoy size of 2 you halve the number of locks. Note that, depending on working set size, a small number of nodes could eliminate all locks but the first n allocations.
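A small sketch of that segregation, with illustrative constants; a real implementation would query the topology instead of hard-coding it:

#include <cstddef>

constexpr std::size_t kNodes = 4;                  // illustrative NUMA node count
constexpr std::size_t kCoresPerNode = 4;           // assumes adjacent logical CPUs per node
constexpr std::size_t kSizeClasses = 8;

struct FreePool { /* per-node, per-size free lists as laid out earlier */ };

static FreePool pools[kNodes][kSizeClasses];       // indexed by (NUMA node, size class)

FreePool& pool_for(std::size_t logical_cpu, std::size_t size_class)
{
    std::size_t node = (logical_cpu / kCoresPerNode) % kNodes;  // adjacent CPUs share a node
    return pools[node][size_class];
}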

Jim Dempsey

Dmitry_Vyukov
Valued Contributor I
JimDempseyAtTheCove:

In a NUMA system any thread running on any core can return an object allocated on any other core. So delete is not a problem.

Storing it in a private queue may be a problem, and this may add another permutation to the lists of free nodes outlined before.

On a NUMA system you do not want to return objects to a generalized free pool, as you lose control over locality.

Each sized (coarse-grained) object would be returned to a sized pool per NUMA node, similar to what was laid out in the earlier post. The earlier post dealt with cache/thread issues; this deals with NUMA distance issues. Typically the processor numbers within a NUMA node are adjacent logical processor numbers, and depending on if/how you set up thread affinity, adjacent thread numbers generally reside in the same NUMA node. Therefore, instead of adding to the permutation of list heads, it becomes a segregation of the existing permutation. No additional data tables would be required, but some additional coding would be required in the consolidation routine of the overloaded delete and delete[].

>>But it's nearly impossible to design a one-size-fits-all decentralized amortized memory allocation scheme.

Quite true. But even with a convoy size of 2 you halve the number of locks. Note that, depending on working set size, a small number of nodes could eliminate all locks but the first n allocations.



In general I agree.
But for some situations halving the number of locks is not enough.
And NUMA further complicates the situation by creating an exponential explosion of possible designs for distributed amortized memory management.
What I am trying to say is that even radical design solutions (like "privatization") have a right to exist... provided that they are accompanied by a list of caveats, because they can fit perfectly into some contexts. And general-purpose designs can fit badly into some contexts too.
Consider the following design of an allocator:
http://groups.google.com/group/lock-free/browse_frm/thread/2c6f6250762385e1
It's definitely not suitable as a general-purpose allocator, but it works extremely well in producer-consumer environments.
