Re: Memory topics

baffe · 06-23-2008

Hi!

I am evaluating TBB (so far only parallel_for) for possible use in our application.
So far, I am very satisfied with the results, but I have a few questions:




1. Since our application runs in real time, and has high demands on reliability, we
have a general rule that memory allocation/deallocation on the heap is forbidden during
run time. I have tried to read the source code (task.cpp), and as far as I could see,
TBB obeys this rule in the sense that parallel_for can cause memory to be allocated
on the heap the first few times it is called, but that this memory is re-used, so that
after a while, no more memory is allocated on the heap. Please let me know if this
understanding is correct!
2. One step in our computations requires a memory working area. So far, we have solved
this by declaring a fairly large (25 kB) static object. I understand that this is not
feasible in connection with parallel_for, since threads may access this object
concurrently. For now, I solved this by putting the memory on the stack, declaring
it as a local variable. But I suppose it is inefficient to repeatedly put such a
large object on the stack. Is there a better way to do it, e.g. by declaring thread
specific static objects, so that only one object per thread is created during the
entire program life time?
3. Are the threads created by parallel_for equivalent to Windows threads created by
_beginthread in the sense that all synchronization commands (e.g. spin_mutex or
CriticalSection) can be used to synchronize both kinds of threads?

robert-reed · 06-23-2008

Thanks for your interest in Intel Threading Building Blocks. Regarding your questions:

1. Since a prime philosophy driving TBB is to maximize cache reuse, early allocation and stable memory use are important design goals. Run-time memory allocation is minimized but not forbidden. For example, the scalable allocator on first allocation whacks off a big chunk of memory, which it divides into separate pools for per-thread allocation, so you'd see some heap activity right at the beginning. Likewise, data structures like concurrent_hash_map do an initial allocation sufficient to handle a nominal group of entries but use binary growth should more space be needed. The threads used to handle TBB tasks are also allocated to a pool initially and reused, minimizing allocation thrash. If your program stabilizes in its resource use, heap allocations should also be fairly stable.

2. The latest release of Intel TBB provides a means to set the thread stack size, which makes stack allocation of larger buffers, as in your current practice, workable. It sounds, though, like you'd really like to have some Thread Local Storage allocated for each of the pool threads. One of the new Intel TBB 2.1 features, the task_scheduler_observer, provides the hooks you need to set up a per-thread storage area for the TBB worker threads. I explain how to do it in my under the hood blog series.

3. The parallel_for doesn't actually create any threads. Those are created and pooled when you create the task_scheduler_init object and stay around as long as that object exists. These are native threads, spawned in whichever Intel TBB supported OS your program happens to be running, so the same synchronization primitives can be used with them. The parallel_for submits a task to the TBB scheduler, which under parallel_for and using the blocked_range can split that task into a bunch of smaller ones and allocate pool threads to execute various subranges of the original range. But just because you can synchronize, it doesn't necessarily mean you should. Intel TBB enabled programs run most efficiently when you let its unfair scheduler maximize parallelism and minimize cache thrashing by avoiding synchronization as much as is feasible for the algorithm.
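
As an illustration of the per-thread work area discussed in point 2, here is a minimal sketch. It does not use task_scheduler_observer; it swaps in the C++11 `thread_local` keyword, which postdates this thread and needs a modern compiler. The names `WorkArea`, `threadWorkArea`, `kWorkAreaBytes` and `workerGetsPrivateReusableArea` are made up for the example, and the 25 kB size is taken from the original question.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>

// Hypothetical work-area size, matching the 25 kB mentioned in the question.
constexpr std::size_t kWorkAreaBytes = 25 * 1024;

struct WorkArea {
    unsigned char bytes[kWorkAreaBytes];
};

// Returns the calling thread's private work area.
// Each thread gets exactly one instance, which lives until that thread exits.
// The storage is set up by the runtime per thread; this code issues no
// explicit new/malloc.
WorkArea& threadWorkArea() {
    thread_local WorkArea area;
    return area;
}

// Returns true if a freshly started thread gets its own buffer (distinct from
// the calling thread's) and sees that same buffer on every call.
bool workerGetsPrivateReusableArea() {
    const WorkArea* callerArea = &threadWorkArea();
    bool ok = false;
    std::thread worker([&] {
        const WorkArea* first = &threadWorkArea();
        const WorkArea* second = &threadWorkArea();
        ok = (first == second) && (first != callerArea);
    });
    worker.join();  // join makes the write to `ok` visible to the caller
    return ok;
}
```

Inside a parallel_for body, calling `threadWorkArea()` at the top of `operator()` would hand each worker thread its own buffer without placing 25 kB on the stack for every invocation. Whether this is faster than a stack buffer is something to measure rather than assume.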

baffe · 06-25-2008

Thanks a lot!

I am still a bit concerned about the heap activities (due to the words "should" and
"fairly" in the answer!). I guess it is very application dependent, but consider the
following code snippet, borrowed from the TBB tutorial:

#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"

const int size = 1000;
void g(double& x); // Does some heavy job on x

class Worker
{
public:
    Worker(double *a) : m_a(a) {}
    // Apply g to every element in the subrange handed to this task
    void operator() (const tbb::blocked_range<size_t>& r) const
    {
        for (size_t i = r.begin(); i != r.end(); ++i)
        {
            g(m_a[i]);
        }
    }
private:
    double* const m_a;
};

void myFunction(double *a)
{
    // Read values to a from somewhere
    tbb::parallel_for(tbb::blocked_range<size_t>(0, size, 50), Worker(a));
    // Do something clever with the result
}

Assume that the only thing the program does is to repeatedly call myFunction
(with a pre-allocated array), and that the left-out parts are harmless. Then,
is there a guarantee that after the first few calls, no more memory is allocated
on the heap?

Alexey-Kukanov · 06-26-2008

baffe:
Then, is there a guarantee that after the first few calls, no more memory is allocated on the heap?

Formally, there is no such guarantee. In TBB, task stealing is random-based and thus task distribution between threads varies from run to run. As the pools of reusable task objects are per thread, any particular thread in any particular run repetition may fall short of available task objects and request memory allocation.

Practically, after several runs I would expect memory consumption to stop increasing noticeably, though sometimes allocation of a couple more tasks can still happen as I described above. If TBB is used together with the TBB memory allocator, allocation of additional task objects will have little overhead on average, because the memory allocator serves requests for small objects (such as parallel_for tasks) from a preallocated block of virtual memory, without any kernel calls (unless the preallocated block runs out), and by using fast algorithms. So it might not be that bad even if it happens in the middle of execution.

You might run some experiments to check memory consumption behavior. If you decide to do that, try creating a test that is close to how you would actually use parallel_for and/or other TBB constructs in your application; an example that needs about 20 task objects (as the one above) will definitely have different memory behavior than one requiring thousands of objects.
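
One possible shape for such an experiment is sketched below. It is only an assumption about how one might measure this, not a TBB facility: it replaces the global `operator new` with a counting version and reports how many calls each repetition of a workload makes. The names `g_newCalls`, `allocationsPerIteration` and `exampleWorkload` are made up for the example. Note the limitation: allocations made directly through malloc, or through a separate allocator such as the TBB memory allocator mentioned above, bypass `operator new` and would not be counted, so a real test would need a hook at that level or an external tool.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Number of calls to the global operator new since program start.
static std::atomic<std::size_t> g_newCalls{0};

// Replacement global operator new that counts every call.
void* operator new(std::size_t size) {
    g_newCalls.fetch_add(1, std::memory_order_relaxed);
    if (void* p = std::malloc(size ? size : 1)) return p;
    throw std::bad_alloc();
}

void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// Calls `workload` `iterations` times and returns the number of operator new
// calls observed during each iteration.
template <typename Workload>
std::vector<std::size_t> allocationsPerIteration(Workload workload, int iterations) {
    std::vector<std::size_t> counts;
    counts.reserve(iterations);  // allocate the result storage up front
    for (int i = 0; i < iterations; ++i) {
        const std::size_t before = g_newCalls.load();
        workload();
        counts.push_back(g_newCalls.load() - before);
    }
    return counts;
}

// Stand-in workload (hypothetical): grows a reusable buffer on first use only.
// In a real experiment this would be the call to myFunction / parallel_for.
void exampleWorkload() {
    static std::vector<double> scratch;
    if (scratch.size() < 1000) scratch.resize(1000);
    scratch[0] += 1.0;
}
```

Calling `allocationsPerIteration(exampleWorkload, 5)` yields a nonzero count for the first repetition and zeros afterwards, which is the "stabilizing" pattern to look for. With the real workload, occasional nonzero entries in later repetitions would correspond to the extra task allocations described above.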

baffe · 09-02-2008

Thanks!

Actually, memory allocation in itself need not be a problem, as long as memory is not returned and allocated anew (e.g. new and delete in C++), since this may lead to memory fragmentation. So I might weaken the question: Is there a guarantee that the tasks allocated remain in use for the entire program life-cycle? Is there even a limit to the number of tasks that can be needed, so that the program might end up having allocated sufficiently many tasks, and will perform no further dynamic memory allocation?

You also recommend using TBB in combination with the TBB memory allocator. How do I do that?

Alexey-Kukanov · 09-02-2008

I inspected the code, and I can say that under some conditions, there is such a guarantee in TBB 2.1.

First, make sure your master thread keeps at least one task_scheduler_init object alive until you know for sure TBB is no longer used by this thread. If you use TBB from several threads in your application, the above applies to each of them. Once you delete the last task_scheduler_init object in a thread, the internal scheduler structures for that thread, including the task pool and task objects, are deleted. Once you delete the last task_scheduler_init object in the whole program, TBB worker threads are shut down as well.

Second, make sure your task objects are "small". If you only use TBB algorithms such as parallel_for etc., you are safe here. Otherwise, ensure that the objects you inherit from tbb::task are less than 192 bytes in size (this constant might change in the future).

Small tasks are not deallocated until the task pool of a thread is destroyed, thus as long as you stick to the above two "rules", you will have the guarantee you want.
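
The recycling behavior described above can be pictured with a generic free-list pool. The sketch below is not TBB's implementation; it is a single-threaded illustration with made-up names (`FreeListPool`, `heapAllocationsAfter`), and the 192-byte block size is borrowed from the limit quoted above purely as an example value. Released blocks go back onto a list instead of to the heap, so the number of heap allocations is bounded by the largest number of blocks that were ever live at the same time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Example block size (assumption: borrowed from the 192-byte figure above).
constexpr std::size_t kBlockBytes = 192;

class FreeListPool {
public:
    FreeListPool() = default;
    FreeListPool(const FreeListPool&) = delete;
    FreeListPool& operator=(const FreeListPool&) = delete;

    // Recycled blocks return to the heap only when the pool is destroyed.
    // Blocks still held by the caller at that point are not reclaimed here.
    ~FreeListPool() {
        while (free_) {
            Node* next = free_->next;
            std::free(free_);
            free_ = next;
        }
    }

    void* acquire() {
        if (free_) {  // reuse a recycled block: no heap call
            Node* n = free_;
            free_ = n->next;
            return n;
        }
        void* p = std::malloc(kBlockBytes);
        if (!p) throw std::bad_alloc();
        ++heapAllocations_;
        return p;
    }

    // Recycle the block instead of freeing it.
    void release(void* p) { free_ = new (p) Node{free_}; }

    std::size_t heapAllocations() const { return heapAllocations_; }

private:
    struct Node { Node* next; };
    Node* free_ = nullptr;
    std::size_t heapAllocations_ = 0;
};

// Simulates `rounds` repetitions in which `peak` blocks are live at once and
// returns the total number of heap allocations the pool made.
std::size_t heapAllocationsAfter(int rounds, int peak) {
    FreeListPool pool;
    std::vector<void*> live;
    live.reserve(peak);
    for (int r = 0; r < rounds; ++r) {
        for (int i = 0; i < peak; ++i) live.push_back(pool.acquire());
        for (void* p : live) pool.release(p);
        live.clear();
    }
    return pool.heapAllocations();
}
```

Here `heapAllocationsAfter(1, 20)` and `heapAllocationsAfter(100, 20)` both return 20: after the first round every request is served from the free list. In TBB the pools are per thread and work distribution varies between runs, so the per-thread peak can still rise occasionally, as noted earlier in the thread.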

To use the TBB memory allocator for tasks and other allocations inside TBB, you need to do almost nothing: just ensure that the shared library for the allocator (e.g. on Linux it would be libtbbmalloc.so.2) can be dynamically loaded (on Linux, by dlopen). You can check whether tbbmalloc is used or not by setting the TBB_VERSION environment variable to 1 and running a TBB test, or possibly your application. The library will then print some info into stdout, including the allocator in use:

TBB: VERSION 2.1
TBB: INTERFACE VERSION 3011
TBB: BUILD_DATE Fri, 9 May 2008 16:04:43 UTC
...
TBB: ALLOCATOR scalable_malloc
TBB: SCHEDULER Intel