Intel® oneAPI Threading Building Blocks

Fast dynamic-size thread-specific storage

renorm
Beginner
A performance-critical function is called from an inner loop and needs to allocate dynamically sized storage. Naturally, the function can be called concurrently from many threads. I would like a critique of these two possible solutions. Input and output are not shown.
[cpp]#include <cstddef>
#include <vector>
#include <tbb/enumerable_thread_specific.h>

typedef std::vector<double> storage_type; // element type assumed (double)

std::size_t compute_size(); // hypothetical helper: n is computed at runtime

// solution one: a fresh vector on every call
void foo1() {
    storage_type storage;

    // n is computed at runtime
    std::size_t n = compute_size();
    storage.resize(n);
}

// solution two: one vector per thread, reused across calls
void foo2() {
    static tbb::enumerable_thread_specific<storage_type> tls_storage;
    storage_type& storage = tls_storage.local();

    // n is computed at runtime
    std::size_t n = compute_size();
    storage.resize(n);
}[/cpp]

foo1 has to allocate on the free store each time it's called.

foo2 doesn't have to reallocate as long as the high-water mark is not exceeded. The maximum storage size is relatively small, so memory waste is not an issue.

Will foo2 work properly even after task_scheduler_init is reinitialized, provided that the storage is not used to transfer data between successive calls?
RafSchietekat
Valued Contributor III

It should, as long as foo2() is not recursive: directly, indirectly in a straightforward manner, or... indirectly by way of the scheduler (in case the thread is sent out stealing while inside foo2()), or in some other way I haven't even thought of yet. :-)
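
(To make the stealing scenario concrete, here is a hypothetical sketch of how foo2() could be re-entered on the same thread if its body waits on parallel work; foo2_unsafe and its parallel_for workload are made up for illustration.)

[cpp]#include <cstddef>
#include <vector>
#include <tbb/enumerable_thread_specific.h>
#include <tbb/parallel_for.h>

static tbb::enumerable_thread_specific<std::vector<double> > tls;

void foo2_unsafe(std::size_t n) {
    std::vector<double>& storage = tls.local();
    storage.resize(n);
    // While this thread waits inside parallel_for, it may steal another
    // task; if that task also calls foo2_unsafe(), the second activation
    // resizes the very same thread-local 'storage' underneath this one.
    tbb::parallel_for(std::size_t(0), n, [&](std::size_t i) {
        storage[i] = double(i);
    });
}[/cpp]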

(Added) Be sure to tell us how much you saved, if anything. The scalable allocator is supposed to be quite efficient, but without checking the code I'm not certain that foo1's allocate/free pattern wouldn't trigger any problematic behaviour (what happens if an allocator block is inserted into a bin for one allocation and that allocation is freed immediately afterwards, emptying the block?).

renorm
Beginner
The saving was about 15%. The vector size was about 100 (the resize calls in foo1 and foo2). For each element of the vector the function performed 2 floating-point operations, several bitwise operations, and 2 random memory reads.

The function is not recursive and doesn't block or pause in any other way. Can the scheduler stop it halfway through and let the worker thread run another task which in turn calls foo2?

I am not sure what you mean by blocks and bins. Is that related to the internals of TBB's allocator?
RafSchietekat
Valued Contributor III
"The saving was about 15%. The vector size was about 100 (lines 8 and 17). Per each element of the vector the function performed 2 floating point oration, several bits-wise operation and 2 random memory reads."
OK, that's significant, I guess, especially if you mean 15% less time instead of 15% faster. A bit surprising.

"The function is not recursive and doesn't block or pause in any other way. Can the scheduler stop it half way through and let the worker thread to run another task which in turn calls foo2?"
No.

"I am not sure what you mean by blocks and bins. Is that related to the internals of TBB's allocator?"
Yes. But now I think I remember that an empty block would not be actively recycled before a new allocation needs one, so I'm not sure where the allocator is still spending that time. Anyway, a good result, but only for selective use.

(Added) Well... 1/(1-0.15) - 1 = 0.176..., so 15% less time is 17.6% faster. :-)
renorm
Beginner
That was the result of eyeballing. These are the exact numbers using ICC 11 and MS Visual C++ 2008.

Calls (millions) per second, single-threaded test (can't do a multithreaded one yet).

1) Storage provided by the caller: ICC = 116, MSVC = 112.

2) Static local TLS (foo2): ICC = 116, MSVC = 107.

3) Local vector (foo1): ICC = 91, MSVC = 96.

Case 1 requires an interface change.

In another test compiled with ICC, case 1 throughput was 187 versus 130 for case 3.
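
For reference, a minimal sketch of how such a single-threaded throughput number might be measured; the loop count and the tbb::tick_count-based timing are assumptions, not the actual harness used above.

[cpp]#include <cstdio>
#include <tbb/tick_count.h>

void foo2(); // solution two from the original post

int main() {
    const long calls = 100000000; // 100 million calls
    tbb::tick_count t0 = tbb::tick_count::now();
    for (long i = 0; i < calls; ++i)
        foo2();
    tbb::tick_count t1 = tbb::tick_count::now();
    std::printf("%.0f million calls/s\n",
                calls / (t1 - t0).seconds() / 1e6);
    return 0;
}[/cpp]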
RafSchietekat
Valued Contributor III
So basically TLS is free with ICC?

Still, if the initial dimensioning is final, you could also estimate that a fraction f of uses would fit in an automatic array of size N, with the remaining cases allocated dynamically, thus largely hiding the allocation cost (see Amdahl's formula). Or go for a not strictly portable solution, like using a variable to dimension an automatic array in all cases (supported by GNU, I think; I don't know about other compilers) or alloca(). Just be reasonable with how much stack space you use (I don't have numbers there), and roll your own cache-line alignment (if really needed).
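
A minimal sketch of the portable variant of that idea, assuming an illustrative small-buffer size N = 128 and a double element type (both made up for illustration):

[cpp]#include <cstddef>
#include <vector>

void foo3(std::size_t n) {
    const std::size_t N = 128;   // sized so that a fraction f of calls fit
    double small_buf[N];         // automatic storage: no allocation at all
    std::vector<double> big_buf; // dynamic fallback for the rare n > N
    double* storage = small_buf;
    if (n > N) {
        big_buf.resize(n);
        storage = &big_buf[0];
    }
    // ... use storage[0..n-1] ...
}[/cpp]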
renorm
Beginner
That was only a single-threaded run. Maybe ICC managed to optimize something away?

How about a generic solution, maybe as a new TBB feature?
[cpp]#include <cstddef>
#include <tbb/enumerable_thread_specific.h>

struct raw_storage_stack;
struct scoped_reference;

// Reference to raw storage
struct scoped_reference
{
    // pointer to the raw storage
    void* pointer();

    // pushes the storage back onto the stack
    ~scoped_reference();
};

struct raw_storage_stack
{
    /* If the stack is not empty and the top block is big enough, return it
       without a new allocation. Otherwise allocate new storage and return it. */
    scoped_reference next_reference(std::size_t size);
};

tbb::enumerable_thread_specific<raw_storage_stack> global_tls;

void foo() {
    // how much storage is needed; computed at runtime
    std::size_t n = 0;

    scoped_reference ref = global_tls.local().next_reference(n);

    // storage big enough for n bytes
    void* ptr = ref.pointer();

    // ... use ptr ...

    // ref expires here and the storage is pushed back onto the stack
}[/cpp]
raw_storage_stack maintains a stack of pre-allocated pointers. scoped_reference knows which stack its storage came from and pushes it back onto the same stack upon destruction. The stack hides behind TLS, so it doesn't need to be synchronized, and the scheme works with recursive functions too, because each nested call pops its own block. Allocation is amortized, similar to std::vector's growth behavior, which suits functions called from tight loops: after a few cycles the stack holds enough storage and no new allocation is needed. It may also improve cache hits, because the storage is recycled by the same thread and the stack discipline decreases cache cooling.
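
To make the proposal concrete, here is one hypothetical implementation sketch; none of these names are an actual TBB API, and the malloc-based storage and C++11 move semantics are my own choices.

[cpp]#include <cstddef>
#include <cstdlib>
#include <utility>
#include <vector>

class raw_storage_stack;

class scoped_reference {
public:
    scoped_reference(const scoped_reference&) = delete; // move-only handle
    scoped_reference(scoped_reference&& o)
        : stack_(o.stack_), ptr_(o.ptr_), size_(o.size_) { o.stack_ = 0; }
    void* pointer() const { return ptr_; }
    ~scoped_reference(); // pushes the block back onto its stack
private:
    friend class raw_storage_stack;
    scoped_reference(raw_storage_stack* s, void* p, std::size_t sz)
        : stack_(s), ptr_(p), size_(sz) {}
    raw_storage_stack* stack_;
    void* ptr_;
    std::size_t size_;
};

class raw_storage_stack { // note: copying is not handled in this sketch
public:
    scoped_reference next_reference(std::size_t size) {
        // If the top block is too small (or the stack is empty), allocate a
        // new one; a real version might over-allocate to amortize growth.
        if (blocks_.empty() || blocks_.back().second < size)
            blocks_.push_back(std::make_pair(std::malloc(size), size));
        std::pair<void*, std::size_t> top = blocks_.back();
        blocks_.pop_back(); // popping is what makes recursive use safe
        return scoped_reference(this, top.first, top.second);
    }
    ~raw_storage_stack() {
        for (std::size_t i = 0; i < blocks_.size(); ++i)
            std::free(blocks_[i].first);
    }
private:
    friend class scoped_reference;
    std::vector<std::pair<void*, std::size_t> > blocks_;
};

inline scoped_reference::~scoped_reference() {
    if (stack_) // recycle on the owning thread's stack
        stack_->blocks_.push_back(std::make_pair(ptr_, size_));
}[/cpp]

The LIFO discipline is what makes nesting work: an inner call never sees the block its caller is still holding, because that block was popped off the stack.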

A new feature would be justified if the problem is fairly commonplace where TBB is used.
RafSchietekat
Valued Contributor III

"A new feature would be justified if the problem is fairly commonplace where TBB is used."
That says it all, because it seems likely that only a few functions/programs exhibit this pattern (lowercase p) of temporary dynamic-size allocation with too little associated work to drown out the cost of allocation, which even here is only 15%. Until then, it seems fine to just use your solution, or go with something simple (like I proposed in #5), but in any case not to worry about it too much. :-)
