Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

use TBB scalable allocator for both multi-threading and single-threading purpose

fenglai
Beginner
395 Views
Hello there. I have two matrix classes. The first one is for normal use: it uses the STL allocator and is designed for a single-threaded environment. On the other hand, my code also runs heavy calculations in a multi-threaded environment, where I use TBB for parallelization. Inside each thread I designed a local matrix class (local to a single thread in the multi-threaded environment), and it uses the TBB scalable allocator.

The two matrix classes are nearly the same, and I want to merge them. If the TBB scalable allocator can be used in a single-threaded environment, then I can use it for both purposes. So my question is: can we use the TBB scalable allocator for both multi-threading and single-threading?
 
Thanks a lot!
 
 
0 Kudos
6 Replies
RafSchietekat
Valued Contributor III
395 Views

There's no reason not to use the scalable allocator also in a single-threaded environment. It is scalable because it can very efficiently deallocate to the original thread, and that happens to be the situation in a single-threaded environment. (The only thing to potentially worry about is memory overhead for larger allocations.)
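For example, the two classes could be merged by making the allocator a template parameter; a minimal sketch (the Matrix class below is illustrative, not your actual code):

#include <cstddef>
#include <memory>
#include <vector>
#include <tbb/scalable_allocator.h>

// One matrix class; the storage allocator is a template parameter that defaults
// to the TBB scalable allocator, which works in both single-threaded and
// multi-threaded code.
template <typename T, typename Alloc = tbb::scalable_allocator<T>>
class Matrix {
public:
    Matrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols) {}

    T& operator()(std::size_t r, std::size_t c) { return data_[r * cols_ + c]; }
    const T& operator()(std::size_t r, std::size_t c) const { return data_[r * cols_ + c]; }

private:
    std::size_t rows_, cols_;
    std::vector<T, Alloc> data_;
};

int main() {
    Matrix<double> a(1000, 1000);                           // scalable allocator (default)
    Matrix<double, std::allocator<double>> b(1000, 1000);   // STL allocator, if ever needed
    a(0, 0) = 1.0;
    b(0, 0) = a(0, 0);
    return 0;
}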

0 Kudos
fenglai
Beginner
395 Views

Thank you very much Raf!

Since the code may work on big matrices (row/column dimensions around 10000-100000), how much overhead is typical for large allocations?

Thank you again,

fenglai

0 Kudos
fenglai
Beginner
395 Views

I am trying to follow up on Raf's hint and see what the overhead is for large allocations. What concerns me most is the memory overhead.

I wrote a simple program that just allocates a big double-precision vector (40000*40000) and initializes it with zeros. I use smaps to monitor its memory usage, and I also time the code. A sketch of the test and the results are below.
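Roughly, the test amounts to the following sketch (a reconstruction, not the exact program; the real code pauses at the end so /proc/<pid>/smaps can be read):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>
#include <tbb/scalable_allocator.h>

int main() {
    using clock = std::chrono::steady_clock;
    const std::size_t n = 40000ull * 40000ull;   // 1.6e9 doubles, about 12500000 kB

    auto t0 = clock::now();
    std::vector<double, tbb::scalable_allocator<double>> v(n);   // "generation"
    auto t1 = clock::now();

    for (std::size_t i = 0; i < n; ++i) v[i] = 0.0;              // "initialization"
    auto t2 = clock::now();

    std::printf("time with generation of vector in seconds   %f\n",
                std::chrono::duration<double>(t1 - t0).count());
    std::printf("time with initialization of vector in seconds   %f\n",
                std::chrono::duration<double>(t2 - t1).count());

    std::getchar();   // pause here so smaps can be inspected
    return 0;
}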

For the memory usage, smaps gives the result like this:

7f0510ea1000-7f080c022000 rw-p 00000000 00:00 0
Size:           12502532 kB
Rss:            12500024 kB
Pss:            12500024 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:  12500024 kB
Referenced:     12500024 kB
Anonymous:      12500024 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB

I think this section corresponds to the large vector: a double-precision vector of size 40000*40000 occupies 12500000 kB, and the totals shown above mean the memory overhead is really very small (Rss is only about 24 kB above the data itself).

The timing for TBB scalable allocator is like below:

time with generation of vector in seconds   22.520830
time with initilization of vector in seconds   5.876563

In contrast, the stl vector gives the timing of:

time with generation of vector in seconds  5.630471 
time with initilization of vector in seconds   5.704782

The timing overhead is trivial for me. Hopefully my test does not have something wrong in it; if it does, please correct me.

Thank you!

fenglai

0 Kudos
RafSchietekat
Valued Contributor III
395 Views

Sorry for leaving that open to interpretation: very large allocations will indeed have relatively little overhead, but some range in between may still have relatively large overhead (I don't have any details about that).

0 Kudos
jimdempseyatthecove
Honored Contributor III
395 Views

And what are the timings the second time you do this, after you return the memory (within the same program)? IOW, put a loop around your timing code.

SOP for the scalable allocator is: incur an expense on the first allocation, reap the benefits on return and re-allocation of same-sized objects. Your test case appears to be use-once.
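Something like this sketch (iteration count and size are placeholders; compare the first iteration against the later ones):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>
#include <tbb/scalable_allocator.h>

int main() {
    using clock = std::chrono::steady_clock;
    const std::size_t n = 40000ull * 40000ull;

    for (int iter = 0; iter < 3; ++iter) {
        auto t0 = clock::now();
        {
            std::vector<double, tbb::scalable_allocator<double>> v(n);
            for (std::size_t i = 0; i < n; ++i) v[i] = 0.0;
        }   // vector destroyed: memory goes back to the scalable allocator, ready for re-use
        auto t1 = clock::now();
        std::printf("iteration %d: %f seconds\n", iter,
                    std::chrono::duration<double>(t1 - t0).count());
    }
    return 0;
}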

Jim Dempsey

0 Kudos
RafSchietekat
Valued Contributor III
395 Views

I think I overlooked those 22.520830 seconds there... really?

But I also don't see what was meant by "time with generation"/"time with initialisation" (annoyingly, C++ doesn't allow you to forego element initialisation, not even for simple types). Is the latter a separate loop?

0 Kudos