There's no reason not to use the scalable allocator also in a single-threaded environment. It is scalable because it can very efficiently deallocate to the original thread, and that happens to be the situation in a single-threaded environment. (The only thing to potentially worry about is memory overhead for larger allocations.)
Thank you very much Raf!
The code may work on big matrices (row/column dimensions around 10000-100000), so how much overhead is typical for large allocations?
Thank you again,
I am trying to follow up on the clue from Raf, to see what the overhead is for large allocations. What concerns me most is the memory overhead.
I wrote a simple program that just allocates a big double-precision vector (40000*40000) and initializes it with zeros. I use smaps to monitor its memory usage and also time the code. The results are below.
For the memory usage, smaps gives the result like this:
7f0510ea1000-7f080c022000 rw-p 00000000 00:00 0
Size: 12502532 kB
Rss: 12500024 kB
Pss: 12500024 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 12500024 kB
Referenced: 12500024 kB
Anonymous: 12500024 kB
AnonHugePages: 0 kB
Swap: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
I think this mapping corresponds to the large vector: a double-precision vector of size 40000*40000 occupies 40000*40000*8 B = 12,500,000 kB, and smaps reports 12,500,024 kB, so the memory overhead here is only about 24 kB, which is really very small.
The timing for the TBB scalable allocator is below:
time with generation of vector in seconds 22.520830
time with initialization of vector in seconds 5.876563
In contrast, the stl vector gives the timing of:
time with generation of vector in seconds 5.630471
time with initialization of vector in seconds 5.704782
The timing overhead is trivial for me. Hopefully there is nothing wrong with my test; if there is, please correct me.
Sorry for leaving that open to interpretation: very large allocations will indeed have relatively little overhead, but some range in between may still have relatively large overhead (I don't have any details about that).
And what are the timings for the 2nd time you do this after you return the memory (within the same program). IOW put a loop around your timing code.
SOP for the scalable allocator is: incur an expense on first allocation, reap benefits on return and re-allocation of same-sized objects. Your test case appears to be use-once.
I think I overlooked those 22.520830 seconds there... really?
But I also don't see what was meant by "time with generation"/"time with initialisation" (annoyingly, C++ doesn't allow you to forego element initialisation, not even for simple types). Is the latter a separate loop?