I recently re-implemented the scalable_allocator from TBB for a research assignment at university, but as a course requirement I was forbidden from looking at other allocator code. Now that the assignment is over and handed in, I have a question lingering from my own implementation, and this issue was not described in the papers I read. (I'm very busy with exams right now, so I hope someone who already knows the code has a few minutes to satisfy my curiosity.)
What happens when multiple threads request a new block from the OS? In particular, as I understand McRT-malloc and scalable_allocator, during initialization of a program (when I would expect a sudden burst of frequent memory requests) many blocks are going to be fetched concurrently from the OS itself. How does scalable_malloc handle this concurrently? In our assignment, our OS allocator was not thread-safe, so we had to find some way of ensuring mutual exclusion.
In my own implementation, I included an atomic int: each thread that wanted to request a global block incremented this integer to signal its desire for a new block. Each thread would then attempt to enter the critical section, and the one thread that did enter could make a request large enough to satisfy the block requirements of all waiting threads, based on the value of this atomic int. The threads that did not gain access to the critical section would check frequently for new blocks (perhaps released by other threads), or attempt to enter the critical section themselves. The reasoning here is that threads could safely signal the thread inside the critical section that they would like a block, without actually interfering with mutual exclusion. I was quite proud of this optimization, and would love to hear comments from the experts.
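To make the scheme concrete, here is a minimal sketch of what I mean. Everything here is hypothetical and my own naming: `os_alloc` stands in for the non-thread-safe OS call, `try_lock` plays the role of the critical-section entry, and the batch is simply leaked rather than returned to the OS, since this only illustrates the signaling idea:

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

constexpr std::size_t BLOCK_SIZE = 16 * 1024;

std::atomic<int> waiters{0};    // threads currently wanting a block
std::mutex os_mutex;            // guards the non-thread-safe OS request
std::mutex list_mutex;          // guards the shared free-block list
std::vector<void*> free_blocks; // blocks available to any thread
std::atomic<int> os_calls{0};   // how many times we actually hit the "OS"

// Stand-in for the non-thread-safe OS allocation call.
void* os_alloc(std::size_t bytes) { return ::operator new(bytes); }

void* get_block() {
    // Signal demand before trying to enter the critical section.
    waiters.fetch_add(1, std::memory_order_relaxed);
    for (;;) {
        // First, see if another thread already fetched a batch for us.
        {
            std::lock_guard<std::mutex> lg(list_mutex);
            if (!free_blocks.empty()) {
                void* b = free_blocks.back();
                free_blocks.pop_back();
                waiters.fetch_sub(1, std::memory_order_relaxed);
                return b;
            }
        }
        // Otherwise, try to become the thread that calls the OS.
        if (os_mutex.try_lock()) {
            // Snapshot current demand; this thread is still counted,
            // so n >= 1. Request one batch big enough for everyone.
            int n = waiters.load(std::memory_order_relaxed);
            char* batch = static_cast<char*>(os_alloc(n * BLOCK_SIZE));
            os_calls.fetch_add(1, std::memory_order_relaxed);
            {
                std::lock_guard<std::mutex> lg(list_mutex);
                for (int i = 0; i < n; ++i)
                    free_blocks.push_back(batch + i * BLOCK_SIZE);
            }
            os_mutex.unlock();
            // Loop back around and claim one of the new blocks.
        }
        std::this_thread::yield();
    }
}

// Demo: nthreads all demand a block at once; returns how many OS calls
// were needed (at most nthreads, ideally far fewer).
int run_demo(int nthreads) {
    std::vector<std::thread> ts;
    std::vector<void*> got(nthreads, nullptr);
    for (int i = 0; i < nthreads; ++i)
        ts.emplace_back([i, &got] { got[i] = get_block(); });
    for (auto& t : ts) t.join();
    for (void* p : got) assert(p != nullptr);
    return os_calls.load();
}
```

The point is that the increment is cheap and happens outside the lock, so late arrivals piggyback on the one expensive OS request instead of each making their own.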
There are some other lingering questions on things that I think are inefficient in the TBB allocator (as described in the papers), but I really need to look at the code to check how it was handled there.
Wouldn't this only be meaningful for very short-lived programs that need little memory, that might as well be serial? And wouldn't other programs suffer additional contention on that atomic int instead of deriving a benefit?
Hey, I've been on the other end of the parallel world exploring GPUs for the past couple of years to understand a very different approach to parallel computing. In the next couple of years (starting after my graduation in spring) my efforts will focus on merging what I've learned into something really cool.
What you have missed is how the scalable_allocator gets memory, or at least how it is described in the papers. Memory requests are rounded up to a size that is managed by bins, and each bin is a linked list of blocks which can service those allocation requests. Allocation requests, hopefully, can just recycle free space from blocks which are private to the thread. There are a few other steps I won't go into, but if there is no free space the thread has no choice but to use some synchronization and attempt to get a fresh block for itself from a store shared amongst all threads. The situation I describe is what happens when that shared store is itself out of blocks... and all threads might hit the same situation at the same time. The cost of an atomic here is minuscule in comparison to the cost of a memory allocation that has to be done by an OS system call. Yes, this situation should not happen very often... but it will likely happen at the initialization of the program, when you have many threads hungry for memory. There should be a lot of contention on the shared store to service blocks at this time, and the blocks simply might not be there. Eventually the requests for blocks should level out (otherwise the application will run out of memory anyway), so this optimization is aimed at initialization.
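Just to illustrate the rounding-to-bins step, here is a toy sketch. The power-of-two classes and the limits are made up by me for illustration; the papers describe a fixed set of size classes, but the actual classes in TBB are finer-grained than this:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical size classes: powers of two from 8 bytes up to 8 KB.
// Anything larger would go down a separate large-object path.
constexpr std::size_t MIN_CLASS = 8;
constexpr std::size_t MAX_CLASS = 8 * 1024;

// Round a request up to the smallest size class that can hold it,
// so that all requests in the same class share one bin of blocks.
std::size_t round_to_class(std::size_t bytes) {
    if (bytes > MAX_CLASS)
        return bytes; // large-object path, not binned
    std::size_t c = MIN_CLASS;
    while (c < bytes)
        c <<= 1;
    return c;
}
```

Each class then has its own bin, so a thread looking for 24 bytes and a thread looking for 30 bytes both end up recycling space from the same 32-byte bin.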
Again, there might be some magic there already... and if there isn't, this might be a useful strategy to employ. There might be even more intelligent things to do, like looking at simple statistical data to predict sudden bursts of allocation requests... but it will also be important to be able to backtrack if the prediction is wrong (i.e., free the unneeded blocks).
OK, the cost of the atomic will be negligible, but so will the benefit. What are the odds of contention for system memory except at the beginning, which would be amortised over the lifetime of most programs? As far as I can tell, there would only be a benefit if startup latency is critical (but how long do these requests take compared to starting up those threads?), or if you are certain that one request will suffice for all further needs of the application (and many instances will be run). So it would have to be an almost trivial code change to be worth it, I think.