Solved: setting a hard limit on the number of workers doesn't work

guozhu_c_ · ‎03-16-2016

Because each TBB thread in my code uses a very large amount of memory, I must cap the number of worker threads TBB can use.

At first I set the number of workers to 24 in task_scheduler_init on the 24-core machine, but sometimes a 25th, 26th, or 27th thread appears, which causes the program to run out of memory (OOM).

Then I hacked the code in the global market (market.cpp, line 140 in TBB 4.4) and set the hard limit factor to 1. I expected the hard limit to be 24 after that, but a 25th thread still appears after some time and causes OOM again.

How can I set a real hard limit in TBB? Either the public API or a change to the source code would be acceptable.

thank you

Alexei_K_Intel · ‎03-22-2016

Sorry for the late response. Note that Intel TBB does not guarantee the number of created threads; it only guarantees the maximum concurrency inside a task arena. However, threads working inside an arena can leave it, and other threads may join the abandoned "slots". Intel TBB always creates an implicit task arena, and task_scheduler_init can limit its concurrency. Be aware that any Intel TBB algorithm called before task_scheduler_init creates the implicit task arena with default concurrency, in which case task_scheduler_init has no effect.

To overcome the issue you are observing, you can use the task_scheduler_observer functionality. The idea is to track the "entry" and "exit" events that occur when a thread joins and leaves an arena, respectively. We can then allocate the buffer in the "entry" callback, when the thread joins the arena, and deallocate it in the "exit" callback, when the thread leaves the arena. The following code snippet demonstrates the idea:

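#include <cstdlib>
#include <tbb/task_scheduler_observer.h>

// Assumed: SIZE is the per-thread buffer size, defined elsewhere.
static thread_local void* myThreadPtr = NULL;  // assumed per-thread storage

class AllocObserver: public tbb::task_scheduler_observer {
public:
        // 'true' requests a local observer, tied to the current arena
        AllocObserver() : task_scheduler_observer(true) {}
        virtual void on_scheduler_entry( bool ) {
            myThreadPtr = malloc(SIZE);   // the thread has joined the arena
        }
        virtual void on_scheduler_exit( bool ) {
            free(myThreadPtr);            // the thread is leaving the arena
            myThreadPtr = NULL;
        }
};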

Do not forget to call the observe method to enable observation:

AllocObserver allocObserver;
allocObserver.observe(true);

If the allocations turn out to be too expensive, you can implement a "caching" mechanism: in the "exit" callback the thread does not deallocate its memory block but places it in a container, e.g. a std::stack<void*> guarded by a mutex. In the "entry" callback the thread can check whether a cached memory block is available and use it instead of allocating a new one.
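
The caching idea above could be sketched roughly as follows. This is only an illustration in standard C++: BufferCache, acquire, release, and cached are made-up names, not TBB APIs, and the buffer size is whatever the application needs. The "entry" callback would call acquire() and the "exit" callback would call release().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <stack>

// Hypothetical helper (not part of TBB): a mutex-guarded pool of
// fixed-size buffers that are reused instead of being freed.
class BufferCache {
public:
    explicit BufferCache(std::size_t bufferSize) : bufferSize_(bufferSize) {}

    ~BufferCache() {
        // Free whatever is still cached when the pool goes away.
        while (!free_.empty()) {
            std::free(free_.top());
            free_.pop();
        }
    }

    BufferCache(const BufferCache&) = delete;
    BufferCache& operator=(const BufferCache&) = delete;

    // For the "entry" callback: reuse a cached block if one exists,
    // otherwise allocate a new one. Returns a null pointer if malloc fails.
    void* acquire() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (!free_.empty()) {
                void* p = free_.top();
                free_.pop();
                return p;
            }
        }
        return std::malloc(bufferSize_);
    }

    // For the "exit" callback: keep the block for the next thread.
    void release(void* p) {
        if (p == nullptr) return;
        std::lock_guard<std::mutex> lock(mutex_);
        free_.push(p);
    }

    // Number of blocks currently waiting to be reused.
    std::size_t cached() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return free_.size();
    }

private:
    const std::size_t bufferSize_;
    mutable std::mutex mutex_;
    std::stack<void*> free_;
};
```

With a pool like this, the number of live buffers should be bounded by the peak number of threads that were inside the arena at the same time, not by the number of threads TBB has ever created.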

I hope that local observers will help you. In any case, feel free to post questions if you face any Intel TBB related issues.

Alexei_K_Intel · ‎03-17-2016

It is indeed possible for TBB to create more threads than requested in task_scheduler_init. Usually this does not lead to any significant issues, because a thread uses only several megabytes for its stack (TBB also allocates some data structures for each thread, but they should not exceed that order of magnitude). So it seems strange that you hit an OOM issue because of a couple of additional threads. Could you describe your application? Do you specify a stack size in task_scheduler_init? Do you use other parallel libraries explicitly or implicitly (e.g. OpenMP via another library's call)? Do you use dynamic memory allocation extensively?

You can try the global_control functionality to limit the number of threads but, in my opinion, you need to understand the root cause of the OOM, because 25 to 27 threads is too small a number to be the real root cause.

guozhu_c_ · ‎03-17-2016

In my application, each TBB thread allocates nearly 2GB of dynamic memory in my own code, and the default 24 threads use 60GB of memory together with some other components. The machine has 24 cores and 64GB of memory, so if a 25th thread appears and tries to allocate another 2GB dynamically, memory is exhausted and the process hits OOM. So the problem is caused by the dynamic memory, not by the stack size.

I want to set a hard limit, such as 24, to ensure that my application has at most 24 worker threads. But as I said in the first post, both changing the initial value and hacking the global market's hard limit factor failed. Is there some way to do this?

I read some of the code related to the worker thread limit, and it seems that global_control sets the soft limit?

Thanks for your help.

Alexei_K_Intel · ‎03-17-2016

global_control does indeed set the so-called "soft_limit"; however, that is an implementation term, and it should be thought of as max_allowed_parallelism.

Changing the TBB source code does not seem like an elegant solution. I'd suggest understanding the root cause of the additional threads in your application and avoiding or fixing it (because it is a really rare case). Does your machine have hyper-threading, i.e. if it has 24 cores, does it have 48 hardware threads? How do you detect new threads and allocate buffers for them?

guozhu_c_ · ‎03-17-2016

I detect new threads in this way: in the "callback" function object, which is the third parameter of parallel_for(), I use pthread_getspecific() to get the thread-local pointer. If it is NULL, I allocate the buffer dynamically; otherwise I use the existing one.
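
For readers following along, the lazy per-thread allocation described here might look roughly like the sketch below. All names (threadBuffer, g_bufferKey, g_buffersAllocated, kBufferSize) and the small buffer size are illustrative assumptions, not the poster's actual code; only pthread_once, pthread_key_create, pthread_getspecific, and pthread_setspecific are real POSIX calls.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <pthread.h>
#include <thread>

// Hypothetical names throughout; the size is a small stand-in for ~2GB.
static const std::size_t kBufferSize = 1024;
static std::atomic<int> g_buffersAllocated(0);  // threads that allocated so far
static pthread_key_t g_bufferKey;
static pthread_once_t g_keyOnce = PTHREAD_ONCE_INIT;

// Key destructor: runs when a thread that owns a buffer exits.
static void destroyBuffer(void* p) { std::free(p); }

// Creates the thread-specific key exactly once per process.
static void makeKey() { pthread_key_create(&g_bufferKey, destroyBuffer); }

// Returns the calling thread's buffer, allocating it on first use.
void* threadBuffer() {
    pthread_once(&g_keyOnce, makeKey);
    void* p = pthread_getspecific(g_bufferKey);
    if (p == NULL) {
        // First call on this thread: this is where a "new" thread is detected.
        p = std::malloc(kBufferSize);
        pthread_setspecific(g_bufferKey, p);
        ++g_buffersAllocated;
    }
    return p;
}
```

Because a buffer is created the first time any thread calls threadBuffer() and is freed only when that thread exits, every extra thread that runs the loop body adds one more buffer, which matches the behaviour reported in this thread.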

The 25th thread does not appear very often: about 1 time in 100, or once every 2 to 3 days in the online service, and it causes OOM when it does. It is a rare case, but the result is bad.

The CPU has 12 cores and 24 hardware threads (hyper-threading).

As for why additional threads appear, does TBB follow some general rules? For example, if some threads take too long to execute or wait on a mutex, does TBB decide to create new threads?

Thanks
