Solved: Re: Many Simultaneous Task Schedulers

turks · ‎08-31-2009

Though I'm getting great scalability within a single process (1 scheduler, 8 cores), it's also important to be able to start up multiple copies of the application. In this case it appears that the TBB schedulers collide with themselves - each one wants to take all the cores/cpu power and divide it among its own threads. Thus when I start up 4 copies of this application, I get no more than 2.5x the throughtput. It should be close to 4x, which is what it would be without TBB.
I'm using Ver 2.2, specifically using the concurrent_bounded_queue and pipeline classes.
Could one possibly have one task scheduler for multiple Windows XP processes?
Is TBB perhaps not designed for this, but rather for writing a single greatly scalable application?
Tnx,
Mitch

robert-reed · ‎08-31-2009

Quoting - turks

Though I'm getting great scalability within a single process (1 scheduler, 8 cores), it's also important to be able to start up multiple copies of the application. In this case it appears that the TBB schedulers collide with themselves - each one wants to take all the cores/cpu power and divide it among its own threads. Thus when I start up 4 copies of this application, I get no more than 2.5x the throughtput. It should be close to 4x, which is what it would be without TBB.

Could one possibly have one task scheduler for multiple Windows XP processes?
Is TBB perhaps not designed for this, but rather for writing a single greatly scalable application?

I think the world is heading towards this but isn't quite there yet. I've seen academic research and heard of plans from Microsoft to make the HW thread pool a system-allocated resource. In such an environment load balance from the available HW threads could be shared across processes as well as through hierarchies using different threaded libraries (e.g., sharing a common set of threads between TBB and OpenMP) within a process. But it's not quite there yet. For the present it remains a resource issue that is left to the applications to solve.

View solution in original post

jimdempseyatthecove · ‎08-31-2009

In your particular case of running multiple copies of the same process you might be able to "easily" rewrite the current process such that you provide an encapsulating context, then declare multiple of these encapsulating context within on process and run the result as one TBB process (with each encapsulating context representing the former processes). This won't work with multipl different applications.

You use of static variables may or may not affect the difficulty of using this approach.

Jim

robert-reed · ‎08-31-2009

Quoting - turks

Though I'm getting great scalability within a single process (1 scheduler, 8 cores), it's also important to be able to start up multiple copies of the application. In this case it appears that the TBB schedulers collide with themselves - each one wants to take all the cores/cpu power and divide it among its own threads. Thus when I start up 4 copies of this application, I get no more than 2.5x the throughtput. It should be close to 4x, which is what it would be without TBB.

Could one possibly have one task scheduler for multiple Windows XP processes?
Is TBB perhaps not designed for this, but rather for writing a single greatly scalable application?

I think the world is heading towards this but isn't quite there yet. I've seen academic research and heard of plans from Microsoft to make the HW thread pool a system-allocated resource. In such an environment load balance from the available HW threads could be shared across processes as well as through hierarchies using different threaded libraries (e.g., sharing a common set of threads between TBB and OpenMP) within a process. But it's not quite there yet. For the present it remains a resource issue that is left to the applications to solve.

turks · ‎08-31-2009

I understand. In this case, even though the applications are the same, they are too complex (multiple .exe spawning) to make one large process. In addition the full 3-4G address space available for a separate processes is a large part of the benefit. Thanks.

robert-reed · ‎08-31-2009

Quoting - turks

I understand. In this case, even though the applications are the same, they are too complex (multiple .exe spawning) to make one large process. In addition the full 3-4G address space available for a separate processes is a large part of the benefit. Thanks.

Are these various processes taking up the whole 3-4GB address space in their separate contexts? And how much physical memory is there to support this? 3-4GB * ? Otherwise aren't you still thrashing memory?

turks · ‎08-31-2009

Yes, they can take up to 3 or more Gb. We are therefore running on 64-bit Windows OS and make sure the machine has enough physical memory. We are not thrashing. I'm glad you think a single scheduler as part of the OS may be in the future. I was hoping that one could perhaps throttle down each task scheduler to take and manage only a portion of the total core resources. I'm now thinking that perhaps the use of the "concurrent_bounded_queue" may be the problem. That would be a different discussion.
Thanks.

robert-reed · ‎08-31-2009

Quoting - turks

I was hoping that one could perhaps throttle down each task scheduler to take and manage only a portion of the total core resources. I'm now thinking that perhaps the use of the "concurrent_bounded_queue" may be the problem. That would be a different discussion.

"Throttle-down" is the issue at the moment. Creation of the TBB thread pool is still pretty "static" in that you can change the size of the thread pool but only by throwing way the current one and creating a new one. We've actually provided sample code using a dynamic task_scheduler_init object so that it could be easily thrown away and recreated to give an application the ability to set thread pool size. As currently realized, however, it would be up to the application (or cooperating applications) to recognize the oversubscription and back off on their resource use.

But evolution to a common thread pool has already started, beginning at the leaves, of course: Intel provides an OpenMP runtime library, libiomp5md.dll, that uses an API shared with implementations of OpenMP bytheMicrosoft and GNU compilers, meaning that whatever mix of compilers you use with OpenMP, their runtime needs can all be handled by a single library. The challenges to come are much larger than this of course, but they are recognized as the way forward.

turks · ‎08-31-2009

A little bit of new information. I simplified the amount of processing by using only ONE task scheduler.
Profiling the code shows that almost 50% of the CPU time on an 8 core machine is being spent in the routine
"local_wait_for_all", which is internal to the tbb_dll.
My pipeline now has only one item.
The profiler indicates negligible time in my concurrent_bounded_queue "push" or "pop" routines.

What does all that CPU time being spent in "local_wait_for_all" indicate?

I previously used the "try_pop" in the unbounded version of the concurrent_queue. My app did the wait.
That didn't make a difference, so I'm currently letting TBB "pop" do any (apparently little) waiting.

Anton_M_Intel · ‎09-01-2009

Quoting - turks

Though I'm getting great scalability within a single process (1 scheduler, 8 cores), it's also important to be able to start up multiple copies of the application. In this case it appears that the TBB schedulers collide with themselves - each one wants to take all the cores/cpu power and divide it among its own threads. Thus when I start up 4 copies of this application, I get no more than 2.5x the throughtput. It should be close to 4x, which is what it would be without TBB.

Mitch, you could try specifying threads number manually. If you know number of parallel processes, you can pass it or number of working threads to the processes via command line and initialize scheduler using this number.

So, all the processes would work together on at most 8 threads - i.e. without oversubscription.

RafSchietekat · ‎09-01-2009

"I've seen academic research and heard of plans from Microsoft to make the HW thread pool a system-allocated resource."
You mean like Apple's Grand Central Dispatch, currently shipping as part of Mac OS X Snow Leopard for Intel-based Macs?

"But it's not quite there yet."
Hmm...

jimdempseyatthecove · ‎09-01-2009

Mitch,

Here is a hack that you can consider.

Assume you have a database in registry or file (your choice) that contains a list of records identifying the running applications wishing to share in a thread pool co-opetition. The database records containing (your design) something along the line of

Process ID
minComputeThreads
maxComputeThreads
IOthreads
priority
(other)

When a process starts, it adds its record to the database, when the process ends it removes its record from the database.

You write a function that given the process ID andcurrent records in the database, together with information about the platform, will produce an ideal number of compute threads recommended for use by the current application.

Assuming your apps are all TBB (adopt for mixtures with OpenMP).

At startup, each TBB thread will obtain (once from an InterlockedIncrement) a cardinal thread number (0, 1, 2, ...). And all threads are forced to call.

void ChangeNumberOfParticipants()
{
if(CardnalThreadNumber < IdealNumberOfThreads)
return;
InterlockedDecrement(&NumberOfParticipatingThreads); // initialized to all threads count
while(CardnalThreadNumber < IdealNumberOfThreads)
{
Sleep(YourPollingFrequencyHere);
PoleForChange();
}
InterlockedIncrement((&NumberOfParticipatingThreads);
}

This will cause the TBB threads to place themselves into a sleep/pole mode when required. And

Periodically, in an outer loop. When the master thread (0 cardinal number) notices when the new ideal number of threads is less thanthe current number of participating threads it schedules a ChangeNumberOfParticipants task.

Jim Dempsey

turks · ‎09-01-2009

Jim, if I understand it right, your hack is what Anton recommends, and you have detailed an implementation in a thread-safe manner. The idea is to manually throttle down each TBB task scheduler instance by specifying a smaller number of participating threads at startup or changing the number periodically on the fly.
The thing is, my number of threads added by TBB in each app is always 8 (1 per core). I do this by specifying the number of instances in my pipeline class (like nBuffers in the tbb example). If I go below threads, I'm then specifically reducing the useful throughput, rather than whatever eats up the CPU time when multiple schedulers collide. If I'm running 4 app instances, would the ideal number of threads be 2? If so, I've already drastically reduced the power of the app significantly. In this application, both simple and complex graphics jobs get thrown at the apps. For complex ones they need the power of all 8 cores operating within one app. Simple jobs don't need all 8 and their throughput is best increased by having more app instances running.

I produced a non-TBB version of the app for comparison. The TBB version blows it away on complex jobs, being 3-4 times faster on an 8 core machine. (The serial portion and Amdahl's law limits it from 8x).
For very simple jobs, however, I can start up 6 of the non-TBB apps and since they don't share resources, I get almost 6x the throughput. Starting up even 4 of the TBB apps on a simple job never gets me more than 2.5x; 6 would probably get me to 2.7x.

I therefore think the only solution to this is the OS-based single task scheduler one described as "in the works" above. We'll just have to wait. I'm working on detecting and auto switching between TBB and non-TBB versions depending on the type of job.

Thanks to all.

robert-reed · ‎09-01-2009

Quoting - turks

I therefore think the only solution to this is the OS-based single task scheduler one described as "in the works" above. We'll just have to wait. I'm working on detecting and auto switching between TBB and non-TBB versions depending on the type of job.

I think Jim's code still needs a little work to work, but if you can detect and autoswitch between TBB and non-TBB versions of the code, then you could switch between TBB versions of the code that create smaller thread pools. As Anton and I suggested, applications can control the size of their TBB thread pools by providinga numerical argument to the first task_scheduler_init object creation that they do. I could conceive of a behavior where the application checks on startup for any other copies running and then chooses either the default number of HW threads (maybe 8) or some smaller number (1 or 2).

Ignoring all the races between processes that ultimately such a scheme may have to deal with there's still a basic chicken-or-egg problem here: I start up the first copy--no competition--8 threads; I start a second and it takes one or two and so on, but my machine is still oversubscribed until the first thread finishes. This calls for a more dynamic method, where the original copy can detect the arrival of subsequent copies ofthe applicationand somehow dial back its thread pool by some means like Jim was trying to achieve with his code sample. TBB's non-premptive scheduler means each worker is (relatively) autonomous, so there might be some wait involved as the terminating thread pool finishes up its assigned tasks. I think some workable scheme might be derived from this although I think you could spend a lot of time getting the heuristics right for how big the replacement thread pool should be.

jimdempseyatthecove · ‎09-01-2009

Turks,

The TBB app always starts with number of threads = number of hardware threads (8 in your case). TBB assumes all threads are compute bound and will not preempt once started (excepting for cancel task...). TBB makes no requirements as to the number of threads that will be involved in running your app (tasks spawned). In the event that all threads but current thread are busy, current thread will run all tasks spawned by current thread assuming those spawned tasks complete prior to other busy threads.

The concept I outlined is to make threads over number of ideal threads to be "busy" in sleep/poll loop.

In this manner you never change the TBB thread pool size for a running app. In lieu of all extra threads polling, one could poll and the others could WaitForSingleObject. Start with the poll technique then improve the method of altering the running thread set. (keeping the TBB thread pool the same size).

"3-4 times faster on 8-core machine" Do you mean 4-core machine with HT (8 hardware threads) e.g. Single Core i7 920?.

If you mean dual socked x 4-core (8 cores without HT) then 3-4 times faster means

a) memory bound
b) I/O bound
c) too many locks
(and combinations of above)

I would then suggest using VTune or PTU to help find the hot spots and make corrections there.
I also expect that with reasonable amount of effort you should attain 85% of full capacity (~6.8x that of single core)

Jim Dempsey

Alexey-Kukanov · ‎09-02-2009

Quoting - turks

The thing is, my number of threads added by TBB in each app is always 8 (1 per core). I do this by specifying the number of instances in my pipeline class (like nBuffers in the tbb example). If I go below threads, I'm then specifically reducing the useful throughput, rather than whatever eats up the CPU time when multiple schedulers collide.

It sounds like you specify the argument to the method run() of tbb::pipeline, and expect it to set the number of threads. This is not the case actually. This parameter to run() is there to limit the number of data items simultaneously processed by the pipeline, primarily to limit the required amount of memory or other resources. Without the limit, it is theoretically possible that a pipeline starts processing a huge number of tokens and e.g. runs out of memory. Generally, it is recommended to choose this parameter basing on resource usage; if it exceeds the number of cores, it gives more opportunities to balance the load between threads.

As others already said in this discussion, the number of threads created by TBB can be throttled by the parameter to task_scheduler_init. Once there exists an instance ofthe task_scheduler_init class, all subsequently created instances of itdo not affect the number of threads. So in order to resize the TBB thread pool you should first destroy all active task_scheduler_init objects, and then create a new one with the desired number of threads.

turks · ‎09-03-2009

Jim,
It is a dual socketed x 4-core machine without HyperThreading, i.e. there are 8 real cores.
Witha single app running, using TBB, and allowing 8 objects in the pipeline, the 3-4x instead of 8 is due to the serial speed limitation of the code feeding the pipeline. If I stub out all the guts of actual work in the pipeline, the job goes 5x faster, i.e. it only take 20% of the time. Thus as Reiners describs Amdahl's Law, the other 80% can be reduced by a factor of 8 for 8 cores making it take 10%. That plus the serial 20% makes 30% or about 3.3x faster.
I'm fully ok with that.

The slowdown problem I'm having is when I start up 2-4 of these apps, each using TBB and its own task_scheduler,
the apps slow down and I get only 1.5-2.5x overall throughput. On an 8-core machine, if I start up 4 non-TBB based apps, the apps don't slow down and so I typically get close to 4x overall throughput.

I would be real curious if anyone has managed to start up 8 separate TBB-based apps and have them run anywhere near 85% of full speed each!

turks · ‎09-03-2009

I actually don't care about the actual number of threads. As noted above an 8-core machine will start up with 8 threads. I did not see any speed benefits from having more items in the pipeline than the number of actual hardware cores, so I did limit it to 8 max to save memory and other resources as you say.

I think the balancing of multiple processes and multiple task_scheduler_init objects is beyond the normal scope of TBB use, at least for me.

The question now becomes,
Why and where is there so much CPU usage (40%-50% even) when running a non-compute-intensive app with TBB on an 8-core machine and I limit the pipeline to having only 1 data item in it at most?
If, instead of limiting the pipeline to 1 data item, I instead turn off TBB and the pipeline (run everything serially), the CPU usage is, as expected, about 12% (1 of 8 cores). Using TBB and limitting the pipeline to 1 data item, to simulate the everything serial case as much as possible, I see CPU resource usage at 40%-50%, nothing close to 12%.

Is this TBB overhead and is it reduceable?

jimdempseyatthecove · ‎09-03-2009

Turks,

One of the other forum members can give you a specific answer, I can can give you a generalized answer that can point you/them in the right direction.

If I were to guess...

TBB likely has a tuning parameter equivalent to the OpenMP block time. This tuning parameter is used when a thread is unable to locate an additional task to work on. This parameter specifies the length of time or number of iterations the thread should continue to hunt for new tasks to arrive. After this time or number of iterations expire (without intervening arrival of new task), the thread then suspends.

When a single suchapplication is running on your system, the computation time spent hunting for more work interferes with the system Null Process. Meaning you are trading off some kilowatts against reduced latency.

When running your single 8-buffer pipeline, where the feeder is the bottleneck, the worker threads are spending a significant amount of time in the search for task (block time hunt for task function), I would guess from your figures a task arrives before expiration of the block time.

This, latency reducing overhead, introduced into the single TBB application reduces latency for that single TBB application at the expense of available Null Process run time .AND. any other process that wishes to run. Thus, starting up two or more such TBB applications, causes a reduction in available processing time for the other TBB applications, and those applications in return cause a reduction in processing time for your applicaiton. This reduction in processing time, now causes an increase in the time, or should I say a reduction in the feeder rate on the input side (and output side if present) of the pipe line, which then causes longer unproductive burn time in the in the search for task in your TBB app, which does the same for the other TBB apps, and likewise back and forth.

If there is such a block time (or equivalent to OpenMP "throughput"/"turn around") then experiment with that. If there isn't then dig into the code if you can, and add such an option.

What you want to experiment with is when no immediate task, use a very short block time first (typically using _mm_pause()), then fall back on using SwitchToThread(), then Sleep(0), then suspend thread.

---------------

I think you and the other readers did not quite understand what I was proposing in an earlier post.

As an experiment that can be quite easily done and will illustrate what I am talking about...

Set up your current app to init and run with 8 threads.

Time one instance of app running
Time two instances of the app running

Save the times

Now make a slight modification to your app

At startup, after init with the 8 threads, spawn 4 tasks that sit in Sleep loop(perhaps looking for Done flag).

Now time two instances of this app running

I expect, faster run times than in the prior run of two apps

The suggestion I made in the earlier post, is a means to dynamically change the number of threads entering and exiting this seemingly "useless" sleep till told not to task.

The purpose of this task is to "reduce" the wasted burn time looking for task that are not present.

Jim Dempsey

Alexey-Kukanov · ‎09-04-2009

Quoting - turks

The question now becomes,
Why and where is there so much CPU usage (40%-50% even) when running a non-compute-intensive app with TBB on an 8-core machine and I limit the pipeline to having only 1 data item in it at most?
If, instead of limiting the pipeline to 1 data item, I instead turn off TBB and the pipeline (run everything serially), the CPU usage is, as expected, about 12% (1 of 8 cores). Using TBB and limitting the pipeline to 1 data item, to simulate the everything serial case as much as possible, I see CPU resource usage at 40%-50%, nothing close to 12%.

Is this TBB overhead and is it reduceable?

Jim correctly noted that there is a parameter in TBB which determines how much time a thread looks for work to arrive somewhere. It is hard-coded, not adjustable.

The problem you see is that the current pipeline implementation spawns a new task for each data item passing through the pipeline, and it constantlywakes up all workers, for no good reason. In other words, the case of a single data item limitation is unusual, not really expected in production code, so the implementation did not care of its potential problems.

The good news is that we are going to address this soon.

AJ13 · ‎09-04-2009

I've read this post briefly, so perhaps I've missed a detail here. I noticed that you stated that you are using multiple processes so that you have a larger address space. Could you profile to determine how much of a role memory swapping is playing here?

Also, while I agree in principle that oversubscription might occur in this scenario, I'm not convinced that this is what is happening. If we could measure how much useful work is done per context switch, that would be a good indicator as to whether or not oversubscription is the problem. If I read your post correctly, you do have a version that uses threading primitives and not TBB. If we could measure from that application how much useful work is done per quantum, I feel we would have a better idea as to what is going on when the same measurement is made on the TBB code. In my opinion we need a good measure that gives us a criteria upon which to make a comparison between the applications, to find what is really happening.

I'm not sure if VTune can give us the right information or not. Perhaps a good measure would be the number of instructions retired divided by actual thread runtime? Since you have multiple processes and threads, I would think that a sum of instructions retired by the sum of runtimes would suffice. If anyone has a better metric, please suggest one.

I'll watch this thread, I'm very curious to see the results of this measurement on your application, if only just to satisfy my own curiosity.

Thanks,

AJ

jimdempseyatthecove · ‎09-04-2009

Quoting - Alexey Kukanov (Intel)

Jim correctly noted that there is a parameter in TBB which determines how much time a thread looks for work to arrive somewhere. It is hard-coded, not adjustable.

The problem you see is that the current pipeline implementation spawns a new task for each data item passing through the pipeline, and it constantlywakes up all workers, for no good reason. In other words, the case of a single data item limitation is unusual, not really expected in production code, so the implementation did not care of its potential problems.

The good news is that we are going to address this soon.

Hmm, sounds like an impatient programmer symptom.

QuickThread has a tunable parameter (per thread wait context), and only wakes up 0 or 1 thread depending on state of other thread(s). No wake up if all are busy, no wake up if desired thread is scrounging (in block time hunt for task), 1 thread if thread were sleeping. This reduces the unproductiveoverhead in unnecessary context switches. (i.e. less interference to other processes running on the system).

At least the TBB team recognizes this and will address this issue.

In the case of running multiple multi-threaded applications, I agree that there is an adverse performance interaction using most task scheduling systems (TBB, QuickThread, and newer OpenMP features) where each assumes full system processing resources are available for their use alone. QuickThread will address this issue after we get V1.0 out to end users. The functionality for implementing dynamically tuning/de-tuning is built in to QuickThreadbut is yet to be enabled and thoroughly tested.

Jim Dempsey