I'm using TBB version 2.2, specifically the concurrent_bounded_queue and pipeline classes.
Could one possibly have one task scheduler for multiple Windows XP processes?
Is TBB perhaps not designed for this, but rather for writing a single greatly scalable application?
Thanks,
Mitch
Could one possibly have one task scheduler for multiple Windows XP processes?
Is TBB perhaps not designed for this, but rather for writing a single greatly scalable application?
I think the world is heading towards this but isn't quite there yet. I've seen academic research, and heard of plans from Microsoft, to make the HW thread pool a system-allocated resource. In such an environment the available HW threads could be load-balanced across processes, as well as across different threading libraries within a process (e.g., a common set of threads shared between TBB and OpenMP). For the present, though, it remains a resource issue that is left to the applications to solve.
In your particular case of running multiple copies of the same process, you might be able to "easily" rewrite the current process so that you provide an encapsulating context, declare multiple of these encapsulating contexts within one process, and run the result as a single TBB process (with each encapsulating context representing one of the former processes). This won't work with multiple different applications.
Your use of static variables may or may not affect the difficulty of this approach.
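A minimal sketch of that restructuring, assuming TBB 2.2's parallel_invoke (AppContext, Config, and RunJob are hypothetical names standing in for whatever state each former process owned):

#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_invoke.h"

// Hypothetical: all of a former process's state, moved out of statics.
struct Config { /* per-job settings */ };

class AppContext {
public:
    explicit AppContext(const Config& cfg) : cfg_(cfg) {}
    void Run() { /* the former main(): build and run this job's pipeline */ }
private:
    Config cfg_;
};

struct RunJob {          // functor, so this compiles without C++0x lambdas
    AppContext* ctx;
    void operator()() const { ctx->Run(); }
};

int main() {
    tbb::task_scheduler_init init;   // one scheduler for all former processes
    Config cfg1, cfg2;
    AppContext job1(cfg1), job2(cfg2);
    RunJob r1 = { &job1 }, r2 = { &job2 };
    tbb::parallel_invoke(r1, r2);    // both share the single TBB thread pool
    return 0;
}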
Jim
Are these various processes taking up the whole 3-4GB address space in their separate contexts? And how much physical memory is there to support this? 3-4GB *
Thanks.
"Throttle-down" is the issue at the moment. Creation of the TBB thread pool is still pretty "static" in that you can change the size of the thread pool but only by throwing way the current one and creating a new one. We've actually provided sample code using a dynamic task_scheduler_init object so that it could be easily thrown away and recreated to give an application the ability to set thread pool size. As currently realized, however, it would be up to the application (or cooperating applications) to recognize the oversubscription and back off on their resource use.
But evolution toward a common thread pool has already started, beginning at the leaves, of course: Intel provides an OpenMP runtime library, libiomp5md.dll, that uses an API shared with the OpenMP implementations of the Microsoft and GNU compilers, meaning that whatever mix of compilers you use with OpenMP, their runtime needs can all be handled by a single library. The challenges to come are much larger than this, of course, but they are recognized as the way forward.
Profiling the code shows that almost 50% of the CPU time on an 8-core machine is being spent in the routine "local_wait_for_all", which is internal to the TBB DLL.
My pipeline now has only one item.
The profiler indicates negligible time in my concurrent_bounded_queue "push" or "pop" routines.
What does all that CPU time being spent in "local_wait_for_all" indicate?
I previously used "try_pop" on the unbounded version of concurrent_queue, and my app did the waiting.
That didn't make a difference, so I'm currently letting the TBB "pop" do any (apparently little) waiting.
Mitch, you could try specifying the number of threads manually. If you know the number of parallel processes, you can pass it (or the desired number of worker threads) to each process via the command line and initialize the scheduler with that number.
That way, all the processes together would work on at most 8 threads, i.e. without oversubscription.
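For example (a sketch; the command-line convention and names are assumptions):

#include <cstdlib>
#include "tbb/task_scheduler_init.h"

int main(int argc, char* argv[])
{
    // e.g. launched as "myapp 2" when four copies share an 8-core machine
    int nthreads = (argc > 1)
        ? std::atoi(argv[1])
        : tbb::task_scheduler_init::default_num_threads();
    tbb::task_scheduler_init init(nthreads);
    // ... build and run the pipeline; the pool now has nthreads workers ...
    return 0;
}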
You mean like Apple's Grand Central Dispatch, currently shipping as part of Mac OS X Snow Leopard for Intel-based Macs?
"But it's not quite there yet."
Hmm...
Mitch,
Here is a hack that you can consider.
Assume you have a database in the registry or a file (your choice) that contains a list of records identifying the running applications wishing to share in thread pool co-opetition. Each record contains (your design) something along the lines of:
Process ID
minComputeThreads
maxComputeThreads
IOthreads
priority
(other)
When a process starts, it adds its record to the database; when the process ends, it removes its record.
You write a function that, given the process ID and the current records in the database, together with information about the platform, produces an ideal number of compute threads recommended for the current application.
Assume your apps are all TBB (adapt for mixtures with OpenMP).
At startup, each TBB thread obtains (once, from an InterlockedIncrement) a cardinal thread number (0, 1, 2, ...), and all threads are forced to call:
void ChangeNumberOfParticipants()
{
    // Threads below the ideal count keep working.
    if (CardinalThreadNumber < IdealNumberOfThreads)
        return;
    // This thread is over the limit: leave the active set...
    InterlockedDecrement(&NumberOfParticipatingThreads); // initialized to the total thread count
    // ...and sleep/poll until the ideal count rises to include it again.
    while (CardinalThreadNumber >= IdealNumberOfThreads)
    {
        Sleep(YourPollingFrequencyHere);
        PollForChange(); // may update IdealNumberOfThreads
    }
    InterlockedIncrement(&NumberOfParticipatingThreads);
}
This will cause the excess TBB threads to place themselves into a sleep/poll mode when required.
Periodically, in an outer loop, the master thread (cardinal number 0) checks whether the new ideal number of threads is less than the current number of participating threads; if so, it schedules a ChangeNumberOfParticipants task.
Jim Dempsey
The thing is, the number of threads added by TBB in each app is always 8 (1 per core). I do this by specifying the number of instances in my pipeline class (like nBuffers in the TBB example). If I go below that, I'm specifically reducing the useful throughput, rather than whatever eats up the CPU time when multiple schedulers collide. If I'm running 4 app instances, would the ideal number of threads be 2? If so, I've already drastically reduced the power of the app. In this application, both simple and complex graphics jobs get thrown at the apps. Complex ones need the power of all 8 cores operating within one app. Simple jobs don't need all 8, and their throughput is best increased by having more app instances running.
I produced a non-TBB version of the app for comparison. The TBB version blows it away on complex jobs, being 3-4 times faster on an 8-core machine. (The serial portion, per Amdahl's law, keeps it below 8x.)
For very simple jobs, however, I can start up 6 of the non-TBB apps, and since they don't share resources, I get almost 6x the throughput. Starting up even 4 of the TBB apps on a simple job never gets me more than 2.5x; 6 would probably get me to 2.7x.
I therefore think the only solution to this is the OS-based single task scheduler described as "in the works" above. We'll just have to wait. Meanwhile I'm working on detecting the job type and automatically switching between the TBB and non-TBB versions.
Thanks to all.
I think Jim's code still needs a little work to work, but if you can detect and auto-switch between TBB and non-TBB versions of the code, then you could switch between TBB versions that create smaller thread pools. As Anton and I suggested, applications can control the size of their TBB thread pools by providing a numerical argument to the first task_scheduler_init object they create. I could conceive of a behavior where the application checks on startup for any other copies running, and then chooses either the default number of HW threads (maybe 8) or some smaller number (1 or 2).
Ignoring all the races between processes that such a scheme may ultimately have to deal with, there's still a basic chicken-or-egg problem here: I start up the first copy--no competition--8 threads; I start a second and it takes one or two, and so on, but my machine is still oversubscribed until the first copy finishes. This calls for a more dynamic method, where the original copy can detect the arrival of subsequent copies of the application and somehow dial back its thread pool, by some means like Jim was trying to achieve with his code sample. TBB's non-preemptive scheduler means each worker is (relatively) autonomous, so there might be some wait involved as the terminating thread pool finishes up its assigned tasks. I think some workable scheme might be derived from this, although I think you could spend a lot of time getting the heuristics right for how big the replacement thread pool should be.
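A sketch of just the startup check (the named mutex, its name, and the pool sizes are all assumptions for illustration):

#include <windows.h>
#include "tbb/task_scheduler_init.h"

// Choose a pool size depending on whether another copy is already running.
int ChooseThreadCount()
{
    // A named mutex visible to every copy of the application. CreateMutexA
    // succeeds either way; GetLastError tells us whether another copy
    // created it first. Keep the handle open for the life of the process.
    CreateMutexA(NULL, FALSE, "MyApp_InstanceFlag");
    bool firstCopy = (GetLastError() != ERROR_ALREADY_EXISTS);
    return firstCopy ? tbb::task_scheduler_init::default_num_threads()
                     : 2;   // later copies settle for a small pool
}

// usage: tbb::task_scheduler_init init(ChooseThreadCount());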
Turks,
The TBB app always starts with the number of threads equal to the number of hardware threads (8 in your case). TBB assumes all threads are compute-bound and will not preempt a task once started (except for task cancellation...). TBB makes no guarantees as to the number of threads that will be involved in running your app (the tasks spawned). In the event that all threads but the current one are busy, the current thread will run all the tasks it spawns itself, assuming those tasks complete before the other busy threads free up.
The concept I outlined is to make the threads in excess of the ideal number "busy" in a sleep/poll loop.
In this manner you never change the TBB thread pool size for a running app. In lieu of all the extra threads polling, one could poll while the others block in WaitForSingleObject. Start with the poll technique, then improve the method of altering the running thread set (keeping the TBB thread pool the same size).
"3-4 times faster on an 8-core machine": do you mean a 4-core machine with HT (8 hardware threads), e.g. a single Core i7 920?
If you mean dual-socket x 4-core (8 cores without HT), then 3-4 times faster means
a) memory bound
b) I/O bound
c) too many locks
(or some combination of the above)
I would then suggest using VTune or PTU to help find the hot spots and make corrections there.
I also expect that with a reasonable amount of effort you should attain 85% of full capacity (~6.8x that of a single core).
Jim Dempsey
As others have already said in this discussion, the number of threads created by TBB can be throttled by the parameter to task_scheduler_init. Once there exists an instance of the task_scheduler_init class, all subsequently created instances of it do not affect the number of threads. So in order to resize the TBB thread pool you should first destroy all active task_scheduler_init objects, and then create a new one with the desired number of threads.
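A sketch of both rules (TBB 2.2 API; the sizes are illustrative):

#include "tbb/task_scheduler_init.h"

void ResizeExample()
{
    tbb::task_scheduler_init* first = new tbb::task_scheduler_init(8);
    {
        tbb::task_scheduler_init ignored(2); // a pool already exists: no effect
    }
    delete first;                        // last active object: pool destroyed
    tbb::task_scheduler_init second(2);  // now the pool really has 2 threads
}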
Jim,
It is a dual-socket x 4-core machine without HyperThreading, i.e. there are 8 real cores.
With a single app running, using TBB and allowing 8 objects in the pipeline, getting 3-4x instead of 8x is due to the serial speed limitation of the code feeding the pipeline. If I stub out all the guts of the actual work in the pipeline, the job goes 5x faster, i.e. the serial part takes only 20% of the time. Thus, as Reinders describes Amdahl's Law, the other 80% can be reduced by a factor of 8 for 8 cores, making it take 10%. That plus the serial 20% makes 30%, or about 3.3x faster.
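In Amdahl's-law terms, with serial fraction $s = 0.2$ and $N = 8$ cores, those numbers give:

$$\text{speedup} = \frac{1}{s + (1 - s)/N} = \frac{1}{0.2 + 0.8/8} = \frac{1}{0.3} \approx 3.3\times$$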
I'm fully ok with that.
The slowdown problem I'm having is that when I start up 2-4 of these apps, each using TBB and its own task scheduler, the apps slow down and I get only 1.5-2.5x overall throughput. On an 8-core machine, if I start up 4 non-TBB apps, the apps don't slow down, so I typically get close to 4x overall throughput.
I would be really curious whether anyone has managed to start up 8 separate TBB-based apps and have them run anywhere near 85% of full speed each!
I think the balancing of multiple processes and multiple task_scheduler_init objects is beyond the normal scope of TBB use, at least for me.
The question now becomes,
Why and where is there so much CPU usage (40%-50% even) when running a non-compute-intensive app with TBB on an 8-core machine, with the pipeline limited to at most 1 data item?
If, instead of limiting the pipeline to 1 data item, I turn off TBB and the pipeline entirely (run everything serially), the CPU usage is, as expected, about 12% (1 of 8 cores). Using TBB and limiting the pipeline to 1 data item, to simulate the all-serial case as closely as possible, I see CPU usage at 40%-50%, nothing close to 12%.
Is this TBB overhead, and is it reducible?
Turks,
One of the other forum members can give you a specific answer; I can give you a generalized answer that may point you/them in the right direction.
If I were to guess...
TBB likely has a tuning parameter equivalent to the OpenMP block time. This parameter is used when a thread is unable to locate an additional task to work on: it specifies the length of time, or number of iterations, for which the thread should continue to hunt for new tasks to arrive. After this time or iteration count expires (without the intervening arrival of a new task), the thread suspends itself.
When a single such application is running on your system, the computation time spent hunting for more work interferes only with the system Null Process, meaning you are trading off some kilowatts against reduced latency.
When running your single 8-buffer pipeline, where the feeder is the bottleneck, the worker threads spend a significant amount of time in the search for a task (the block-time hunt-for-task function); I would guess from your figures that a task usually arrives before the block time expires.
This latency-reducing overhead benefits the single TBB application at the expense of available Null Process run time .AND. any other process that wishes to run. Thus, starting up two or more such TBB applications causes a reduction in available processing time for the other TBB applications, and those applications in turn cause a reduction in processing time for your application. This reduction in processing time slows the feeder rate on the input side (and output side, if present) of the pipeline, which then causes longer unproductive burn time in the search for tasks in your TBB app, which does the same to the other TBB apps, and so on back and forth.
If there is such a block time (or an equivalent of OpenMP's "throughput"/"turnaround" modes), experiment with it. If there isn't, dig into the code if you can, and add such an option.
What you want to experiment with: when no task is immediately available, use a very short spin first (typically using _mm_pause()), then fall back to SwitchToThread(), then Sleep(0), then suspend the thread.
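A sketch of that escalation (TaskAvailable() and WaitOnWorkEvent() are hypothetical stand-ins for the scheduler's own hooks; the thresholds are arbitrary):

#include <windows.h>
#include <emmintrin.h>   // _mm_pause

extern bool TaskAvailable();    // hypothetical: is there work to take?
extern void WaitOnWorkEvent();  // hypothetical: block until work is signaled

void WaitForWork()
{
    for (int spin = 0; !TaskAvailable(); ++spin)
    {
        if (spin < 100)
            _mm_pause();        // cheap spin: lowest latency, burns one core
        else if (spin < 1000)
            SwitchToThread();   // yield to any ready thread on this processor
        else if (spin < 10000)
            Sleep(0);           // yield to any ready thread of equal priority
        else
            WaitOnWorkEvent();  // fully suspend until new work arrives
    }
}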
---------------
I think you and the other readers did not quite understand what I was proposing in an earlier post.
As an experiment that can be quite easily done and will illustrate what I am talking about...
Set up your current app to init and run with 8 threads.
Time one instance of app running
Time two instances of the app running
Save the times
Now make a slight modification to your app
At startup, after the init with 8 threads, spawn 4 tasks that sit in a Sleep loop (perhaps watching for a Done flag; a sketch follows below).
Now time two instances of this app running
I expect faster run times than in the prior run of two apps.
The suggestion I made in the earlier post is a means to dynamically change the number of threads entering and exiting this seemingly "useless" sleep-until-told-otherwise task.
The purpose of this task is to "reduce" the wasted burn time looking for tasks that are not present.
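A sketch of such a sleeper task (Done is the hypothetical shutdown flag; the spawning uses one common fire-and-forget idiom of the 2.2 task API, where the parent's extra reference keeps it alive and un-run for the life of the app):

#include <windows.h>
#include "tbb/task.h"

extern volatile LONG Done;   // hypothetical: set nonzero at shutdown

// A task that parks one TBB worker thread in a sleep loop.
class SleeperTask : public tbb::task
{
public:
    tbb::task* execute()
    {
        while (!Done)
            Sleep(10);       // the parked worker burns almost no CPU
        return NULL;
    }
};

// Park 4 of the 8 workers right after task_scheduler_init.
void ParkFourWorkers()
{
    tbb::empty_task* root = new( tbb::task::allocate_root() ) tbb::empty_task;
    root->set_ref_count(4 + 1);   // 4 children + 1 so root never runs
    for (int i = 0; i < 4; ++i)
        root->spawn(*new( root->allocate_child() ) SleeperTask);
}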
Jim Dempsey
Why and where is there so much CPU usage (40%-50% even) when running a non-compute-intensive app with TBB on an 8-core machine, with the pipeline limited to at most 1 data item?
If, instead of limiting the pipeline to 1 data item, I turn off TBB and the pipeline entirely (run everything serially), the CPU usage is, as expected, about 12% (1 of 8 cores). Using TBB and limiting the pipeline to 1 data item, to simulate the all-serial case as closely as possible, I see CPU usage at 40%-50%, nothing close to 12%.
Is this TBB overhead, and is it reducible?
Jim correctly noted that there is a parameter in TBB which determines how long a thread keeps looking for work to arrive. It is hard-coded, not adjustable.
The problem you see is that the current pipeline implementation spawns a new task for each data item passing through the pipeline, and it constantly wakes up all the workers, for no good reason. In other words, the case of a single-data-item limit is unusual, not really expected in production code, so the implementation did not take care of its potential problems.
The good news is that we are going to address this soon.
Also, while I agree in principle that oversubscription might occur in this scenario, I'm not convinced that this is what is happening. If we could measure how much useful work is done per context switch, that would be a good indicator of whether oversubscription is the problem. If I read your post correctly, you have a version that uses threading primitives and not TBB. If we could measure from that application how much useful work is done per quantum, we would have a better idea of what is going on when the same measurement is made on the TBB code. In my opinion we need a good measure that gives us a criterion for comparing the two applications, to find out what is really happening.
I'm not sure whether VTune can give us the right information. Perhaps a good measure would be the number of instructions retired divided by actual thread runtime? Since you have multiple processes and threads, I would think that the sum of instructions retired divided by the sum of runtimes would suffice. If anyone has a better metric, please suggest one.
I'll watch this thread; I'm very curious to see the results of this measurement on your application, if only to satisfy my own curiosity.
Thanks,
AJ
Jim correctly noted that there is a parameter in TBB which determines how long a thread keeps looking for work to arrive. It is hard-coded, not adjustable.
The problem you see is that the current pipeline implementation spawns a new task for each data item passing through the pipeline, and it constantly wakes up all the workers, for no good reason. In other words, the case of a single-data-item limit is unusual, not really expected in production code, so the implementation did not take care of its potential problems.
The good news is that we are going to address this soon.
Hmm, sounds like an impatient programmer symptom.
QuickThread has a tunable parameter (a per-thread wait context), and only wakes up 0 or 1 threads depending on the state of the other thread(s): no wake-up if all are busy, no wake-up if the desired thread is scrounging (in its block-time hunt for a task), and 1 thread woken if that thread was sleeping. This reduces the unproductive overhead of unnecessary context switches (i.e., less interference with other processes running on the system).
At least the TBB team recognizes this and will address this issue.
In the case of running multiple multi-threaded applications, I agree that there is an adverse performance interaction in most task-scheduling systems (TBB, QuickThread, and the newer OpenMP features), where each assumes full system processing resources are available for its use alone. QuickThread will address this issue after we get V1.0 out to end users. The functionality for dynamic tuning/de-tuning is built into QuickThread but is yet to be enabled and thoroughly tested.
Jim Dempsey
