Re: tbb::pipeline instance using excessive CPU when idle

James_Grimplin · ‎06-01-2009

Wenoticed our application that is built on the tbb::pipeline class was using excessive processing cycles in an interesting and predictable manner. After attempting to localize the problem in our code without success, we attempted to duplicate the problem with the text_filter example provided in the distribution and we succeeded.

First, the description of the behavior is basically that the CPU (or a core) is kept busy doing nothing when the pipeline should be idle as there is no input data available and all processing for the previous data is complete. However, this excessive utilization only occurres on every other input data.If the data rate is slow enoughone can watch the CPU utilization bounce from 0 to 50 (on my Core 2 Duo).

I have included the slightly modified text_filter.cpp to illustrate the issue. It was modified to allow more time between iterations (so you can actually see the CPU activity) and to not run in serial mode first. Note that this is the same behavior we see in our application and my input data rate is being controlled externally (socket interface) and not with artifical delays.

Any thoughts on why the pipeline is behaving in this manner?

Modification summary:
- added #include "tbb/tbb_thread.h"
- added in MyOutputFilter::operator():
tbb::this_tbb_thread::sleep(tbb::tick_count::interval_t (10.0));
printf("Done\n");
- commented out serial run in main()

Here is the full modified text_filter.cpp example code:

#include "tbb/pipeline.h"
#include "tbb/tick_count.h"
#include "tbb/task_scheduler_init.h"
#include "tbb/tbb_thread.h"
#include
#include
#include
#include

using namespace std;

//! Buffer that holds block of characters and last character of previous buffer.
class MyBuffer {
static const size_t buffer_size = 10000;
char* my_end;
//! storage[0] holds the last character of the previous buffer.
char storage[1+buffer_size];
public:
//! Pointer to first character in the buffer
char* begin() {return storage+1;}
const char* begin() const {return storage+1;}
//! Pointer to one past last character in the buffer
char* end() const {return my_end;}
//! Set end of buffer.
void set_end( char* new_ptr ) {my_end=new_ptr;}
//! Number of bytes a buffer can hold
size_t max_size() const {return buffer_size;}
//! Number of bytes appended to buffer.
size_t size() const {return my_end-begin();}
};

static const char* InputFileName = "input.txt";
static const char* OutputFileName = "output.txt";

class MyInputFilter: public tbb::filter {
public:
static const size_t n_buffer = 8;
MyInputFilter( FILE* input_file_ );
private:
FILE* input_file;
size_t next_buffer;
char last_char_of_previous_buffer;
MyBuffer buffer[n_buffer];
/*override*/ void* operator()(void*);
};

MyInputFilter::MyInputFilter( FILE* input_file_ ) :
filter(serial_in_order),
next_buffer(0),
input_file(input_file_),
last_char_of_previous_buffer(' ')
{
}

void* MyInputFilter::operator()(void*) {
MyBuffer& b = buffer[next_buffer];
next_buffer = (next_buffer+1) % n_buffer;
size_t n = fread( b.begin(), 1, b.max_size(), input_file );
if( !n ) {
// end of file
return NULL;
} else {
b.begin()[-1] = last_char_of_previous_buffer;
last_char_of_previous_buffer = b.begin()[n-1];
b.set_end( b.begin()+n );
return &b;
}
}

//! Filter that changes the first letter of each word from lower case to upper case.
class MyTransformFilter: public tbb::filter {
public:
MyTransformFilter();
/*override*/void* operator()( void* item );
};

MyTransformFilter::MyTransformFilter() :
tbb::filter(parallel)
{}

/*override*/void* MyTransformFilter::operator()( void* item ) {
MyBuffer& b = *static_cast(item);
int prev_char_is_space = b.begin()[-1]==' ';
for( char* s=b.begin(); s!=b.end(); ++s ) {
if( prev_char_is_space && islower((unsigned char)*s) )
*s = toupper(*s);
prev_char_is_space = isspace((unsigned char)*s);
}
return &b;
}

//! Filter that writes each buffer to a file.
class MyOutputFilter: public tbb::filter {
FILE* my_output_file;
public:
MyOutputFilter( FILE* output_file );
/*override*/void* operator()( void* item );
};

MyOutputFilter::MyOutputFilter( FILE* output_file ) :
tbb::filter(serial_in_order),
my_output_file(output_file)
{
}

void* MyOutputFilter::operator()( void* item ) {
MyBuffer& b = *static_cast(item);
int n = fwrite( b.begin(), 1, b.size(), my_output_file );
if( n<=0 ) {
fprintf(stderr,"Can't write into %s file\n", OutputFileName);
exit(1);
}
tbb::this_tbb_thread::sleep(tbb::tick_count::interval_t (10.0));
printf("Done\n");
return NULL;
}

static int NThread = tbb::task_scheduler_init::automatic;
static bool is_number_of_threads_set = false;

void Usage()
{
fprintf( stderr, "Usage:\ttext_filter [input-file [output-file [nthread]]]\n");
}

int ParseCommandLine( int argc, char* argv[] ) {
// Parse command line
if( argc> 4 ){
Usage();
return 0;
}
if( argc>=2 ) InputFileName = argv[1];
if( argc>=3 ) OutputFileName = argv[2];
if( argc>=4 ) {
NThread = strtol(argv[3],0,0);
if( NThread<1 ) {
fprintf(stderr,"nthread set to %d, but must be at least 1\n",NThread);
return 0;
}
is_number_of_threads_set = true; //Number of threads is set explicitly
}
return 1;
}

int run_pipeline( int nthreads )
{
FILE* input_file = fopen(InputFileName,"r");
if( !input_file ) {
perror( InputFileName );
Usage();
return 0;
}
FILE* output_file = fopen(OutputFileName,"w");
if( !output_file ) {
perror( OutputFileName );
return 0;
}

// Create the pipeline
tbb::pipeline pipeline;

// Create file-reading writing stage and add it to the pipeline
MyInputFilter input_filter( input_file );
pipeline.add_filter( input_filter );

// Create capitalization stage and add it to the pipeline
MyTransformFilter transform_filter;
pipeline.add_filter( transform_filter );

// Create file-writing stage and add it to the pipeline
MyOutputFilter output_filter( output_file );
pipeline.add_filter( output_filter );

// Run the pipeline
tbb::tick_count t0 = tbb::tick_count::now();
pipeline.run( MyInputFilter::n_buffer );
tbb::tick_count t1 = tbb::tick_count::now();

// Remove filters from pipeline before they are implicitly destroyed.
pipeline.clear();

fclose( output_file );
fclose( input_file );

if (is_number_of_threads_set) {
printf("threads = %d time = %g\n", nthreads, (t1-t0).seconds());
} else {
if ( nthreads == 1 ){
printf("single thread run time = %g\n", (t1-t0).seconds());
} else {
printf("parallel run time = %g\n", (t1-t0).seconds());
}
}
return 1;
}

int main( int argc, char* argv[] ) {
if(!ParseCommandLine( argc, argv ))
return 1;
if (is_number_of_threads_set) {
// Start task scheduler
tbb::task_scheduler_init init( NThread );
if(!run_pipeline (NThread))
return 1;
} else { // Number of threads wasn't set explicitly. Run single-thread and fully subscribed parallel versions
{ // single-threaded run
//tbb::task_scheduler_init init_serial(1);
//if(!run_pipeline (1))
// return 1;
}
{ // parallel run (number of threads is selected automatically)
tbb::task_scheduler_init init_parallel;
if(!run_pipeline (0))
return 1;
}
}
return 0;
}

Alexey-Kukanov · ‎06-01-2009

I think the root of the problem is that wecouldnot allow the master thread falling asleep inside the scheduler; so it spins even if no work is available. If we would put it to sleep we would have to invent a complicated logic to wake it up in all proper cases, otherwise the program would not complete at all.

Therefore ifa worker thread executes the input stage and get blocked there, the master thread will still do busy-waiting. But if the master thread blocks, all workers will fall asleep. I think that explains the "instability" of timing you got.

RafSchietekat · ‎06-01-2009

TBB and blocking don't mix very well. Maybe you should consider waiting for a limited time only (zero or non-zero), and then restarting the pipeline when new data is detected, if you can suffer the extra latency associated with draining the pipeline (average and maximum). Depending on the statistical characteristics of the input, some tuning of the input time limit may be required for best results. Even if you aren't looking at energy consumption, a stalled pipeline should be stopped to prevent different degrees of starvation elsewhere in the program.

James_Grimplin · ‎06-02-2009

Well that is very interesting information and it does explain the behavior we are seeing. However, I would have to say that although this behavior may be expected, given insight into the implementation, it is a shortcoming of the implementation. I have not looked at the implementation so I can not comment on why, or even how, it does what it does, but I believe that TBB needs deal with blocking (I/O in particular)for it to be a successful abstraction allievating the programmer of the details of threaded programming. The pipeline abstraction appeared to be a very good match for our number crunching application allievating us of the glue and housekeeping code of a traditional threaded approach. It did in fact greatly reduce the amount of code we had to generate, but unless I can find a workaround I fear that we are going to have to reimplement the application in the traditional manner wherewe can insure that idle tasks (threads) are in fact idle.

Starting and stopping the pipeline is not a reasonable approach from my perspective. Again, it basically complicates what seemed to be an elegant solution. I could easily cause the pipeline to terminate, but all of the additional changes that would be required to deal with the input interface would be complicated by the fact that things are starting up and terminating repeatedly.

Any other ideas on a workaround?

Thanks.

RafSchietekat · ‎06-03-2009

"our number crunching application"
Then where does the blocking arise?

James_Grimplin · ‎06-03-2009

Quoting - Raf Schietekat

"our number crunching application"
Then where does the blocking arise?

Blocking occurres on the input. I do not control the rate at which data arrives. It's variable and we block (via a select) on a socket waiting for the arrival of the data. Generally two "packets" of data will arrive one after the other and we then can process them concurrently (minimizing latency) before sending the processed data on to the final output stage, which is also a socket interface. The data processing is pure number crunching. It could be further decomposed into multiple pipeline stages, but initially we choose not to rework the algorithm until we determined if the pipeline approach would work in our aplication. Future uses of the application will provide an increase in the input data rate. I believe the pipeline architecture would allow us to deal with the increased data rate (and increased processing) by either providing the ability to run more processing stages concurrently, and/or allow us to further decompose the processing into multiple stages.

So even if wewere to move the socket interface out of the input stage andimplement multiple stages for the processing, the data feed to the pipeline will still be be variable resulting in a situation where the pipline is idle at times and the master thread is spinning. There is no way to eliminate this variability in our application.

RafSchietekat · ‎06-03-2009

"Starting and stopping the pipeline is not a reasonable approach from my perspective."
It's somewhat unfortunate, but it seems difficult to find a generally applicable mechanism that's also at the same time the best performing and the most elegant in each particular situation, so this one doesn't really look unreasonable to me, although I wouldn't be surprised, just interested, if somebody could come up with something better. We're not yet talking about several pipelines getting entangled and starving each other, just a few more lines of code...

"I do not control the rate at which data arrives."
Still, the statistical characteristics are important for deciding on a good solution. For example, if you have long waits between bursts, it makes more sense to drain and stop the pipeline. Otherwise there may be an optimum timeout based on the statistical characteristics of the input, the acceptable latency, the amount of work for each data item, the level of optional parallelism for each work item, the number of cores, and perhaps other parameters. Difficult to say. But please take this only at face value and make up your own mind!

Still, perhaps if a task could say, hey, I'm going to be waiting now, and when I wake up I'm not going to be doing a whole lot to compete with other threads until I finish, so meanwhile you might as well dispatch another worker from among the excess ones, TBB might just be more blocking-friendly than it is now...

James_Grimplin · ‎06-03-2009

Characteristics of the data are straightforward:
- 2 packets (~10KBytes) received periodically (350 ms interval,shortestperiod today)
- Data rate is generally constant for some period of time (minutes to hours)
- Data rate can slow to a pair of packets every >1 s
- Each input data packet is processed and generates an output data packet of ~2 MBytes
- The processing stage takes ~50 msec to process an input packet into an output packet (3.17 GHz Core 2 Duo)
- The processing of a packet is primarily a sequential operation (it can be divided into stages)

Yes, I could have implemented this as a single thread process that waits for a message, processes it, and finally outputs the processed data before looping back and starting all over. However, that implementation has scaling issues as we move forward in terms of increased data rates and sizes as well as algorithim refinements that may require additional processing time. Additionally, there are other aspects of the application that complicate the solution.

"just a a few more lines of code..."
Well the input interface is a bit more complicated than may be apparent from my simple description. Additionally, there would be more overhead in this approach. I'm not sure if just re-running a cleared pipeline would still incurr the initial startup overhead we have seen, but there is at least some additional burden with the approach although it may be less than the wasted CPU cycles currently being burned. It's not that the approach itself is completely unreasonable, I just think the benefits of the pipeline may be starting to evaporate. The appealing aspects of the pipeline were minimal management and code while being provided concurrency.

"Still, perhaps if a task could say, hey..."
Sorry, I'm not following the though in this paragraph.

James_Grimplin · ‎06-03-2009

"- The processing stage takes ~50 msec to process an input packet into an output packet (3.17 GHz Core 2 Duo)"

Sorry, I only described the smaller input data packet. The worst case processing timing is ~150 msec per input packet for our large (~100 KBytes) packets, whileour smaller input packet (~10KBytes) can be processed in ~50 msec. Its with the smaller packets, and thus lower processing time, where we see the impact of the pipeline thread spinning since there is significantly more idle time in that scenario.

RafSchietekat · ‎06-04-2009

"Characteristics of the data are straightforward:"
Those are indeed long waits in computer time. I would still think of running select for 50-100 ms (ballpark, should be corrected by tuning, I suppose) before shutting down the pipeline (hmm, how long does a restart take? what are the latency requirements?). But if arrival times are predictible, they could be used instead of simple fixed timeouts.

"I'm not sure if just re-running a cleared pipeline would still incurr the initial startup overhead we have seen"
I hope you're not confusing pipeline startup with task_scheduler_init startup... a good approach is to have at least one long-lived task_scheduler_init somewhere in main(), i.e., for the lifetime of the process.

"there is at least some additional burden with the approach although it may be less than the wasted CPU cycles currently being burned"
Is that an environmental or electricity cost concern (I'm all for that resp. sympathetic), or related to other processes on the same system? The main thread wouldn't be violently twiddling its thumbs if there were an eligible task somewhere ready for the stealing.

"The appealing aspects of the pipeline were minimal management and code while being provided concurrency."
You might also have a look at parallel_while (which might as well be undeprecated), but I guess it would have the same problem (no time to go and verify that now, though).

"Sorry, I'm not following the though in this paragraph."
That was just about a possible change to TBB itself, although minutes after that I started to wonder what it would do to the transitive relation for the sake of affinity between tasks and caches by way of threads... Another suggestion I would make is to have a way to concurrently revive a pipeline while it is shutting down; I might have a look at that one myself sometime.

OK, a possible workaround to reduce the spinning: when you detect that there is no work pending, check whether you're in the main thread; if you're not, send a dummy through the pipeline, thus giving another thread, hopefully the main thread, the chance of detecting the same condition, at which time you just block.

Vivek_Rajagopalan · ‎06-04-2009

Quoting - Raf Schietekat

OK, a possible workaround to reduce the spinning: when you detect that there is no work pending, check whether you're in the main thread; if you're not, send a dummy through the pipeline, thus giving another thread, hopefully the main thread, the chance of detecting the same condition, at which time you just block.

Sorry for butting in but we have a similar issue with pipeline blocking and your workaround sounds like it fits the bill. JUst some questions.

1) It sounds like if the main thread (the thread that did task_scheduler_init and started the pipeline) blocks on input, then the pipeline wont spin. On the other hand if a worker thread blocks on the input filter first the pipeline will spin. I hope I read your message correctly ?

2) How do we determine whether or not a particular filter is executing in the main thread ? (Sorry if this is a basic question, I cant find an answer in the doc or in the O Reilly book)

3) If we send a sufficiently large number of such dummies (say 1000), can we be sure of the main thread seeing one and blocking ? Is there a way to send a dummy specifically targeted at the main thread ?

My problem is similar to the OP's i.e input blocking. I will post details in another post because I dont want to steal the OP's thread (nopun).

RafSchietekat · ‎06-04-2009

"1) It sounds like if the main thread (the thread that did task_scheduler_init and started the pipeline) blocks on input, then the pipeline wont spin. On the other hand if a worker thread blocks on the input filter first the pipeline will spin. I hope I read your message correctly ?"
That's what I understood from Alexey Kukanov in #1.

"2) How do we determine whether or not a particular filter is executing in the main thread ? (Sorry if this is a basic question, I cant find an answer in the doc or in the O Reilly book)"
Record the main thread's tbb_thread::get_id() for later reference, I suppose.

"3) If we send a sufficiently large number of such dummies (say 1000), can we be sure of the main thread seeing one and blocking ? Is there a way to send a dummy specifically targeted at the main thread ?"
No, and therefore the total busy-waiting may get a lot worse, but I think the likelihood of that happening may be low enough to try this (a necessary condition is that processing of data items be bounded). But just to be sure, when I suggested to "send a dummy through the pipeline", I meant a specially marked pointer or referent (NULL from the input filter is reserved for shutting down the pipeline), that later filters know to ignore, just to give hopefully the main thread a chance to execute the input filter (wait with zero or short timeout, if nothing detected and also the main thread block forever). This is only if the pipeline is the main event loop of the process, of course, otherwise other complications may arise. And note that it's just an idea, I haven't tried it myself yet, and I would still go with my first suggestion instead (shut it down), or try something more radical.

"My problem is similar to the OP's i.e input blocking. I will post details in another post because I dont want to steal the OP's thread (nopun)."
I myself would only look at whether the subject/title covers it (create tasks, not threads). And I'll pretend that you didn't suggest that a thread could get stolen (in TBB, it's the thread that does the stealing). :-)

(Added) It also seems fairly easy to see, unless that's just because I would be mistaken,that for this workaround to be at all beneficial, the amount of time spent on work vs. being blocked must be low enough, going down with the number of threads involved; it may help to use a non-zero timeout.

James_Grimplin · ‎06-05-2009

Here is a new development in my pipeline adventure...

On my Core 2 Duo my application performs generally as I indicated in an earlier post (~150 ms per packet @ 2 packets every ~350 ms). Now this "consumed" about 60 % of the available CPU performance as seen from the Windows Task Manager and ran with about six (6) threads from the Task Manager view.

We moved the application (no changes) to a ~3 GHz Dual Quad Core Xeon (its eventual target) for initial testing. I assumed that the "processing" would improve slightly and our "excessive" CPU utilization would drop to about 15% (total over 8 cores) when all the conditions were right. Well, the excessive utilization estimate was pretty much spot on. What I did not expect was that the "total" CPU utilization indicated by Windows Task Manager was still at 60% and the processing time per packet was not really any better. Additionally, Task Manager now reports somewhere in the neighborhood of 60 threads...

Now this is not sitting very well with me at the moment (and we need to get some more info, but it was the end of the day). I expected minimal usage of the new systems CPU's, but at the same data rates its utilization isn't changing, something is amiss.

BTW, I can account for three of the six threads on the Core 2 Duo (main and two tbb_threads I manually create) so I'm assuming the pipeline is creating the other three.

Anyway, life keeps being interesting...

jimdempseyatthecove · ‎06-05-2009

James,

I was hoping that Alexy or Robert would be able to help you with your pipeline problem using TBB.

In QuickThread (a tool I wrote) we have two classes of tasks: Compute and I/O so the I/O portions of the pipeline application can be performed by the I/O clases of threads. Also, althoughthe "main" which generally launches the thread pool, but once the pool islaunched, there is no "main" thread, at least from a task scheduling viewpoint. The launching thread, which may be the main thread or other thread launched by the main thread, does have the responsibility of enqueuing the top-level task (or simply assuming the role of the top-level task). Other than that, there is no other distinction. As long as any task are remaining to run, none of thethreads will end. When task stealing cannot find a task to run (and after a tunable spinwait time) the thread suspends on WaitForSingleEvent. Which will occure if athread is waiting and an appropriate task is enqueued. Other than for a short spinwait time you won't see a long burn time when insufficient tasks are enqueued and threads are starved for work. In the case of the pipeline, as long as any tasks remain, be they I/O or compute class, the entire thread pool remains intact. When threads starve for work, they suspend, when new work arrives, they are awoken.

Rewriting the TBB sample upcase words test and running on a 4 core Q6600 (XP x64). 1,000,000 lines of "the quick brown fox jumped over the lazy grey dog's back nnn.... " where nnn... is a sequence number to assure the upcased outputbuffers were collated properly(file size is 64.345MB)

[cpp]SerialRunUpcaseWordsTest =   5.440969 seconds
ParallelRunUpcaseWordsTest = 1.415301 seconds
Serial/Parallel = 3.844389

[/cpp]

BTW, the I/O buffer sizes are 10,000 bytes (same as in TBBexample). A larger buffer might have yielded better performance (i.e. multiple of sector size). I have not run this against the equivilent TBB sample yet.

(add files does not seem to open a browse, else I would have attached the QuickThread version of the parallel pipeline upcase words test for you to see the difference in programming style)

Jim Dempsey

Alexey-Kukanov · ‎06-06-2009

Quoting - James Grimplin

Here is a new development in my pipeline adventure...

On my Core 2 Duo my application performs generally as I indicated in an earlier post (~150 ms per packet @ 2 packets every ~350 ms). Now this "consumed" about 60 % of the available CPU performance as seen from the Windows Task Manager and ran with about six (6) threads from the Task Manager view.

We moved the application (no changes) to a ~3 GHz Dual Quad Core Xeon (its eventual target) for initial testing. I assumed that the "processing" would improve slightly and our "excessive" CPU utilization would drop to about 15% (total over 8 cores) when all the conditions were right. Well, the excessive utilization estimate was pretty much spot on. What I did not expect was that the "total" CPU utilization indicated by Windows Task Manager was still at 60% and the processing time per packet was not really any better. Additionally, Task Manager now reports somewhere in the neighborhood of 60 threads...

Now this is not sitting very well with me at the moment (and we need to get some more info, but it was the end of the day). I expected minimal usage of the new systems CPU's, but at the same data rates its utilization isn't changing, something is amiss.

BTW, I can account for three of the six threads on the Core 2 Duo (main and two tbb_threads I manually create) so I'm assuming the pipeline is creating the other three.

Anyway, life keeps being interesting...

If you initialize TBB by default (with no explicit arguments passed to task_scheduler_init), it will create 1 worker thread on Intel Core 2 Duo processor, and 7 worker threads on the second system you mentioned.

My guess about ~60 threads is that you use another parallel library, such as OpenMP, either implicitly or explicitly, and it initializes its own thread pool for each thread that uses it. Thus the total number of threads is around square of the number of cores.

James_Grimplin · ‎06-06-2009

Quoting - Alexey Kukanov (Intel)

If you initialize TBB by default (with no explicit arguments passed to task_scheduler_init), it will create 1 worker thread on Intel Core 2 Duo processor, and 7 worker threads on the second system you mentioned.

My guess about ~60 threads is that you use another parallel library, such as OpenMP, either implicitly or explicitly, and it initializes its own thread pool for each thread that uses it. Thus the total number of threads is around square of the number of cores.

Yes, I forgot to mention that we are using the MKL, so it is probably generating some threads...but that's a lot of threads. The TBB thread count you mention is generally what I would have expected.

Anyway, it does not explain why the 8 core machine is being utilized at 60% to perform the same job that consumes 60% of a 2 core system.

James_Grimplin · ‎06-06-2009

Quoting - jimdempseyatthecove

James,

I was hoping that Alexy or Robert would be able to help you with your pipeline problem using TBB.

In QuickThread (a tool I wrote) we have two classes of tasks: Compute and I/O so the I/O portions of the pipeline application can be performed by the I/O clases of threads. Also, althoughthe "main" which generally launches the thread pool, but once the pool islaunched, there is no "main" thread, at least from a task scheduling viewpoint. The launching thread, which may be the main thread or other thread launched by the main thread, does have the responsibility of enqueuing the top-level task (or simply assuming the role of the top-level task). Other than that, there is no other distinction. As long as any task are remaining to run, none of thethreads will end. When task stealing cannot find a task to run (and after a tunable spinwait time) the thread suspends on WaitForSingleEvent. Which will occure if athread is waiting and an appropriate task is enqueued. Other than for a short spinwait time you won't see a long burn time when insufficient tasks are enqueued and threads are starved for work. In the case of the pipeline, as long as any tasks remain, be they I/O or compute class, the entire thread pool remains intact. When threads starve for work, they suspend, when new work arrives, they are awoken.

Rewriting the TBB sample upcase words test and running on a 4 core Q6600 (XP x64). 1,000,000 lines of "the quick brown fox jumped over the lazy grey dog's back nnn.... " where nnn... is a sequence number to assure the upcased outputbuffers were collated properly(file size is 64.345MB)

[cpp]SerialRunUpcaseWordsTest =   5.440969 seconds
ParallelRunUpcaseWordsTest = 1.415301 seconds
Serial/Parallel = 3.844389

[/cpp]

BTW, the I/O buffer sizes are 10,000 bytes (same as in TBBexample). A larger buffer might have yielded better performance (i.e. multiple of sector size). I have not run this against the equivilent TBB sample yet.

(add files does not seem to open a browse, else I would have attached the QuickThread version of the parallel pipeline upcase words test for you to see the difference in programming style)

Jim Dempsey

Sounds interesting Jim. So is QuickThread an enhancement to TBB or an additional component? What's the availability of QuickThread?

jimdempseyatthecove · ‎06-06-2009

No, it is a seperate threading tool with its own thread scheduler, templates, headers, etc...
if you Google

QuickThread site:intel.com

You will find a .PDF file uploaded several months ago.

Jim

Alexey-Kukanov · ‎06-06-2009

Quoting - James Grimplin

Yes, I forgot to mention that we are using the MKL, so it is probably generating some threads...but that's a lot of threads. The TBB thread count you mention is generally what I would have expected.

Anyway, it does not explain why the 8 core machine is being utilized at 60% to perform the same job that consumes 60% of a 2 core system.

MKL uses OpenMP, which was not designed for compatibility with other threading packages. In OpenMP, each parallel region is executed by the "team" of threads. By default, if you have two parallel regions started concurrently by different threads, each one will be served by a separate OpenMP thread team. If each TBB worker thread calls an MKL function parallelized with OpenMP, there are 64 threads in the app (TBB, on the contrary, uses singlethread pool, so if you called TBB algorithms from an OpenMP region, there would be 15 threads in total -the main thread, 7 OpenMP workers, and 7 TBB workers).
In addition, in Intel's OpenMP implementation by default threads do not park immediately after the parallel region ends, in the assumption that another one can start right after. So they idle-spin for a little. This behavior is controlled by an environment variable and possibly an API function as well. I do not know whether MKL zeroes this active wait time for OpenMP or not.

It might make sense to use single-threaded version of MKL, unless the outer-level parallelism exploited with TBB is not enough to utilize all processors and/or to provide proper load balancing. You also might reduce the number of threads in every OpenMP team (e.g. to 2), to minimize oversubscription. Generally, I would recommend you to contact Intel Premier Support for help with using MKL functions from TBB.

As for the CPU utilization: the percentage would not necessarily reduce with increased number of cores; I would expect the execution time to decrease instead. If you talk about idle-spinning time only, then I agree CPU utilization should decrease; but the behavior I described above might be responsible for additional idle spinning.

James_Grimplin · ‎06-08-2009

"As for the CPU utilization: the percentage would not necessarily reduce with increased number of cores; I would expect the execution time to decrease instead. If you talk about idle-spinning time only, then I agree CPU utilization should decrease; but the behavior I described above might be responsible for additional idle spinning."

Alexey, thanks for some insight into MKL. We will attempt to place some control on MKL's configuration and see what happens relative to the threads. However, I am not following your comment on the system utilization. I originally had 2 cores being consumed for a total of 60%, generating a result on an input data packet in about 150 ms. I move to 8 cores and now I have 8 cores being consumed at 60% (of the total system) and still generating results in about 150 ms. If you are implying that a core's utilization should not change much, I would generally agree (depending on how mcuh MKL is decomposing the work it is being given) . But to imply that each core is being consumed at 60%and the final result is the same as on a 2 core machine, is not adding up. I'm using 4x the processing of the Core 2 Duo and achieving the same result?

BTW, The algorithmic portion of this app is a port of existing code that used MKL on a 3 machine cluster of dual Intel Itaniums (SGI Altix). When it runs is uses about 20% of the total CPU resources available (at least in one operational mode).

jimdempseyatthecove · ‎06-08-2009

Quoting - James Grimplin

"As for the CPU utilization: the percentage would not necessarily reduce with increased number of cores; I would expect the execution time to decrease instead. If you talk about idle-spinning time only, then I agree CPU utilization should decrease; but the behavior I described above might be responsible for additional idle spinning."

Alexey, thanks for some insight into MKL. We will attempt to place some control on MKL's configuration and see what happens relative to the threads. However, I am not following your comment on the system utilization. I originally had 2 cores being consumed for a total of 60%, generating a result on an input data packet in about 150 ms. I move to 8 cores and now I have 8 cores being consumed at 60% (of the total system) and still generating results in about 150 ms. If you are implying that a core's utilization should not change much, I would generally agree (depending on how mcuh MKL is decomposing the work it is being given) . But to imply that each core is being consumed at 60%and the final result is the same as on a 2 core machine, is not adding up. I'm using 4x the processing of the Core 2 Duo and achieving the same result?

BTW, The algorithmic portion of this app is a port of existing code that used MKL on a 3 machine cluster of dual Intel Itaniums (SGI Altix). When it runs is uses about 20% of the total CPU resources available (at least in one operational mode).

James,

I think you missed Alexey's remark regarding the OpenMP block time.

If, for a sillyexample, the block time is set for 200ms, and you run an #pragma omp, then #pragma omp for with a qualifier on it that results in one thread using the for then here is what you have

8 threads started to run the parallel region, one thread selected to run the loop, seven of the threads skip over the loop and due to block time of 200ms, will sit burning CPU for up to 200ms while waiting for thread processing loop completes. Should the loop take longer than 200ms (not likely) then your system lost 7 x 200ms of computation time. Should the loop take 1ms, and if the time spent by the controling thread is 10ms before it returns to the same section of code (without interviening parallel regions), then those seven other threads will burn those 11ms doing nothing. Upon entry to the parallel region, those seven threads will have their block time reset to 200ms.

So in this obscure hypothetical case, you will burn 100% of eight cores when the work was actualy being done with one core.

This is not to say blocking time isn't important. Correct use of blocking time results in greater throughput for the application (at the expense of available CPU time for other applications).

http://comp.astro.ku.dk/Twiki/view/CompAstro/IntelOpenMP?skin=print.pattern

Here is a link to a short answer.

I do not know why this is not so easy to find in the documentation.

Jim Dempsey