Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

tbb::pipeline instance using excessive CPU when idle

James_Grimplin
Beginner
We noticed that our application, which is built on the tbb::pipeline class, was using excessive processing cycles in an interesting and predictable manner. After attempting to localize the problem in our code without success, we tried to duplicate the problem with the text_filter example provided in the distribution, and we succeeded.

First, a description of the behavior: the CPU (or a core) is kept busy doing nothing when the pipeline should be idle, i.e., when there is no input data available and all processing for the previous data is complete. However, this excessive utilization only occurs on every other input item. If the data rate is slow enough, one can watch the CPU utilization bounce from 0 to 50% (on my Core 2 Duo).

I have included the slightly modified text_filter.cpp to illustrate the issue. It was modified to allow more time between iterations (so you can actually see the CPU activity) and to skip the serial run. Note that this is the same behavior we see in our application, where the input data rate is controlled externally (a socket interface) rather than with artificial delays.

Any thoughts on why the pipeline is behaving in this manner?

Modification summary:
- added #include "tbb/tbb_thread.h"
- added in MyOutputFilter::operator():
tbb::this_tbb_thread::sleep(tbb::tick_count::interval_t (10.0));
printf("Done\n");
- commented out serial run in main()

Here is the full modified text_filter.cpp example code:

#include "tbb/pipeline.h"
#include "tbb/tick_count.h"
#include "tbb/task_scheduler_init.h"
#include "tbb/tbb_thread.h"
#include <cstring>
#include <cstdlib>
#include <cstdio>
#include <cctype>

using namespace std;

//! Buffer that holds block of characters and last character of previous buffer.
class MyBuffer {
static const size_t buffer_size = 10000;
char* my_end;
//! storage[0] holds the last character of the previous buffer.
char storage[1+buffer_size];
public:
//! Pointer to first character in the buffer
char* begin() {return storage+1;}
const char* begin() const {return storage+1;}
//! Pointer to one past last character in the buffer
char* end() const {return my_end;}
//! Set end of buffer.
void set_end( char* new_ptr ) {my_end=new_ptr;}
//! Number of bytes a buffer can hold
size_t max_size() const {return buffer_size;}
//! Number of bytes appended to buffer.
size_t size() const {return my_end-begin();}
};

static const char* InputFileName = "input.txt";
static const char* OutputFileName = "output.txt";

class MyInputFilter: public tbb::filter {
public:
static const size_t n_buffer = 8;
MyInputFilter( FILE* input_file_ );
private:
FILE* input_file;
size_t next_buffer;
char last_char_of_previous_buffer;
MyBuffer buffer[n_buffer];
/*override*/ void* operator()(void*);
};

MyInputFilter::MyInputFilter( FILE* input_file_ ) :
filter(serial_in_order),
next_buffer(0),
input_file(input_file_),
last_char_of_previous_buffer(' ')
{
}

void* MyInputFilter::operator()(void*) {
MyBuffer& b = buffer[next_buffer];
next_buffer = (next_buffer+1) % n_buffer;
size_t n = fread( b.begin(), 1, b.max_size(), input_file );
if( !n ) {
// end of file
return NULL;
} else {
b.begin()[-1] = last_char_of_previous_buffer;
last_char_of_previous_buffer = b.begin()[n-1];
b.set_end( b.begin()+n );
return &b;
}
}

//! Filter that changes the first letter of each word from lower case to upper case.
class MyTransformFilter: public tbb::filter {
public:
MyTransformFilter();
/*override*/void* operator()( void* item );
};

MyTransformFilter::MyTransformFilter() :
tbb::filter(parallel)
{}

/*override*/void* MyTransformFilter::operator()( void* item ) {
MyBuffer& b = *static_cast<MyBuffer*>(item);
int prev_char_is_space = b.begin()[-1]==' ';
for( char* s=b.begin(); s!=b.end(); ++s ) {
if( prev_char_is_space && islower((unsigned char)*s) )
*s = toupper(*s);
prev_char_is_space = isspace((unsigned char)*s);
}
return &b;
}

//! Filter that writes each buffer to a file.
class MyOutputFilter: public tbb::filter {
FILE* my_output_file;
public:
MyOutputFilter( FILE* output_file );
/*override*/void* operator()( void* item );
};

MyOutputFilter::MyOutputFilter( FILE* output_file ) :
tbb::filter(serial_in_order),
my_output_file(output_file)
{
}

void* MyOutputFilter::operator()( void* item ) {
MyBuffer& b = *static_cast<MyBuffer*>(item);
int n = fwrite( b.begin(), 1, b.size(), my_output_file );
if( n<=0 ) {
fprintf(stderr,"Can't write into %s file\n", OutputFileName);
exit(1);
}
tbb::this_tbb_thread::sleep(tbb::tick_count::interval_t (10.0));
printf("Done\n");
return NULL;
}

static int NThread = tbb::task_scheduler_init::automatic;
static bool is_number_of_threads_set = false;

void Usage()
{
fprintf( stderr, "Usage:\ttext_filter [input-file [output-file [nthread]]]\n");
}

int ParseCommandLine( int argc, char* argv[] ) {
// Parse command line
if( argc> 4 ){
Usage();
return 0;
}
if( argc>=2 ) InputFileName = argv[1];
if( argc>=3 ) OutputFileName = argv[2];
if( argc>=4 ) {
NThread = strtol(argv[3],0,0);
if( NThread<1 ) {
fprintf(stderr,"nthread set to %d, but must be at least 1\n",NThread);
return 0;
}
is_number_of_threads_set = true; //Number of threads is set explicitly
}
return 1;
}

int run_pipeline( int nthreads )
{
FILE* input_file = fopen(InputFileName,"r");
if( !input_file ) {
perror( InputFileName );
Usage();
return 0;
}
FILE* output_file = fopen(OutputFileName,"w");
if( !output_file ) {
perror( OutputFileName );
return 0;
}

// Create the pipeline
tbb::pipeline pipeline;

// Create file-reading stage and add it to the pipeline
MyInputFilter input_filter( input_file );
pipeline.add_filter( input_filter );

// Create capitalization stage and add it to the pipeline
MyTransformFilter transform_filter;
pipeline.add_filter( transform_filter );

// Create file-writing stage and add it to the pipeline
MyOutputFilter output_filter( output_file );
pipeline.add_filter( output_filter );

// Run the pipeline
tbb::tick_count t0 = tbb::tick_count::now();
pipeline.run( MyInputFilter::n_buffer );
tbb::tick_count t1 = tbb::tick_count::now();

// Remove filters from pipeline before they are implicitly destroyed.
pipeline.clear();

fclose( output_file );
fclose( input_file );

if (is_number_of_threads_set) {
printf("threads = %d time = %g\n", nthreads, (t1-t0).seconds());
} else {
if ( nthreads == 1 ){
printf("single thread run time = %g\n", (t1-t0).seconds());
} else {
printf("parallel run time = %g\n", (t1-t0).seconds());
}
}
return 1;
}

int main( int argc, char* argv[] ) {
if(!ParseCommandLine( argc, argv ))
return 1;
if (is_number_of_threads_set) {
// Start task scheduler
tbb::task_scheduler_init init( NThread );
if(!run_pipeline (NThread))
return 1;
} else { // Number of threads wasn't set explicitly. Run single-thread and fully subscribed parallel versions
{ // single-threaded run
//tbb::task_scheduler_init init_serial(1);
//if(!run_pipeline (1))
// return 1;
}
{ // parallel run (number of threads is selected automatically)
tbb::task_scheduler_init init_parallel;
if(!run_pipeline (0))
return 1;
}
}
return 0;
}
James_Grimplin
Beginner

"I think you missed Alexey's remark regarding the OpenMP block time."

Well, I did not really miss it; I just did not think MKL would take it to this extreme. We have run some more tests, and here is the summary:

10+ KB packet, MKL_NUM_THREADS=1 => 9 threads, 10% utilization, 75 ms/packet
10+ KB packet, MKL_NUM_THREADS=2 => 18 threads, 22% utilization, 79 ms/packet
10+ KB packet, MKL_NUM_THREADS=4 => 34 threads, 40% utilization, 80 ms/packet
10+ KB packet, MKL_NUM_THREADS=8 => 66 threads, 65% utilization, 80 ms/packet

100 KB packet, MKL_NUM_THREADS=1 => 9 threads, 12% utilization, 146 ms/packet
100 KB packet, MKL_NUM_THREADS=2 => 18 threads, 25% utilization, 150 ms/packet
100 KB packet, MKL_NUM_THREADS=4 => 34 threads, 45% utilization, 150 ms/packet

We also tried the KMP_LIBRARY and KMP_BLOCKTIME environment variables, but they did not seem to have any effect. It could be that we are specifying them incorrectly. I could not find anything in the Intel MKL documentation related to these two environment variables (yet).

So it appears that in this latest configuration, MKL is the major contributor to idle spinning and not TBB.

For our application, it does not appear that MKL (via OpenMP) parallelization in conjunction with TBB is effective at improving overall performance.

robert-reed
Valued Contributor II
Quoting - James Grimplin
We have run some more tests and here is the summary:

10+ KB packet, MKL_NUM_THREADS=1 => 9 threads, 10% utilization, 75 ms/packet
10+ KB packet, MKL_NUM_THREADS=2 => 18 threads, 22% utilization, 79 ms/packet

[snip]

So it appears that in this latest configuration, MKL is the major contributor to idle spinning and not TBB.

A couple of observations here. Not all of MKL is threaded; MKL uses threading for those operations where parallel code is expected to gain performance. I didn't see anything specific about which part of MKL is being used in this test, but the time-per-packet numbers suggest that the portion of MKL being employed runs serially.

The other observation regards Jim's comments on KMP_BLOCKTIME. First, it is a KMP (Intel extension) option, not an OpenMP option, so it's probably not something used by MKL. Also, Jim suggests Intel OpenMP "...will sit burning CPU for up to 200ms while waiting for thread processing loop completes...." It's actually a little worse than that. The KMP_BLOCKTIME interval starts when the threaded loop completes; that is, after all the threads have made their rendezvous following the loop. Its purpose is to keep threads from paying the penalty of going to sleep and then waking up in regions of code comprising several parallel sections separated by short bursts of serial code. And the default, last time I checked, was 200 ms, which is a long time these days. With the speed of modern computers, I've recommended a setting of 1, since 1 ms is sufficient time to hide lots of serial code. At least one application I know of uses a KMP_BLOCKTIME setting of 1 in a couple of places where they have chains of OpenMP serial code, and 0 otherwise (to keep other parallel pools like TBB from yielding while OpenMP spins).

jimdempseyatthecove
Honored Contributor III

>>The KMP_BLOCKTIME interval starts when the threaded loop completes

!?!?!

DoDinkyFunctionAllThreads();

DoOtherDinkyFunctionAllThreads();


Note, the above does not have a barrier after the parallel for loop.
Six of the 8 threads will do back-to-back function calls (bypassing the for loop).
What you said amounts to those six threads burning cycles for 10 minutes.
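(For illustration, a speculative reconstruction of the pattern being described; the OpenMP pragmas appear to have been lost from the original post, and DoDinkyWork is a hypothetical stand-in for the loop body.)

#include <omp.h>

void DoDinkyWork(int) {}
void DoDinkyFunctionAllThreads() {}
void DoOtherDinkyFunctionAllThreads() {}

int main() {
#pragma omp parallel num_threads(8)
    {
#pragma omp for nowait            // no implicit barrier at the end of this loop
        for( int i=0; i<2; ++i )  // only a couple of the 8 threads get iterations
            DoDinkyWork(i);

        // Threads that received no iterations fall straight through and run
        // these back to back, with no barrier separating them from the loop.
        DoDinkyFunctionAllThreads();
        DoOtherDinkyFunctionAllThreads();
    }
    return 0;
}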

I believe the 200 ms timer gets set in one of two ways:
a) for each thread, when it reaches the end of the parallel region;
b) for each thread, when/as any thread in the team reaches the end of the parallel region
(same for waiting at an internal barrier, etc.).

Those two choices would make sense. This is not to say software vendors do things that make sense.

Note, option b) extends the timeout of all threads (from that team) to 200 ms past the point where the last thread reaches the end of the parallel region. This would seem to make sense.

Jim Dempsey


robert-reed
Valued Contributor II
>>The KMP_BLOCKTIME interval starts when the threaded loop completes

!?!?!

DoDinkyFunctionAllThreads();

DoOtherDinkyFunctionAllThreads();


Note, the above does not have a barrier after the parallel for loop.
Six of the 8 threads will do back-to-back function calls (bypassing the for loop).
What you said amounts to those six threads burning cycles for 10 minutes.

I believe the 200 ms timer gets set in one of two ways:
a) for each thread, when it reaches the end of the parallel region;
b) for each thread, when/as any thread in the team reaches the end of the parallel region
(same for waiting at an internal barrier, etc.).

Sorry, I should have said "at the end of the parallel section," not "when the threaded loop completes" though for cases simpler than Jim's example, these may occur at the same time.

Here's the description from the icl 11.0.074 compiler documentation for the KMP_LIBRARY "throughput" mode:

The throughput mode allows the program to detect its environment conditions (system load) and adjust resource usage to produce efficient execution in a dynamic environment.

In a multi-user environment where the load on the parallel machine is not constant or where the job stream is not predictable, it may be better to design and tune for throughput. This minimizes the total time to run multiple jobs simultaneously. In this mode, the worker threads yield to other threads while waiting for more parallel work.

After completing the execution of a parallel region, threads wait for new parallel work to become available. After a certain period of time has elapsed, they stop waiting and sleep. Until more parallel work becomes available, sleeping allows processor and resources to be used for other work by non-OpenMP threaded code that may execute between parallel regions, or by other applications.

The amount of time to wait before sleeping is set either by the KMP_BLOCKTIME environment variable or by the kmp_set_blocktime() function. A small blocktime value may offer better overall performance if your application contains non-OpenMP threaded code that executes between parallel regions. A larger blocktime value may be more appropriate if threads are to be reserved solely for use for OpenMP execution, but may penalize other concurrently-running OpenMP or threaded applications.

And the description of KMP_BLOCKTIME:

Sets the time, in milliseconds, that a thread should wait, after completing the execution of a parallel region, before sleeping.

Use the optional character suffixes: s (seconds), m (minutes), h (hours), or d (days) to specify the units.

Specify infinite for an unlimited wait time.

See also the throughput execution mode and the KMP_LIBRARY environment variable.

These descriptions seem to favor Jim's option a): each team thread, upon reaching the end of its parallel section, will wait for KMP_BLOCKTIME (minimum 0, minimum discrete interval 1 ms, default 200 ms) before going to sleep.
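(For reference, a minimal sketch of lowering the wait interval programmatically, assuming the Intel OpenMP runtime that MKL links against; kmp_set_blocktime() is the function named in the documentation excerpt above, and exporting KMP_BLOCKTIME=1 before launching the process has the same effect.)

#include <omp.h>   // Intel's omp.h declares the kmp_* extensions

int main() {
    kmp_set_blocktime(1);   // spin for at most 1 ms after a parallel region,
                            // instead of the 200 ms default, then sleep

#pragma omp parallel
    {
        // ... threaded work, e.g. the parallel portions of MKL ...
    }
    // Within roughly 1 ms of leaving the region the OpenMP workers go to
    // sleep, leaving the cores free for TBB's worker threads.
    return 0;
}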

James_Grimplin
Beginner

A couple of observations here. Not all of MKL is threaded; MKL uses threading for those operations where parallel code is expected to gain performance. I didn't see anything specific about which part of MKL is being used in this test, but the time-per-packet numbers suggest that the portion of MKL being employed runs serially.

The other observation regards Jim's comments on KMP_BLOCKTIME. First, it is a KMP (Intel extension) option, not an OpenMP option, so it's probably not something used by MKL. Also, Jim suggests Intel OpenMP "...will sit burning CPU for up to 200ms while waiting for thread processing loop completes...." It's actually a little worse than that. The KMP_BLOCKTIME interval starts when the threaded loop completes; that is, after all the threads have made their rendezvous following the loop. Its purpose is to keep threads from paying the penalty of going to sleep and then waking up in regions of code comprising several parallel sections separated by short bursts of serial code. And the default, last time I checked, was 200 ms, which is a long time these days. With the speed of modern computers, I've recommended a setting of 1, since 1 ms is sufficient time to hide lots of serial code. At least one application I know of uses a KMP_BLOCKTIME setting of 1 in a couple of places where they have chains of OpenMP serial code, and 0 otherwise (to keep other parallel pools like TBB from yielding while OpenMP spins).


We are not using much of MKL, very little in fact. We are using the dfti and vsl components. I have not evaluated the extent of use of these MKL components at this point (the code already existed).

We did run a test earlier in which we determined that setting KMP_BLOCKTIME to 1 ms or less (we have run at 100 us) greatly improves utilization, eliminating the burning of CPU cycles. We get down to 15-20% utilization in all MKL_NUM_THREADS cases. (The release notes for the Intel compiler briefly mention adjusting KMP_BLOCKTIME; not that I am currently using the Intel compiler, but I could be if it made a difference.)

So we have made headway on the MKL front, but TBB is still burning cycles, about 12% on the eight-core machine.
derbai
Beginner
Hi, I have seen similar issues with a TBB pipeline spinning at 100% CPU when idle.
The first stage of my pipeline pulls data from a blocking queue. Data arrives in bursts (thousands per second), then none for many seconds.

I was interested to know whether any of the ideas suggested above actually worked?
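(For concreteness, a minimal sketch of the kind of blocking input stage described above, using the old tbb::filter API from this thread; Packet and PacketQueue are invented names, not the poster's code.)

#include "tbb/pipeline.h"
#include "tbb/concurrent_queue.h"

struct Packet { /* payload */ };
typedef tbb::concurrent_bounded_queue<Packet*> PacketQueue;

//! Serial input stage that blocks until the producer pushes an item.
class QueueInputFilter: public tbb::filter {
    PacketQueue& my_queue;
public:
    QueueInputFilter( PacketQueue& q ) : tbb::filter(serial_in_order), my_queue(q) {}
    /*override*/ void* operator()(void*) {
        Packet* p = NULL;
        my_queue.pop( p );   // blocks here until data (or a NULL sentinel) arrives
        return p;            // returning NULL ends the pipeline run
    }
};

While pop() blocks, the calling thread sleeps, but the remaining TBB workers may still spin looking for pipeline work, which is the behavior discussed in this thread.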

derbai
Beginner
Quoting - derbai
Hi, I have seen similar issues with a TBB pipeline spinning at 100% CPU when idle.
The first stage of my pipeline pulls data from a blocking queue. Data arrives in bursts (thousands per second), then none for many seconds.

I was interested to know whether any of the ideas suggested above actually worked?


And as a follow up question:

I am thinking about having a number of live pipelines within the same app. Each pipeline would have as its input stage a bursty data source.
Presumably, the approach of sending dummy items down the pipeline until the main thread is located only works in the case of a single pipeline?


RafSchietekat
Valued Contributor III
"Presumably, the approach of sending dummy items down the pipeline until the main thread is located only works in the case of a single pipeline?"
That was just an idea for a possible solution, without any guarantee, and I think I even indicated that it could sometimes exacerbate the problem, so it's up to you to decide whether you want to experiment with it. When running multiple pipelines at the same time, from different tbb_threads (or equivalent), there's an additional problem of starvation of some of the pipelines, and I'm not sure whether that issue has been solved yet (this has been discussed elsewhere in this forum).

What should always work is to loop over finite batches and run each to completion using TBB, but then you have to trade off low latency against good efficiency (bigger batches have higher latency but better efficiency, and vice versa). Left to its own devices, TBB does not provide global progress, and it is intentionally unfair in order to provide the best throughput for finite jobs. Perhaps you could decide the batch size dynamically: monitor TBB for latency (but how, and how scalable would that be?), and when starvation is suspected, cut off all input until the stuck task is sucked through.
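(A sketch of the batch-per-run idea, my construction rather than the poster's code; wait_for_batch(), Batch, BatchInput, and Process are hypothetical stand-ins. Items are collected outside TBB while the pipeline is not running, so TBB workers only spin while a batch is actually being processed.)

#include <vector>
#include "tbb/pipeline.h"
#include "tbb/task_scheduler_init.h"

typedef std::vector<int> Batch;

//! Hypothetical stand-in for the real blocking data source.
bool wait_for_batch( Batch& b ) {
    static int runs = 0;
    if( runs++ ) return false;   // pretend the source shut down after one batch
    b.assign( 100, 1 );
    return true;
}

//! Serial input stage that drains one pre-collected batch, then ends the run.
class BatchInput: public tbb::filter {
    Batch& my_batch;
    size_t my_next;
public:
    BatchInput( Batch& b ) : tbb::filter(serial_in_order), my_batch(b), my_next(0) {}
    /*override*/ void* operator()(void*) {
        if( my_next==my_batch.size() ) return NULL;   // batch drained: run ends
        return &my_batch[my_next++];
    }
};

//! Parallel stage with placeholder work.
class Process: public tbb::filter {
public:
    Process() : tbb::filter(parallel) {}
    /*override*/ void* operator()( void* item ) {
        *static_cast<int*>(item) *= 2;
        return item;
    }
};

int main() {
    tbb::task_scheduler_init init;
    Batch batch;
    while( wait_for_batch(batch) ) {   // the only blocking wait is outside TBB
        BatchInput input(batch);
        Process work;
        tbb::pipeline p;
        p.add_filter( input );
        p.add_filter( work );
        p.run( 8 );                    // run this batch to completion, then return
        p.clear();
    }
    return 0;
}

Bigger batches amortize the cost of starting each run but add latency, which is the trade-off mentioned above.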
Salomon__Uriel
Beginner

We are experiencing the same problem James Grimplin had in 2009. We depend on an input that can change in size, so our algorithm sometimes processes it quickly and sometimes slowly. When it processes it quickly, we see the main thread's system CPU usage spike. Following advice we read in another thread, we made the main thread sleep when the processing finished quickly, thus preventing the TBB scheduler from hogging the CPU. But this is just a hack.

It's been quite a while since this problem was first recognized. Have any options been added to TBB or the scheduler to fix it? Or is there any other, better solution than the one I mentioned above?
