- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
First, the description of the behavior is basically that the CPU (or a core) is kept busy doing nothing when the pipeline should be idle as there is no input data available and all processing for the previous data is complete. However, this excessive utilization only occurres on every other input data.If the data rate is slow enoughone can watch the CPU utilization bounce from 0 to 50 (on my Core 2 Duo).
I have included the slightly modified text_filter.cpp to illustrate the issue. It was modified to allow more time between iterations (so you can actually see the CPU activity) and to not run in serial mode first. Note that this is the same behavior we see in our application and my input data rate is being controlled externally (socket interface) and not with artifical delays.
Any thoughts on why the pipeline is behaving in this manner?
Modification summary:
- added #include "tbb/tbb_thread.h"
- added in MyOutputFilter::operator():
tbb::this_tbb_thread::sleep(tbb::tick_count::interval_t (10.0));
printf("Done\n");
- commented out serial run in main()
Here is the full modified text_filter.cpp example code:
#include "tbb/pipeline.h"
#include "tbb/tick_count.h"
#include "tbb/task_scheduler_init.h"
#include "tbb/tbb_thread.h"
#include
#include
#include
#include
using namespace std;
//! Buffer that holds block of characters and last character of previous buffer.
class MyBuffer {
static const size_t buffer_size = 10000;
char* my_end;
//! storage[0] holds the last character of the previous buffer.
char storage[1+buffer_size];
public:
//! Pointer to first character in the buffer
char* begin() {return storage+1;}
const char* begin() const {return storage+1;}
//! Pointer to one past last character in the buffer
char* end() const {return my_end;}
//! Set end of buffer.
void set_end( char* new_ptr ) {my_end=new_ptr;}
//! Number of bytes a buffer can hold
size_t max_size() const {return buffer_size;}
//! Number of bytes appended to buffer.
size_t size() const {return my_end-begin();}
};
static const char* InputFileName = "input.txt";
static const char* OutputFileName = "output.txt";
class MyInputFilter: public tbb::filter {
public:
static const size_t n_buffer = 8;
MyInputFilter( FILE* input_file_ );
private:
FILE* input_file;
size_t next_buffer;
char last_char_of_previous_buffer;
MyBuffer buffer[n_buffer];
/*override*/ void* operator()(void*);
};
MyInputFilter::MyInputFilter( FILE* input_file_ ) :
filter(serial_in_order),
next_buffer(0),
input_file(input_file_),
last_char_of_previous_buffer(' ')
{
}
void* MyInputFilter::operator()(void*) {
MyBuffer& b = buffer[next_buffer];
next_buffer = (next_buffer+1) % n_buffer;
size_t n = fread( b.begin(), 1, b.max_size(), input_file );
if( !n ) {
// end of file
return NULL;
} else {
b.begin()[-1] = last_char_of_previous_buffer;
last_char_of_previous_buffer = b.begin()[n-1];
b.set_end( b.begin()+n );
return &b;
}
}
//! Filter that changes the first letter of each word from lower case to upper case.
class MyTransformFilter: public tbb::filter {
public:
MyTransformFilter();
/*override*/void* operator()( void* item );
};
MyTransformFilter::MyTransformFilter() :
tbb::filter(parallel)
{}
/*override*/void* MyTransformFilter::operator()( void* item ) {
MyBuffer& b = *static_cast
int prev_char_is_space = b.begin()[-1]==' ';
for( char* s=b.begin(); s!=b.end(); ++s ) {
if( prev_char_is_space && islower((unsigned char)*s) )
*s = toupper(*s);
prev_char_is_space = isspace((unsigned char)*s);
}
return &b;
}
//! Filter that writes each buffer to a file.
class MyOutputFilter: public tbb::filter {
FILE* my_output_file;
public:
MyOutputFilter( FILE* output_file );
/*override*/void* operator()( void* item );
};
MyOutputFilter::MyOutputFilter( FILE* output_file ) :
tbb::filter(serial_in_order),
my_output_file(output_file)
{
}
void* MyOutputFilter::operator()( void* item ) {
MyBuffer& b = *static_cast
int n = fwrite( b.begin(), 1, b.size(), my_output_file );
if( n<=0 ) {
fprintf(stderr,"Can't write into %s file\n", OutputFileName);
exit(1);
}
tbb::this_tbb_thread::sleep(tbb::tick_count::interval_t (10.0));
printf("Done\n");
return NULL;
}
static int NThread = tbb::task_scheduler_init::automatic;
static bool is_number_of_threads_set = false;
void Usage()
{
fprintf( stderr, "Usage:\ttext_filter [input-file [output-file [nthread]]]\n");
}
int ParseCommandLine( int argc, char* argv[] ) {
// Parse command line
if( argc> 4 ){
Usage();
return 0;
}
if( argc>=2 ) InputFileName = argv[1];
if( argc>=3 ) OutputFileName = argv[2];
if( argc>=4 ) {
NThread = strtol(argv[3],0,0);
if( NThread<1 ) {
fprintf(stderr,"nthread set to %d, but must be at least 1\n",NThread);
return 0;
}
is_number_of_threads_set = true; //Number of threads is set explicitly
}
return 1;
}
int run_pipeline( int nthreads )
{
FILE* input_file = fopen(InputFileName,"r");
if( !input_file ) {
perror( InputFileName );
Usage();
return 0;
}
FILE* output_file = fopen(OutputFileName,"w");
if( !output_file ) {
perror( OutputFileName );
return 0;
}
// Create the pipeline
tbb::pipeline pipeline;
// Create file-reading writing stage and add it to the pipeline
MyInputFilter input_filter( input_file );
pipeline.add_filter( input_filter );
// Create capitalization stage and add it to the pipeline
MyTransformFilter transform_filter;
pipeline.add_filter( transform_filter );
// Create file-writing stage and add it to the pipeline
MyOutputFilter output_filter( output_file );
pipeline.add_filter( output_filter );
// Run the pipeline
tbb::tick_count t0 = tbb::tick_count::now();
pipeline.run( MyInputFilter::n_buffer );
tbb::tick_count t1 = tbb::tick_count::now();
// Remove filters from pipeline before they are implicitly destroyed.
pipeline.clear();
fclose( output_file );
fclose( input_file );
if (is_number_of_threads_set) {
printf("threads = %d time = %g\n", nthreads, (t1-t0).seconds());
} else {
if ( nthreads == 1 ){
printf("single thread run time = %g\n", (t1-t0).seconds());
} else {
printf("parallel run time = %g\n", (t1-t0).seconds());
}
}
return 1;
}
int main( int argc, char* argv[] ) {
if(!ParseCommandLine( argc, argv ))
return 1;
if (is_number_of_threads_set) {
// Start task scheduler
tbb::task_scheduler_init init( NThread );
if(!run_pipeline (NThread))
return 1;
} else { // Number of threads wasn't set explicitly. Run single-thread and fully subscribed parallel versions
{ // single-threaded run
//tbb::task_scheduler_init init_serial(1);
//if(!run_pipeline (1))
// return 1;
}
{ // parallel run (number of threads is selected automatically)
tbb::task_scheduler_init init_parallel;
if(!run_pipeline (0))
return 1;
}
}
return 0;
}
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Therefore ifa worker thread executes the input stage and get blocked there, the master thread will still do busy-waiting. But if the master thread blocks, all workers will fall asleep. I think that explains the "instability" of timing you got.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
TBB and blocking don't mix very well. Maybe you should consider waiting for a limited time only (zero or non-zero), and then restarting the pipeline when new data is detected, if you can suffer the extra latency associated with draining the pipeline (average and maximum). Depending on the statistical characteristics of the input, some tuning of the input time limit may be required for best results. Even if you aren't looking at energy consumption, a stalled pipeline should be stopped to prevent different degrees of starvation elsewhere in the program.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Well that is very interesting information and it does explain the behavior we are seeing. However, I would have to say that although this behavior may be expected, given insight into the implementation, it is a shortcoming of the implementation. I have not looked at the implementation so I can not comment on why, or even how, it does what it does, but I believe that TBB needs deal with blocking (I/O in particular)for it to be a successful abstraction allievating the programmer of the details of threaded programming. The pipeline abstraction appeared to be a very good match for our number crunching application allievating us of the glue and housekeeping code of a traditional threaded approach. It did in fact greatly reduce the amount of code we had to generate, but unless I can find a workaround I fear that we are going to have to reimplement the application in the traditional manner wherewe can insure that idle tasks (threads) are in fact idle.
Starting and stopping the pipeline is not a reasonable approach from my perspective. Again, it basically complicates what seemed to be an elegant solution. I could easily cause the pipeline to terminate, but all of the additional changes that would be required to deal with the input interface would be complicated by the fact that things are starting up and terminating repeatedly.
Any other ideas on a workaround?
Thanks.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Then where does the blocking arise?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Then where does the blocking arise?
Blocking occurres on the input. I do not control the rate at which data arrives. It's variable and we block (via a select) on a socket waiting for the arrival of the data. Generally two "packets" of data will arrive one after the other and we then can process them concurrently (minimizing latency) before sending the processed data on to the final output stage, which is also a socket interface. The data processing is pure number crunching. It could be further decomposed into multiple pipeline stages, but initially we choose not to rework the algorithm until we determined if the pipeline approach would work in our aplication. Future uses of the application will provide an increase in the input data rate. I believe the pipeline architecture would allow us to deal with the increased data rate (and increased processing) by either providing the ability to run more processing stages concurrently, and/or allow us to further decompose the processing into multiple stages.
So even if wewere to move the socket interface out of the input stage andimplement multiple stages for the processing, the data feed to the pipeline will still be be variable resulting in a situation where the pipline is idle at times and the master thread is spinning. There is no way to eliminate this variability in our application.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
It's somewhat unfortunate, but it seems difficult to find a generally applicable mechanism that's also at the same time the best performing and the most elegant in each particular situation, so this one doesn't really look unreasonable to me, although I wouldn't be surprised, just interested, if somebody could come up with something better. We're not yet talking about several pipelines getting entangled and starving each other, just a few more lines of code...
"I do not control the rate at which data arrives."
Still, the statistical characteristics are important for deciding on a good solution. For example, if you have long waits between bursts, it makes more sense to drain and stop the pipeline. Otherwise there may be an optimum timeout based on the statistical characteristics of the input, the acceptable latency, the amount of work for each data item, the level of optional parallelism for each work item, the number of cores, and perhaps other parameters. Difficult to say. But please take this only at face value and make up your own mind!
Still, perhaps if a task could say, hey, I'm going to be waiting now, and when I wake up I'm not going to be doing a whole lot to compete with other threads until I finish, so meanwhile you might as well dispatch another worker from among the excess ones, TBB might just be more blocking-friendly than it is now...
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Characteristics of the data are straightforward:
- 2 packets (~10KBytes) received periodically (350 ms interval,shortestperiod today)
- Data rate is generally constant for some period of time (minutes to hours)
- Data rate can slow to a pair of packets every >1 s
- Each input data packet is processed and generates an output data packet of ~2 MBytes
- The processing stage takes ~50 msec to process an input packet into an output packet (3.17 GHz Core 2 Duo)
- The processing of a packet is primarily a sequential operation (it can be divided into stages)
Yes, I could have implemented this as a single thread process that waits for a message, processes it, and finally outputs the processed data before looping back and starting all over. However, that implementation has scaling issues as we move forward in terms of increased data rates and sizes as well as algorithim refinements that may require additional processing time. Additionally, there are other aspects of the application that complicate the solution.
"just a a few more lines of code..."
Well the input interface is a bit more complicated than may be apparent from my simple description. Additionally, there would be more overhead in this approach. I'm not sure if just re-running a cleared pipeline would still incurr the initial startup overhead we have seen, but there is at least some additional burden with the approach although it may be less than the wasted CPU cycles currently being burned. It's not that the approach itself is completely unreasonable, I just think the benefits of the pipeline may be starting to evaporate. The appealing aspects of the pipeline were minimal management and code while being provided concurrency.
"Still, perhaps if a task could say, hey..."
Sorry, I'm not following the though in this paragraph.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
"- The processing stage takes ~50 msec to process an input packet into an output packet (3.17 GHz Core 2 Duo)"
Sorry, I only described the smaller input data packet. The worst case processing timing is ~150 msec per input packet for our large (~100 KBytes) packets, whileour smaller input packet (~10KBytes) can be processed in ~50 msec. Its with the smaller packets, and thus lower processing time, where we see the impact of the pipeline thread spinning since there is significantly more idle time in that scenario.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Those are indeed long waits in computer time. I would still think of running select for 50-100 ms (ballpark, should be corrected by tuning, I suppose) before shutting down the pipeline (hmm, how long does a restart take? what are the latency requirements?). But if arrival times are predictible, they could be used instead of simple fixed timeouts.
"I'm not sure if just re-running a cleared pipeline would still incurr the initial startup overhead we have seen"
I hope you're not confusing pipeline startup with task_scheduler_init startup... a good approach is to have at least one long-lived task_scheduler_init somewhere in main(), i.e., for the lifetime of the process.
"there is at least some additional burden with the approach although it may be less than the wasted CPU cycles currently being burned"
Is that an environmental or electricity cost concern (I'm all for that resp. sympathetic), or related to other processes on the same system? The main thread wouldn't be violently twiddling its thumbs if there were an eligible task somewhere ready for the stealing.
"The appealing aspects of the pipeline were minimal management and code while being provided concurrency."
You might also have a look at parallel_while (which might as well be undeprecated), but I guess it would have the same problem (no time to go and verify that now, though).
"Sorry, I'm not following the though in this paragraph."
That was just about a possible change to TBB itself, although minutes after that I started to wonder what it would do to the transitive relation for the sake of affinity between tasks and caches by way of threads... Another suggestion I would make is to have a way to concurrently revive a pipeline while it is shutting down; I might have a look at that one myself sometime.
OK, a possible workaround to reduce the spinning: when you detect that there is no work pending, check whether you're in the main thread; if you're not, send a dummy through the pipeline, thus giving another thread, hopefully the main thread, the chance of detecting the same condition, at which time you just block.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
OK, a possible workaround to reduce the spinning: when you detect that there is no work pending, check whether you're in the main thread; if you're not, send a dummy through the pipeline, thus giving another thread, hopefully the main thread, the chance of detecting the same condition, at which time you just block.
Sorry for butting in but we have a similar issue with pipeline blocking and your workaround sounds like it fits the bill. JUst some questions.
1) It sounds like if the main thread (the thread that did task_scheduler_init and started the pipeline) blocks on input, then the pipeline wont spin. On the other hand if a worker thread blocks on the input filter first the pipeline will spin. I hope I read your message correctly ?
2) How do we determine whether or not a particular filter is executing in the main thread ? (Sorry if this is a basic question, I cant find an answer in the doc or in the O Reilly book)
3) If we send a sufficiently large number of such dummies (say 1000), can we be sure of the main thread seeing one and blocking ? Is there a way to send a dummy specifically targeted at the main thread ?
My problem is similar to the OP's i.e input blocking. I will post details in another post because I dont want to steal the OP's thread (nopun).
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
That's what I understood from Alexey Kukanov in #1.
"2) How do we determine whether or not a particular filter is executing in the main thread ? (Sorry if this is a basic question, I cant find an answer in the doc or in the O Reilly book)"
Record the main thread's tbb_thread::get_id() for later reference, I suppose.
"3) If we send a sufficiently large number of such dummies (say 1000), can we be sure of the main thread seeing one and blocking ? Is there a way to send a dummy specifically targeted at the main thread ?"
No, and therefore the total busy-waiting may get a lot worse, but I think the likelihood of that happening may be low enough to try this (a necessary condition is that processing of data items be bounded). But just to be sure, when I suggested to "send a dummy through the pipeline", I meant a specially marked pointer or referent (NULL from the input filter is reserved for shutting down the pipeline), that later filters know to ignore, just to give hopefully the main thread a chance to execute the input filter (wait with zero or short timeout, if nothing detected and also the main thread block forever). This is only if the pipeline is the main event loop of the process, of course, otherwise other complications may arise. And note that it's just an idea, I haven't tried it myself yet, and I would still go with my first suggestion instead (shut it down), or try something more radical.
"My problem is similar to the OP's i.e input blocking. I will post details in another post because I dont want to steal the OP's thread (nopun)."
I myself would only look at whether the subject/title covers it (create tasks, not threads). And I'll pretend that you didn't suggest that a thread could get stolen (in TBB, it's the thread that does the stealing). :-)
(Added) It also seems fairly easy to see, unless that's just because I would be mistaken,that for this workaround to be at all beneficial, the amount of time spent on work vs. being blocked must be low enough, going down with the number of threads involved; it may help to use a non-zero timeout.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
James,
I was hoping that Alexy or Robert would be able to help you with your pipeline problem using TBB.
In QuickThread (a tool I wrote) we have two classes of tasks: Compute and I/O so the I/O portions of the pipeline application can be performed by the I/O clases of threads. Also, althoughthe "main" which generally launches the thread pool, but once the pool islaunched, there is no "main" thread, at least from a task scheduling viewpoint. The launching thread, which may be the main thread or other thread launched by the main thread, does have the responsibility of enqueuing the top-level task (or simply assuming the role of the top-level task). Other than that, there is no other distinction. As long as any task are remaining to run, none of thethreads will end. When task stealing cannot find a task to run (and after a tunable spinwait time) the thread suspends on WaitForSingleEvent. Which will occure if athread is waiting and an appropriate task is enqueued. Other than for a short spinwait time you won't see a long burn time when insufficient tasks are enqueued and threads are starved for work. In the case of the pipeline, as long as any tasks remain, be they I/O or compute class, the entire thread pool remains intact. When threads starve for work, they suspend, when new work arrives, they are awoken.
Rewriting the TBB sample upcase words test and running on a 4 core Q6600 (XP x64). 1,000,000 lines of "the quick brown fox jumped over the lazy grey dog's back nnn.... " where nnn... is a sequence number to assure the upcased outputbuffers were collated properly(file size is 64.345MB)
[cpp]SerialRunUpcaseWordsTest = 5.440969 seconds ParallelRunUpcaseWordsTest = 1.415301 seconds Serial/Parallel = 3.844389 [/cpp]BTW, the I/O buffer sizes are 10,000 bytes (same as in TBBexample). A larger buffer might have yielded better performance (i.e. multiple of sector size). I have not run this against the equivilent TBB sample yet.
(add files does not seem to open a browse, else I would have attached the QuickThread version of the parallel pipeline upcase words test for you to see the difference in programming style)
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
If you initialize TBB by default (with no explicit arguments passed to task_scheduler_init), it will create 1 worker thread on Intel Core 2 Duo processor, and 7 worker threads on the second system you mentioned.
My guess about ~60 threads is that you use another parallel library, such as OpenMP, either implicitly or explicitly, and it initializes its own thread pool for each thread that uses it. Thus the total number of threads is around square of the number of cores.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
If you initialize TBB by default (with no explicit arguments passed to task_scheduler_init), it will create 1 worker thread on Intel Core 2 Duo processor, and 7 worker threads on the second system you mentioned.
My guess about ~60 threads is that you use another parallel library, such as OpenMP, either implicitly or explicitly, and it initializes its own thread pool for each thread that uses it. Thus the total number of threads is around square of the number of cores.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
James,
I was hoping that Alexy or Robert would be able to help you with your pipeline problem using TBB.
In QuickThread (a tool I wrote) we have two classes of tasks: Compute and I/O so the I/O portions of the pipeline application can be performed by the I/O clases of threads. Also, althoughthe "main" which generally launches the thread pool, but once the pool islaunched, there is no "main" thread, at least from a task scheduling viewpoint. The launching thread, which may be the main thread or other thread launched by the main thread, does have the responsibility of enqueuing the top-level task (or simply assuming the role of the top-level task). Other than that, there is no other distinction. As long as any task are remaining to run, none of thethreads will end. When task stealing cannot find a task to run (and after a tunable spinwait time) the thread suspends on WaitForSingleEvent. Which will occure if athread is waiting and an appropriate task is enqueued. Other than for a short spinwait time you won't see a long burn time when insufficient tasks are enqueued and threads are starved for work. In the case of the pipeline, as long as any tasks remain, be they I/O or compute class, the entire thread pool remains intact. When threads starve for work, they suspend, when new work arrives, they are awoken.
Rewriting the TBB sample upcase words test and running on a 4 core Q6600 (XP x64). 1,000,000 lines of "the quick brown fox jumped over the lazy grey dog's back nnn.... " where nnn... is a sequence number to assure the upcased outputbuffers were collated properly(file size is 64.345MB)
[cpp]SerialRunUpcaseWordsTest = 5.440969 seconds ParallelRunUpcaseWordsTest = 1.415301 seconds Serial/Parallel = 3.844389 [/cpp]BTW, the I/O buffer sizes are 10,000 bytes (same as in TBBexample). A larger buffer might have yielded better performance (i.e. multiple of sector size). I have not run this against the equivilent TBB sample yet.
(add files does not seem to open a browse, else I would have attached the QuickThread version of the parallel pipeline upcase words test for you to see the difference in programming style)
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
No, it is a seperate threading tool with its own thread scheduler, templates, headers, etc...
if you Google
QuickThread site:intel.com
You will find a .PDF file uploaded several months ago.
Jim
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
MKL uses OpenMP, which was not designed for compatibility with other threading packages. In OpenMP, each parallel region is executed by the "team" of threads. By default, if you have two parallel regions started concurrently by different threads, each one will be served by a separate OpenMP thread team. If each TBB worker thread calls an MKL function parallelized with OpenMP, there are 64 threads in the app (TBB, on the contrary, uses singlethread pool, so if you called TBB algorithms from an OpenMP region, there would be 15 threads in total -the main thread, 7 OpenMP workers, and 7 TBB workers).
In addition, in Intel's OpenMP implementation by default threads do not park immediately after the parallel region ends, in the assumption that another one can start right after. So they idle-spin for a little. This behavior is controlled by an environment variable and possibly an API function as well. I do not know whether MKL zeroes this active wait time for OpenMP or not.
It might make sense to use single-threaded version of MKL, unless the outer-level parallelism exploited with TBB is not enough to utilize all processors and/or to provide proper load balancing. You also might reduce the number of threads in every OpenMP team (e.g. to 2), to minimize oversubscription. Generally, I would recommend you to contact Intel Premier Support for help with using MKL functions from TBB.
As for the CPU utilization: the percentage would not necessarily reduce with increased number of cores; I would expect the execution time to decrease instead. If you talk about idle-spinning time only, then I agree CPU utilization should decrease; but the behavior I described above might be responsible for additional idle spinning.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
"As for the CPU utilization: the percentage would not necessarily reduce with increased number of cores; I would expect the execution time to decrease instead. If you talk about idle-spinning time only, then I agree CPU utilization should decrease; but the behavior I described above might be responsible for additional idle spinning."
Alexey, thanks for some insight into MKL. We will attempt to place some control on MKL's configuration and see what happens relative to the threads. However, I am not following your comment on the system utilization. I originally had 2 cores being consumed for a total of 60%, generating a result on an input data packet in about 150 ms. I move to 8 cores and now I have 8 cores being consumed at 60% (of the total system) and still generating results in about 150 ms. If you are implying that a core's utilization should not change much, I would generally agree (depending on how mcuh MKL is decomposing the work it is being given) . But to imply that each core is being consumed at 60%and the final result is the same as on a 2 core machine, is not adding up. I'm using 4x the processing of the Core 2 Duo and achieving the same result?
BTW, The algorithmic portion of this app is a port of existing code that used MKL on a 3 machine cluster of dual Intel Itaniums (SGI Altix). When it runs is uses about 20% of the total CPU resources available (at least in one operational mode).
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
"As for the CPU utilization: the percentage would not necessarily reduce with increased number of cores; I would expect the execution time to decrease instead. If you talk about idle-spinning time only, then I agree CPU utilization should decrease; but the behavior I described above might be responsible for additional idle spinning."
Alexey, thanks for some insight into MKL. We will attempt to place some control on MKL's configuration and see what happens relative to the threads. However, I am not following your comment on the system utilization. I originally had 2 cores being consumed for a total of 60%, generating a result on an input data packet in about 150 ms. I move to 8 cores and now I have 8 cores being consumed at 60% (of the total system) and still generating results in about 150 ms. If you are implying that a core's utilization should not change much, I would generally agree (depending on how mcuh MKL is decomposing the work it is being given) . But to imply that each core is being consumed at 60%and the final result is the same as on a 2 core machine, is not adding up. I'm using 4x the processing of the Core 2 Duo and achieving the same result?
BTW, The algorithmic portion of this app is a port of existing code that used MKL on a 3 machine cluster of dual Intel Itaniums (SGI Altix). When it runs is uses about 20% of the total CPU resources available (at least in one operational mode).
James,
I think you missed Alexey's remark regarding the OpenMP block time.
If, for a sillyexample, the block time is set for 200ms, and you run an #pragma omp, then #pragma omp for with a qualifier on it that results in one thread using the for then here is what you have
8 threads started to run the parallel region, one thread selected to run the loop, seven of the threads skip over the loop and due to block time of 200ms, will sit burning CPU for up to 200ms while waiting for thread processing loop completes. Should the loop take longer than 200ms (not likely) then your system lost 7 x 200ms of computation time. Should the loop take 1ms, and if the time spent by the controling thread is 10ms before it returns to the same section of code (without interviening parallel regions), then those seven other threads will burn those 11ms doing nothing. Upon entry to the parallel region, those seven threads will have their block time reset to 200ms.
So in this obscure hypothetical case, you will burn 100% of eight cores when the work was actualy being done with one core.
This is not to say blocking time isn't important. Correct use of blocking time results in greater throughput for the application (at the expense of available CPU time for other applications).
http://comp.astro.ku.dk/Twiki/view/CompAstro/IntelOpenMP?skin=print.pattern
Here is a link to a short answer.
I do not know why this is not so easy to find in the documentation.
Jim Dempsey
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page