One of my applications uses pipeline processing.
When the input data size is small (below 5K), I set the number of tokens that can be executed in parallel to two. I am executing this application on an eight-core machine. When the number of threads is chosen by TBB automatically, TPS goes down. If I explicitly set the number of threads to two via the tbb::task_scheduler_init object, performance is almost 100% higher than in the previous case.
Now, how can I restrict TBB so that only two specific threads are used for processing (even if eight threads are created automatically)?
Thank You
Yes, in both cases I use
"pipeline.run(NUM_INPUT_TOKENS);"
where NUM_INPUT_TOKENS = 2.
The only difference is that in the scenario where TPS is higher, I explicitly restrict the number of
threads created to two, whereas in the other scenario TBB creates 8 threads (since there are
eight cores). Also, when there are eight threads, the CPU usage is above 90%. Maybe this
is an indication that all the threads are used to execute the two tokens, but only two get
the tasks at a time.
Thank you
But have you tried a higher value for NUM_INPUT_TOKENS, such as default_num_threads or a small multiple of it? If not, is there a reason you didn't that might be relevant to solving the mystery? What is the topology (how many filters, and what kind)?
If that above-90% CPU usage is from a possible 0-800%, then the high below-100 value might be the main thread quite predictably twiddling its thumbs (TBB doesn't want it to go to sleep entirely), otherwise I wouldn't know.
What you want to do is measure your TPS (transactions per second?) under differing circumstances (e.g. 8 threads, 2 tokens, etc., with 6 idle threads).
Note, if the 8 threads are busy when you start your pipeline with 2 tokens, then the pipeline may run using one thread (the one issuing the parallel_pipeline.run), and thus exhibit half the TPS.
For testing, try this
a) init with 8 threads in TBB task pool
b) issue dummy parallel_for that is sufficiently large to use all 8 threads
c) after parallel_for (in main/serial code) issue a Sleep() to suspend for 1 or 2 seconds
d) read timestamp at start
e) run parallel pipeline with 2 tokens
f) read timestamp at end
g) compute time and print out TPS
For test 2, use the test 1 setup but omit step c above (eliminate the sleep)
For test 3
a) init with 8 threads in TBB task pool
b) issue dummy parallel_for that is sufficiently large to use all 8 threads
c) issue parallel_invoke with 2 tasks
c.1) issue dummy parallel_for that is sufficiently large to use all 8 threads
and run several times longer than parallel_pipeline
c.2.a) read timestamp at start
c.2.b) run parallel pipeline with 2 tokens
c.2.c) read timestamp at end
d) compute time and print out TPS
This will provide you with some idea of the variance in TPS due to differing load conditions.
Jim Dempsey
My input size is around 450 bytes. Performance was too low when we executed a pipeline process in which each parallel thread gets roughly 450/8 bytes. Then I found that for this scenario NUM_INPUT_TOKENS = 2 gives significantly higher performance. Then came the observation that if I reduce the number of threads to 2, it further improves the TPS. Now I have a "selector" function which decides the number of input tokens based on the size of the input data. If the number of threads could also be selected like that, I would have done so.
Can you post your code?
Maybe there is something in the code that you think is innocuous but is in fact negatively impacting the scalability of your pipeline.
As Raf says, it seems strange to have somewhat negative scalability.
Jim
I am pasting a sample program below which shows exactly the same problem.
When I executed it I got the following results (on a dual quad-core machine, 2 × Xeon E5640):
TPS (8 Threads,2 Tokens) : 316944
TPS (2 Threads,2 Tokens) : 559019
[cpp]
#include "tbb/tick_count.h"
#include "tbb/pipeline.h"
#include "tbb/task_scheduler_init.h"
#include <iostream>  // header name was eaten by the forum renderer; cout needs it

using namespace std;
using namespace tbb;

#define MAX_INPUT_TOKEN 8

static char data[] = "Beat the 12 eggs, 2 cups sugar and vanilla or lemon extract together. Stir in the ricotta cheese and the chocolate chips. Set aside. Preheat oven to 325 degrees F (165 degrees C). Grease two deep dish pie plates. Combine the flour, baking powder, and 1 cup of the sugar together. Cut in the shortening and mix until the mixture resembles coarse crumbs. Mix in 4 of the eggs and 1 teaspoon of the vanilla. Divide dough into 4 balls and chill (if needed)";

class TextSlice {
    const char* pBuffer;
    size_t bufLen;
    size_t count;
    char charToCount;
public:
    TextSlice() {}
    ~TextSlice() {}
    void setInput(const char* pBuf, size_t len, char charToCnt) {
        pBuffer = pBuf;
        bufLen = len;
        charToCount = charToCnt;
        count = 0;
    }
    void countChar() {
        size_t len = bufLen;
        while (len--) {
            if (pBuffer[len] == charToCount) {
                ++count;
            }
        }
    }
    size_t getCount() { return count; }
};

class MyInputFilter : public filter {
private:
    TextSlice inputTokens[MAX_INPUT_TOKEN];
    const char* pBuffer;
    size_t bufLen;
    size_t unitLen;
    size_t numInputToken;
    char charToBeFound;
    int index;
public:
    MyInputFilter() : filter(serial_in_order) {}
    ~MyInputFilter() {}
    void setInput(const char* pBuf, size_t len, size_t numTokens, char charToFind) {
        index = -1;
        pBuffer = pBuf;
        bufLen = len;
        numInputToken = numTokens;
        charToBeFound = charToFind;
        unitLen = bufLen / numTokens;
    }
    void* operator()(void* pItem) {
        if (bufLen) {
            TextSlice* pChunk = &inputTokens[++index % numInputToken];
            if (bufLen >= unitLen) {
                pChunk->setInput(pBuffer, unitLen, charToBeFound);
                pBuffer += unitLen;
                bufLen -= unitLen;
            } else {
                pChunk->setInput(pBuffer, bufLen, charToBeFound);
                bufLen = 0;
            }
            return pChunk;
        }
        return 0;
    }
};

class MyTransformFilter : public filter {
public:
    MyTransformFilter() : filter(parallel) {}
    ~MyTransformFilter() {}
    void* operator()(void* item) {
        ((TextSlice*)item)->countChar();
        return item;
    }
};

class MyOutputFilter : public filter {
    size_t count;
public:
    MyOutputFilter() : tbb::filter(serial_in_order), count(0) {}  // count was left uninitialised in the original post
    size_t getCount() {
        size_t cnt = count;
        count = 0;
        return cnt;
    }
    void* operator()(void* item) {
        count += ((TextSlice*)item)->getCount();
        return item;
    }
};

double measureTPS(size_t numTokens) {
    MyInputFilter inputFilter;
    MyTransformFilter transformFilter;
    MyOutputFilter outputFilter;
    pipeline pipeline;
    pipeline.add_filter(inputFilter);
    pipeline.add_filter(transformFilter);
    pipeline.add_filter(outputFilter);
    size_t numLoops = 1000000;
    double total = 0;
    for (size_t i = 0; i < numLoops; i++) {
        tick_count t0 = tick_count::now();
        inputFilter.setInput(data, sizeof(data) - 1, numTokens, 'a');
        pipeline.run(numTokens);
        outputFilter.getCount();
        tick_count t1 = tick_count::now();
        total += (t1 - t0).seconds();
    }
    return numLoops / total;
}

void measureTPS_8TH_8TK() {
    task_scheduler_init schInit(8);
    cout << " TPS (8 Threads,8 Tokens) : " << measureTPS(8) << endl;
}

// [The original post was truncated at this point by the forum renderer;
//  the remainder below is a minimal reconstruction matching the benchmark
//  labels reported above the listing.]
void measureTPS_8TH_2TK() {
    task_scheduler_init schInit(8);
    cout << " TPS (8 Threads,2 Tokens) : " << measureTPS(2) << endl;
}

void measureTPS_2TH_2TK() {
    task_scheduler_init schInit(2);
    cout << " TPS (2 Threads,2 Tokens) : " << measureTPS(2) << endl;
}

int main() {
    measureTPS_8TH_2TK();
    measureTPS_2TH_2TK();
    return 0;
}
[/cpp]
I yield the baton to whoever wants to explain the improved result with fewer threads, but maybe we should first also get the benchmark results for 2 threads and 8 tokens.
Your test program is determining the time to:
1) construct a parallel_pipeline
2) run() the parallel_pipeline (initiate tasks for the pipeline)
3) process the data stream for use by the parallel_pipeline
4) terminate the tasks of the parallel_pipeline
TPS, which I assume means transactions per second.
parallel_pipelines do have overhead in setup (steps 1, 2) and in shutdown (step 4). The intended use is to increase the TPS of the data stream (step 3). Therefore their intended use is for data streams of sufficient size to make the overhead of setup and shutdown an insignificant portion of the process.
If your program requirement is to run short streams of data packets through a pipeline (and where no I/O is involved), then I would suggest you consider using parallel_invoke containing your pipe lists, where the pipes use progress barriers:
...
// we are pipe 2, wait for pipe 1
while(!Pipe1Done)
_mm_pause();
...
You may have to experiment with pipe ordering in the parallel_invoke to attain optimal performance.
Jim Dempsey
Thank you for your reply. I had never explored parallel_invoke until your reply. Now I employ parallel_invoke and my application's performance has further improved. Thanks for that. However, my original problem remains: for input data of small size, an increase in the number of threads affects performance negatively (with a constant number of parallel tokens). Any workaround?
Also, in my example the time taken for construction of the parallel_pipeline is not included; that is outside the loop.
Thank you
Zooli
Hi Raf,
If the number of tokens is high and the input size is small, that can reduce performance, and that is quite obvious (which is not the issue I am trying to get solved). If you read the sample program given, it tests equally sized/sliced text chunks (for 8thread_2token and 2thread_2token). Also, the performance figures for 2thread_8token (the 4th benchmark) are much lower than in the 2thread_2token scenario.
The input data given in the example is large enough to get the scale-up, since it gives more performance when parsed with 2 threads (instead of 1).
So the question is: why does it give more performance in the 2thread_2token scenario than in the 8thread_2token scenario? Is it because the task stealing is more 'intense', since there are 8 threads trying to get hold of the tasks?
Thank you
Zooli
I don't really know what happened there, and I don't want to guess yet (if pressed, I might offer stealing and joining overhead). But the test program is pathetically small, so it hardly seems meaningful.
"Also the performance figures for 2thread_8Token (the 4th benchmark) are much less than the 2thread_2token scenario."
Line 88 ("unitLen = bufLen/numTokens;") will make the program use more tasks as the number of tokens increases, each of which has negligible work to do, so the increasing overhead dominates.
"Input data given for the example is large enough to get the scale up since it is giving more performance when parsed with 2 threads (instead of 1) ."
Why don't you print those performance points, and the 8/8 I suggested? Better yet, why not parameterise those measurement functions and run 8 times 8 tests that say exactly how much "more performance" and how it varies?
Yes, unitLen = bufLen/numTokens will divide the text into the same slice size for 2thread_2token and 8thread_2token.
Since you were wondering about the 2thread_8token scenario, I just stated that this scenario gives much lower performance.
I measured this before posting. I did not post the results for 1thread_1token and 8thread_8token because performance was lower and, at that time, that was not the issue.
The machine is not available to me right now. Once it is available I will publish the results.
"I am not able to agree with your argument that small programs are meaningless. After all I am explaining a concept and I do not think you need a big program to explain that especially to such forums."
I meant that small input is often meaningless because you cannot reliably extrapolate the results in a context where overhead is a big part of the equation. Creating and running a pipeline a million times is not going to increase the amount of work each instance does, it only amortises thread startup.
"Since you were wondering about the 2thread_8Token scenario, I just stated that this scenario gives much less performance."
This is indeed the missing combination I asked for (not 8/8 as I wrote in #14), but, e.g., does it compare to 8/8 like 2/2 compares to 8/2 or is the ratio different? Unless somebody else can say more with what you have already provided (did I miss anything?), I can only ask for more (numerical) information to work with before I could possibly say more, and that's not even exploring realistic data sizes.