I've tried the following code:
[cpp]
class RenderByLines {
    LPDWORD* const my_lines;
public:
    void Perform(const int from, const int to) const {
        LPDWORD* lines = my_lines;
        for (int yy = from; yy != to; yy++) {
            int y = 127 - (yy / 2);
            if ((yy & 0x1) == 1) { y = 255 - y; }
            LPDWORD line = lines[y];
            for (int x = 0; x < 256; x++) {
                int q = 1;
                // inner loop: artificial per-pixel workload
                for (int z = 0; z < 256; z++) { q = (q * x + z) * y; }
                line[x] = x + (256 * y) + (65536 * (q & 0xFF));
            }
        }
    }
    void operator()(const blocked_range<int>& r) const { Perform(r.begin(), r.end()); }
    RenderByLines(LPDWORD* lines) : my_lines(lines) {}
};

void testPF() {
    tick_count t0 = tick_count::now();
    parallel_for(blocked_range<int>(0, 256, 1), RenderByLines(microLines));
    tick_count t1 = tick_count::now();
    fprintf(debugFile, "test parallel_for: %d mks\n", (int)((t1 - t0).seconds() * 1000000));
}

void testPFA() {
    tick_count t0 = tick_count::now();
    parallel_for(blocked_range<int>(0, 256), RenderByLines(microLines), auto_partitioner());
    tick_count t1 = tick_count::now();
    fprintf(debugFile, "test parallel_for_auto: %d mks\n", (int)((t1 - t0).seconds() * 1000000));
}

void testS() {
    tick_count t0 = tick_count::now();
    RenderByLines(microLines).Perform(0, 256);
    tick_count t1 = tick_count::now();
    fprintf(debugFile, "test sequential: %d mks\n", (int)((t1 - t0).seconds() * 1000000));
}

void PerformParallelRaytrace() {
    for (int i = 0; i < 8; i++) { testPF(); }
    for (int i = 0; i < 8; i++) { testPFA(); }
    for (int i = 0; i < 8; i++) { testS(); }
    for (int i = 0; i < 8; i++) { testPF(); }
    for (int i = 0; i < 8; i++) { testPFA(); }
    for (int i = 0; i < 8; i++) { testS(); }
}
[/cpp]
And got these results:
[plain]
test parallel_for: 102283 mks
test parallel_for: 74223 mks
test parallel_for: 74463 mks
test parallel_for: 78521 mks
test parallel_for: 80608 mks
test parallel_for: 76435 mks
test parallel_for: 78746 mks
test parallel_for: 76028 mks
test parallel_for_auto: 77084 mks
test parallel_for_auto: 79055 mks
test parallel_for_auto: 69775 mks
test parallel_for_auto: 74822 mks
test parallel_for_auto: 80514 mks
test parallel_for_auto: 76490 mks
test parallel_for_auto: 92116 mks
test parallel_for_auto: 92822 mks
test sequential: 122995 mks
test sequential: 128488 mks
test sequential: 120866 mks
test sequential: 121859 mks
test sequential: 120693 mks
test sequential: 120593 mks
test sequential: 120668 mks
test sequential: 120733 mks
test parallel_for: 71991 mks
test parallel_for: 73140 mks
test parallel_for: 73169 mks
test parallel_for: 71933 mks
test parallel_for: 73696 mks
test parallel_for: 73777 mks
test parallel_for: 73173 mks
test parallel_for: 77050 mks
test parallel_for_auto: 77619 mks
test parallel_for_auto: 74510 mks
test parallel_for_auto: 76760 mks
test parallel_for_auto: 73786 mks
test parallel_for_auto: 74553 mks
test parallel_for_auto: 73108 mks
test parallel_for_auto: 73642 mks
test parallel_for_auto: 73634 mks
test sequential: 120947 mks
test sequential: 120867 mks
test sequential: 122299 mks
test sequential: 120984 mks
test sequential: 121005 mks
test sequential: 120843 mks
test sequential: 120354 mks
test sequential: 120945 mks
[/plain]
This seems very strange. When there was no inner loop (the q calculation), auto_partitioner worked well: it was about 15% faster. But on this long-running computation, auto_partitioner costs about 2 additional milliseconds!
I can't understand this result.
Wild (meta) guess: auto_partitioner has to guess at a good division of work between the cores, and if it guesses slightly wrong (which is bound to happen), some cores finish earlier than others but cannot steal work from the threads that are still busy. That is not a problem with simple_partitioner and grainsize 1, as long as the task overhead is limited relative to the amount of work a task does. You'll probably see increasing runtimes (if you want to distinguish such small differences) as you manually increase the grainsize, probably with an optimum somewhere in between. If your program is doing other things besides this and waiting for it to finish, auto_partitioner should still provide the best results. To verify, run the outer loop of 8 iterations in parallel instead (with simple_partitioner and grainsize 1) to give TBB the chance to amortise the final wait to zero, and then please report back (I'm curious); something like the sketch below.
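To illustrate the suggestion, here is a minimal sketch of running the outer 8-iteration loop in parallel (it reuses RenderByLines and microLines from the first post; OuterBody and testNested are made-up names for this example):
[cpp]
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
using namespace tbb;

// Hypothetical outer body: each outer iteration runs the same inner
// parallel_for as before, so workers that finish one inner loop early
// can pick up work from another, amortising the final wait.
struct OuterBody {
    void operator()(const blocked_range<int>& r) const {
        for (int i = r.begin(); i != r.end(); ++i) {
            parallel_for(blocked_range<int>(0, 256),
                         RenderByLines(microLines), auto_partitioner());
        }
    }
};

void testNested() {
    // outer range of 8 iterations with grainsize 1, as suggested above
    parallel_for(blocked_range<int>(0, 8, 1), OuterBody());
}
[/cpp]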
Great idea. I don't know what was going on this morning, but now auto_partitioner works just as expected (maybe it has a seventh sense ;-)
Here is the new code sample:
[cpp]
void testPFA() {
    for (int i = 0; i < 256; i++) { intervals[i] = -1; }
    tick_count t0 = tick_count::now();
    parallel_for(blocked_range<int>(0, 256), RenderByLines(microLines), auto_partitioner());
    tick_count t1 = tick_count::now();
    fprintf(debugFile, "test parallel_for_auto: %d mks\n", (int)((t1 - t0).seconds() * 1000000));
    // print the size of each chunk the partitioner handed to the body
    for (int i = 0; i < 256; i++) {
        if (intervals[i] >= 0) { fprintf(debugFile, "%d ", intervals[i] - i); }
    }
    fprintf(debugFile, "\n\n");
}
[/cpp]
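The snippet doesn't show where intervals gets filled in; presumably the body records each chunk it is handed, roughly like this (a sketch under that assumption, with intervals as a plain global array):
[cpp]
// Sketch (not shown in the post): the body notes each chunk it receives,
// so the printing loop above can report the chunk sizes afterwards.
int intervals[256];   // assumed global, reset to -1 before each run

class RenderByLines {
    LPDWORD* const my_lines;
public:
    void Perform(const int from, const int to) const;   // as in the first post
    void operator()(const blocked_range<int>& r) const {
        intervals[r.begin()] = r.end();   // chunk [begin, end) recorded at its start index
        Perform(r.begin(), r.end());
    }
    RenderByLines(LPDWORD* lines) : my_lines(lines) {}
};
[/cpp]
With that, the printed value intervals[i] - i is simply the size of the chunk that started at line i, which is what the figures below show.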
... and new result:
[plain]
test parallel_for: 60229 mks
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

test parallel_for_auto: 60136 mks
64 64 32 32 32 32

test sequential: 118471 mks
[/plain]
So, thanks.
The work could be nominally balanced (in the range domain) but actually unbalanced (some elements requiring more work than others, and/or interference from other threads or processes). For the diagnostic output to really make sense (I have not checked the implementation of auto_partitioner), we'd also have to know the number of cores (the numbers in #3 seem unproblematic for 2 cores), and perhaps the timing of each body invocation (2% may be consistent with accidental differences); one way to get those timings is sketched below.
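For completeness, per-invocation timings could be collected with a small wrapper along these lines (a sketch only; TimedRender is a made-up name, while RenderByLines, microLines and debugFile come from the earlier posts):
[cpp]
// Sketch: wrap the body and time each invocation, to see whether some
// chunks take disproportionately long. Note that fprintf calls from
// different worker threads may interleave in the output.
class TimedRender {
    RenderByLines my_body;
public:
    TimedRender(LPDWORD* lines) : my_body(lines) {}
    void operator()(const blocked_range<int>& r) const {
        tick_count t0 = tick_count::now();
        my_body.Perform(r.begin(), r.end());
        tick_count t1 = tick_count::now();
        fprintf(debugFile, "chunk [%d,%d): %d mks\n", r.begin(), r.end(),
                (int)((t1 - t0).seconds() * 1000000));
    }
};

// usage: parallel_for(blocked_range<int>(0, 256), TimedRender(microLines), auto_partitioner());
[/cpp]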
Random thought (just to free my mind): how about giving auto_partitioner a hint with some estimated range statistics, like the maximum ratio between iteration execution times (bigger ratio => smaller bites), perhaps the correlation between neighbours, the cost of splitting, ...
