Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

setting and scanning an array in one go

Walter_D_
Beginner

Hi, I'd like to parallelise the following simple serial code:

tmp = Idx
for(i=0; i<n; ++i) {
  y[i] = tmp;
  f[i] = func(i);  // func(i) is expensive and touching f[i] may be expensive, too
  tmp = tmp x f[i];  // x is some associative operation
}

and I want to call func(i) exactly once for each i. I could use parallel_for() to set f[i] and thereafter parallel_scan() to set y[i]. This touches most f[i] three times (once in parallel_for(), and once each in the pre- and final scans of parallel_scan()). But algorithmically, two touches suffice. So I wonder whether it's possible to implement this code in parallel using just parallel_scan()?
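For concreteness, here is a minimal sketch of that two-phase approach (double and '+' stand in for T and the associative operation x; func is passed in as a plain function pointer here):

[cpp]
#include "tbb/parallel_for.h"
#include "tbb/parallel_scan.h"
#include "tbb/blocked_range.h"

// exclusive-scan body: y[i] receives the running value *before* f[i] is folded in
struct ScanBody {
    double sum;          // running value of tmp over this body's sub-range
    const double* f;
    double* y;
    ScanBody(const double* f_, double* y_) : sum(0), f(f_), y(y_) {}           // 0 == identity of '+'
    ScanBody(ScanBody& b, tbb::split) : sum(0), f(b.f), y(b.y) {}
    template<typename Tag>
    void operator()(const tbb::blocked_range<std::size_t>& r, Tag) {
        double tmp = sum;
        for (std::size_t i = r.begin(); i != r.end(); ++i) {
            if (Tag::is_final_scan()) y[i] = tmp;  // only the final scan writes y[i]
            tmp = tmp + f[i];
        }
        sum = tmp;
    }
    void reverse_join(ScanBody& left) { sum = left.sum + sum; }  // left precedes this
    void assign(ScanBody& b) { sum = b.sum; }
};

void set_and_scan(std::size_t n, double* f, double* y, double (*func)(std::size_t)) {
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i)
                f[i] = func(i);                    // first touch of f[i]
        });
    ScanBody body(f, y);
    tbb::parallel_scan(tbb::blocked_range<std::size_t>(0, n), body);  // 2nd/3rd touches
}
[/cpp]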

I tried, and it's not working. I presume it's impossible, because some i are scanned twice (pre-scan and final scan), but not all. The implementation of the scans (body::operator()) would need to know whether a given final scan has been preceded by a pre-scan or not. While parallel_scan() must have that information, it appears to be impossible to obtain it within body::operator().

Is my analysis correct, i.e. is there no way to implement my simple code without touching most f[i] thrice (as opposed to twice)?

NOTE: using an array of flags, or pre-setting f[i] to a value guaranteed not to occur from func(i), is not an acceptable solution (they both require touching each element of an array).

jimdempseyatthecove
Honored Contributor III

Walter,

I incorporated your changes and made one small change, relating to your func(int i) versus my func(double).

Your code runs; however, ySerial[321500] != yTBB[321500] (N=5000000).

I used the timer code I had in my app, and the init code too. Maybe you can see if I goofed up.

I conditionalized out the QuickThread code.

Your runtime improved significantly:

[plain]

N=5000000

Serial test
TBB test
TBB pipeline test
QuickThread test

TBB Scale: 4.969774
resultSerial = 653402.120624, resultTBB = 668309.625625
ySerial != yTBB
fSerial != fTBB


TBB pipeline Scale: 2.895157
resultSerial = 653402.120624, resultTBBpipeline = 653402.120624

QuickThread Scale: 6.313559
resultSerial = 653402.120624, resultQuickThread = 653402.120624
[/plain]

Jim Dempsey

 

Walter_D_
Beginner

I slightly modified my earlier file to use 64-bit integer arithmetic instead of double (also: added a comparison with the serial results). This avoids floating-point comparison and round-off errors and hence enables a more rigorous test of correctness. I haven't found any error yet.
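(A toy illustration of why integers make the comparison rigorous, not from my test file: unsigned 64-bit addition wraps mod 2^64 and is exactly associative, so any summation order must agree bit-for-bit, unlike double:)

[cpp]
#include <cstdint>
#include <cassert>

int main() {
    // unsigned arithmetic wraps mod 2^64 and '+' is exactly associative,
    // so serial and parallel scans must agree bit-for-bit
    uint64_t a = 0xDEADBEEFCAFEBABEull, b = 3, c = 0xFFFFFFFFFFFFFFFFull;
    assert((a + b) + c == a + (b + c));     // always holds for uint64_t
    // double addition is not associative:
    double x = 1e16;                        // ulp(1e16) == 2
    assert((x + 1.0) + 1.0 == 1e16);        // each 1.0 is rounded away
    assert(x + (1.0 + 1.0) == 1e16 + 2.0);  // grouped small values first, the 2.0 survives
    return 0;
}
[/cpp]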

Jim, you keep using non-standard (MS-specific) code, a strange function call f=func(f), and floating-point arithmetic. Moreover, you seem not really interested in the question at hand, which was whether my implementation using tbb::parallel_scan() is correct. I will ignore any comments not aimed at answering that question. If you think there is some value in discussing various ways to solve this problem then, I presume, you're welcome to open your own thread for that. However, I do think that promoting your own commercial software here is inappropriate.

jimdempseyatthecove
Honored Contributor III

Walter,

Your original description had:

[cpp]

tmp = Idx
for(i=0; i<n; ++i) {
  y[i] = tmp;
  f[i] = func(i);  // func(i) is expensive and touching f[i] may be expensive, too
  tmp = tmp x f[i];  // x is some associative operation
}
[/cpp]

The above being pseudo code. Your proffered solution used a template to generate the code. For simplification, I chose a specific type (double), then incorporated that type into a non-template "func" function containing the loop. This makes the example simple and clear, as there is no potential for hidden operator functions that may have side effects.

The original func function is your sample load function. You can change the argument of the function back to (int i) if you wish, it really doesn't matter, but your comment leads to the assumption that the load function, receiving i (an index), also knows where the array f[] is located. For the example load function in the sample code I provided, I did not give it access to f, principally because the same load function was to run all variations of the test, and it would not know the location of the differing f's (fSerial, fTBB, ...). Therefore, the load function func should either be supplied a) the value (f), b) a reference (f), c) a pointer (&f), d) be a member function of some object in which the loop is enclosed (f is thus known), or e) be a member function whereby the pseudo-code statement becomes "f[i] = obj.func(i);"

T func(int)
became
double func(double)

This function is arbitrary in any event. If you want it to receive an int, you can code it that way.

Now then, the choice of type. I specifically chose type double, and chose to generate values with fractional parts that use the full extent of the floating-point precision. The purpose of this is to assure that any possible sequencing error (difference) between the serial implementation and the parallel implementation would tend to be exposed. Isn't the goal to produce a parallel implementation of a serial algorithm that produces the same results?

RE: However, I do think that promoting your own commercial software here is inappropriate.

You are misinformed.

The QuickThread toolkit is available for free (subject to the GNU General Public License) in source form from www.quickthreadprogramming.com

I am in the process of re-working the site, so bear with me regarding any inconveniences you may have with downloads, content, and documentation. The prior layout had the QuickThread libraries in binary format. Now developers (who know what they are doing) can download the source code to the library.

FWIW, this library code represents over 5 man-years of my programming effort. Most of the effort was spent identifying, locating, and eliminating complicated thread-interaction issues (adverse and performance) that happen inside thread-scheduling code.

RE:  you keep on using non-standard (MS specific) code

Grow up and quit whining. Your complaint involves you adding:

#if defined(__linux)
... linux headers
#else
... Windows headers
#endif

I wrote the test program on a Windows platform and submitted it to this forum. It is unnecessary for me to configure this for your platform, and unwarranted for you to complain about it. You can add the 6 or so lines to the top of the program.

You have a problem with your code, as it gets different results from the serial code. You cannot brush this aside ("we won't care about floating point round off errors..."). I have spent several hours of my time trying to help you resolve your problems.

I found your simple problem a rather interesting problem for parallelization. Basically you have two statements within a loop, where the first statement depends on the result of the second statement from the prior iteration of the loop (or the initial value on the first iteration). This particular problem, as simple as it seems in serial form, is spectacularly difficult to implement properly in parallel (if you have never done this before). I've shown a) your technique is flawed (subject to your correction of my code), b) a fix using the TBB pipeline, c) a suggestion for investigation for improving the TBB method, d) a completely different and unique way of solving the problem using the QuickThread Parallel Manifold.

The principal problems with your implementation (results and performance) (IMHO) (subject to code revision by you) are:

a) while the f[i] = func(i) values are computed properly, the y[i] sum carry from the prior iteration is not. The sum carry must propagate through the f[i] results in the same sequence (else you (may) generate round-off accumulation differences/errors).

b) As you pop out of your partitioning recursion levels, you (appear to me) to be splicing the sum carry in only at the juncture of the partitions, whereas the proper sum carry would be required to propagate through all the elements of the partition's y[i] sub-range. This requires each y[i] to receive as many += operations as there are recursion levels (thus also contributing to a result difference with the serial form). While your current implementation may produce a correct total sum (by chance/approximation), the y[i] will not be correct.

The TBB pipeline method looked promising. However, it requires 3 stages. Two stages would be better; however, due to a design shortcoming in TBB pipeline, the initial tokens to the input stage cannot be initialized/specified. It appears that some undocumented, arbitrary non-NULL void* is passed into the pipeline input stage for startup. And it is not clear whether the same non-NULL void* passed out of the last stage is passed back to the input stage. One work-around for this would be to eliminate the input stage and use an atomic<int> variable as "i". But you and Raf seem to be averse to doing so. Using this would eliminate a pipeline stage, free up a thread, and use fewer computational resources.
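For reference, the 3-stage arrangement is roughly this (a sketch with names of my choosing, not the exact posted code):

[cpp]
#include "tbb/pipeline.h"

// serial index generator -> parallel f[i]=func(i) -> serial-in-order scan carrying tmp
void pipeline_set_and_scan(std::size_t n, double* f, double* y, double (*func)(std::size_t)) {
    std::size_t next = 0;
    double tmp = 0;                            // identity of the associative op
    tbb::parallel_pipeline(16 /* tokens in flight */,
        tbb::make_filter<void, std::size_t>(tbb::filter::serial_in_order,
            [&](tbb::flow_control& fc) -> std::size_t {
                if (next == n) { fc.stop(); return 0; }
                return next++;                 // generate indices in order
            })
      & tbb::make_filter<std::size_t, std::size_t>(tbb::filter::parallel,
            [&](std::size_t i) -> std::size_t {
                f[i] = func(i);                // the expensive part runs in parallel
                return i;
            })
      & tbb::make_filter<std::size_t, void>(tbb::filter::serial_in_order,
            [&](std::size_t i) {
                y[i] = tmp;                    // carry propagates in strict index order
                tmp = tmp + f[i];
            }));
}
[/cpp]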

Jim Dempsey

 

jimdempseyatthecove
Honored Contributor III

P.S. The build of the QuickThread library (currently on the website) fails on Ubuntu Linux. I fixed the code here today, and will configure the sample program I posted on the forum for running on Linux (assuming the library still runs properly after the handful of fixes I had to make). The principal issues related to Ubuntu requiring an update to 12.04.2, and this disturbed my Linux development environment. Once I have it all working, I will send you the code, at least the library and headers for QuickThread, so you can compile and link your program on your computer. I can attach those to this forum with instructions. If (later) you want the full source to the library, you can download that for free (subject to the GPL license).

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

I got the code running on Linux. Getting the app to compile on Linux was problematic due to the Windows functions, __int64, and intrinsics. I used g++; I may have had fewer problems with Intel C++ (haven't installed it on Ubuntu yet).

I made 3 test runs to check runtimes and consistency of results. The parallel_scan failed at different points, most likely due to the issues discussed earlier. There is some variation in the runtimes; I am not sure what that is about. I have the QuickThread code starting up after the TBB runs. It creates its own set of threads, so there may be an issue of TBB block time at the end of available tasks. Can't be sure.

------------------ run data --------------
[cpp]
Run 1
N=5000000

Serial test 51694802116 ticks
TBB test 9759266303 ticks
TBB pipeline test 17300539246 ticks
QuickThread test 8305274729 ticks

TBB Scale: 5.296997
resultSerial = 653402.120624, resultTBB = 668309.625625
ySerial[468750] != yTBB[468750]
fSerial[4687500] != fTBB[4687500]


TBB pipeline Scale: 2.988046
resultSerial = 653402.120624, resultTBBpipeline = 653402.120624

QuickThread Scale: 6.224334
resultSerial = 653402.120624, resultQuickThread = 653402.120624

Run 2
N=5000000

Serial test 51729713120 ticks
TBB test 9784810838 ticks
TBB pipeline test 17331391107 ticks
QuickThread test 9223363363 ticks

TBB Scale: 5.286736
resultSerial = 653402.120624, resultTBB = 668309.625625
ySerial[937500] != yTBB[937500]
fSerial[4687500] != fTBB[4687500]


TBB pipeline Scale: 2.984741
resultSerial = 653402.120624, resultTBBpipeline = 653402.120624

QuickThread Scale: 5.608552
resultSerial = 653402.120624, resultQuickThread = 653402.120624

Run 3
N=5000000

Serial test 51706789862 ticks
TBB test 9762459783 ticks
TBB pipeline test 17341465614 ticks
QuickThread test 8269594338 ticks

TBB Scale: 5.296492
resultSerial = 653402.120624, resultTBB = 668309.625625
ySerial[937500] != yTBB[937500]
fSerial[4687500] != fTBB[4687500]


TBB pipeline Scale: 2.981685
resultSerial = 653402.120624, resultTBBpipeline = 653402.120624

QuickThread Scale: 6.252639
resultSerial = 653402.120624, resultQuickThread = 653402.120624
[/cpp]
---------- end run data --------
Interesting: when I place the QuickThread code in front of TBB and leave the QuickThread thread pool active, TBB appears to run single-threaded??

If I scope the qt::qtInit init(-1,0); thread-pool initiator such that, when it exits scope, it disbands the QuickThread thread pool, then TBB again builds a complete thread pool (8 threads).

IOW, if building a mixed application (not necessarily the best thing to do), then establish the TBB thread pool first.
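In sketch form (the QuickThread header name here is illustrative; the qtInit signature is the one used above):

[cpp]
#include "tbb/task_scheduler_init.h"
#include "QuickThread.h"   // illustrative header name

int main() {
    tbb::task_scheduler_init tbbPool;  // establish TBB's full thread pool first
    {
        qt::qtInit init(-1, 0);        // then the QuickThread pool, scoped
        // ... run the mixed TBB + QuickThread tests here ...
    }                                  // QuickThread pool disbands at end of scope
    return 0;
}
[/cpp]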

If anyone is interested, I can post the code for your review. I will get updated archives onto the website soon; I can post the code here if you are anxious to run the test.

Jim Dempsey

RafSchietekat
Valued Contributor III

Did you change the tests to allow for differently rounded results, though, because I wouldn't consider it "failing"? Look for "Reference Manual" above to find a quote about that.

jimdempseyatthecove
Honored Contributor III

Raf,

I spent last night (till 1:45 this morning) reworking my version of Walter's upload so that it closely exemplifies Walter's code, with a few exceptions. These exceptions are reasonable and align with Walter's original problem statement. And my code is implemented as templates, so Walter cannot gripe about this. Differences:

1) Walter's synthetic "load" function used std::this_thread::sleep_for(std::chrono::nanoseconds(W));, which may suspend the thread, thus taking load off the processor (and permitting oversubscription to show additional false scaling, though oversubscription was not used). My load function is a non-floating-point compute loop (using __rdtsc() and _mm_pause(), with those functions included with the main). This function could be changed to be floating-point intensive, or some mix of floating point, integer, and memory R/W if so desired.

2) Walter's original problem statement said "func(i) is expensive and touching f[i] may be expensive, too". Therefore I changed the load function to take arguments of i and f[], and then reference f[i] and incorporate it into the result.

3) Walter's original load function returned results that did not tax the precision (it generated very small differences per i). This would lead one to assume minor round-off errors. Therefore I changed this computation too.

Load function:

[cpp]
    // Jim's function, generates a higher degree of error in TBBscan
    auto func = [=] (std::size_t i, double* f) -> double
    {
        // stall for W clock ticks
        __uint64 t0 = __rdtsc();
        while(__rdtsc() - t0 < W)
            _mm_pause();
        return sin(f[i]);
    };
[/cpp]

The following is an output report of one test run. I changed the report format to be in line with Walter's coding style. But I expanded the error testing to print out the first (up to 10) elements that differ between the Serial and other runs, for the f and y arrays. Results are:

[bash]
N=5000000 W=10000
Serial               15174965 micro seconds result = 653402.1206241812
TBB scan             2263908 micro seconds result = 656961.0825077833
TBB pipeline         5104435 micro seconds result = 653402.1206241812
QuickThread Manifold 2453756 micro seconds result = 653402.1206241812

TBB scan Scale: 6.702995439744018
resultSerial =  653402.1206241812
resultTBBscan = 656961.0825077833
ySerial[468750] =   106009.4110755896
yTBBscan[468750] =  106009.4110755854
ySerial[468751] =   106009.8476355567
yTBBscan[468751] =  106009.8476355525
ySerial[468752] =   106010.2841963265
yTBBscan[468752] =  106010.2841963223
ySerial[468753] =   106010.720757899
yTBBscan[468753] =  106010.7207578947
ySerial[468754] =   106011.157320274
yTBBscan[468754] =  106011.1573202698
ySerial[468755] =   106011.5938834517
yTBBscan[468755] =  106011.5938834475
ySerial[468756] =   106012.030447432
yTBBscan[468756] =  106012.0304474278
ySerial[468757] =   106012.4670122149
yTBBscan[468757] =  106012.4670122107
ySerial[468758] =   106012.9035778005
yTBBscan[468758] =  106012.9035777963
ySerial[468759] =   106013.3401441887
yTBBscan[468759] =  106013.3401441845
fSerial[4921875] =   -0.8294587162310608
fTBBscan[4921875] =  -0.7375659637302644
fSerial[4921876] =   -0.8294586000725364
fTBBscan[4921876] =  -0.7375658852913012
fSerial[4921877] =   -0.8294584839134298
fTBBscan[4921877] =  -0.7375658068519348
fSerial[4921878] =   -0.8294583677537409
fTBBscan[4921878] =  -0.7375657284121655
fSerial[4921879] =   -0.82945825159347
fTBBscan[4921879] =  -0.737565649971993
fSerial[4921880] =   -0.8294581354326167
fTBBscan[4921880] =  -0.7375655715314174
fSerial[4921881] =   -0.8294580192711812
fTBBscan[4921881] =  -0.7375654930904387
fSerial[4921882] =   -0.8294579031091633
fTBBscan[4921882] =  -0.7375654146490568
fSerial[4921883] =   -0.8294577869465634
fTBBscan[4921883] =  -0.7375653362072718
fSerial[4921884] =   -0.8294576707833813
fTBBscan[4921884] =  -0.7375652577650839


TBB pipeline Scale: 2.972898
resultSerial = 653402.120624, resultTBBpipeline = 653402.120624

QuickThread Scale: 6.184382
resultSerial = 653402.120624, resultQuickThread = 653402.120624
[/bash]

Using the original load function, the observed error in the result sum values differed only after the 12th or 13th place, leading Walter to assume a false confidence in the TBB scan code.

The above report shows an error in the 3rd place, ~10 orders of magnitude greater. This does not give me much confidence in the offered TBB scan implementation.

Jim Dempsey

RafSchietekat
Valued Contributor III

I'm still not sure which implementation is to be considered in error. I even have more confidence in the outcome from TBB, because it is accumulating smaller values during the prescan (less rounding error than when adding small values to a larger one) before adding the results at coarser intervals, so I think parallel_scan() would have less variation when varying the type between float, double and long double. For the full effect, but only if one didn't care about the additional cost of one more addition per element, one might even mimic that during the final scan and delay adding the accumulated result from the current chunk's prefix until just before writing each scan value.
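A toy illustration of the effect (float for emphasis; not the actual test code):

[cpp]
#include <cstdio>

int main() {
    // ulp(1e8f) == 8, so adding 1.0f directly to a large accumulator is lost...
    float running = 1e8f;
    for (int i = 0; i < 1000; ++i) running += 1.0f;      // each add rounds away
    // ...whereas accumulating the small values first (as the prescan does)
    // and adding the partial sum at a coarser interval keeps them
    float partial = 0.0f;
    for (int i = 0; i < 1000; ++i) partial += 1.0f;
    float prescanStyle = 1e8f + partial;
    std::printf("running = %.1f, prescan-style = %.1f\n", running, prescanStyle);
    // prints: running = 100000000.0, prescan-style = 100001000.0
    return 0;
}
[/cpp]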

jimdempseyatthecove
Honored Contributor III

Raf,

I forgot to attach the matching QuickThread archive, so here it is if you are interested in trying to run this on your Mac.

QuickThread_v3/*.*   .cpp and other files for maintaining the library
QuickThread_v3/include   include headers for use by applications and for building the QuickThread library
QuickThread_v3/Release   x64 libQuickThread.a (and other .o files)

For building Walter.cpp

Add the QuickThread_v3/include path to your include path
Add the QuickThread_v3/Release path to your lib path
Add the libraries in the order: QuickThread, rt, dl, numa, tbb

C++ options Add: -std=c++0x

Note, you will need to have boost installed too (I use one template function in the meta language), and you may need to install libnuma as well to resolve link issues (though NUMA functionality won't be used on your Mac)

Let me know if you can build Walter and if it runs on your Mac.

Jim Dempsey

Walter_D_
Beginner

I have only little time for this enterprise, but here are some observations:

My implementation of the set-and-scan is definitely not 100% correct, as uncovered by Jim's test. What happens is that for some i, func() is called twice (the attached code explicitly tests for this). This only causes an error in the result for Jim's test, not my original test. The reason is that while a final scan done by a body that has been reverse_join()ed has definitely been preceded by a pre-scan, the reverse is not true: a final scan by a body that has not been reverse_join()ed may or may not have been pre-scanned. Typically, a pre-scan in this situation only occurs for the second-to-last range.
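To make the failure mode concrete, this is the simplified shape of such a fused body (double and '+' again; a sketch, my actual code is in the attached file):

[cpp]
#include "tbb/parallel_scan.h"
#include "tbb/blocked_range.h"

// fused set-and-scan: f[i] = func(i) is computed inside parallel_scan itself
template<typename Func>
struct FusedBody {
    double sum;
    Func func;
    double* f;
    double* y;
    FusedBody(Func fn, double* f_, double* y_) : sum(0), func(fn), f(f_), y(y_) {}
    FusedBody(FusedBody& b, tbb::split) : sum(0), func(b.func), f(b.f), y(b.y) {}
    template<typename Tag>
    void operator()(const tbb::blocked_range<std::size_t>& r, Tag) {
        double tmp = sum;
        for (std::size_t i = r.begin(); i != r.end(); ++i) {
            if (Tag::is_final_scan()) y[i] = tmp;
            f[i] = func(i);   // called here during a pre-scan AND again if the
                              // same range is later re-executed as a final scan
            tmp = tmp + f[i];
        }
        sum = tmp;
    }
    void reverse_join(FusedBody& left) { sum = left.sum + sum; }
    void assign(FusedBody& b) { sum = b.sum; }
};
// usage: FusedBody<decltype(func)> body(func, f, y);
//        tbb::parallel_scan(tbb::blocked_range<std::size_t>(0, n), body);
[/cpp]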

For my particular purpose, this is not critical, because the function call is not really that expensive and because it will give the same result with every call.

Now about Jim's code. AFAICS, it scans serially, but in parallel with the set operation. If the set operation is time-consuming (as specified in my original question), then this is still efficient for sufficiently few threads: the serial scan stage costs ~n*cost_scan while the parallel set stage costs ~n*cost_set/n_proc. However, for n_proc > cost_set / cost_scan the serial scan becomes the bottleneck, and the code will cease to scale well (Jim: you should see this). So this solution is not suitable for my situation.

Jim, at some point I will have a look at QT. How easy is it to use (compared with TBB)? TBB has extensive documentation; does QT too? I'm really interested in efficient task-based parallelism. Do you know MassiveThreads? If so, any opinion?

jimdempseyatthecove
Honored Contributor III

Yes Walter,

"for n_proc > cost_set / cost_scan the code will cease to scale well"

This is true, but your problem statement had a high cost_set relative to cost_scan (your terms), and a requirement of the scan taking the value of a subsequent iteration using an associative accumulation function (into tmp) and inserting it into the stream. This is by its nature sequential (analogous to a ripple carry in an adder; however, in this case your "adder" is equivalent to an analog adder as opposed to a binary adder, and this precludes the use of the carry-look-ahead-like operation that you are trying to do with your parallel scan).

Now if your actual requirements are quite different from this, why didn't you say so from the beginning?
Different problems require different solutions.

QuickThread is easier to integrate into existing applications, since it can call standard functions and standard member functions (in addition to lambda functions). You do not need a special operator() stuck into new classes. For the most part you can use your old code somewhat as-is (as-was). You will have to add a few things here and there, but not much; you can keep the edits simple or go wild. With TBB you (almost) must create a lambda function to call your old functions, then use that in your parallel_... With QuickThread you can make the call either way (with some limitations and many extensions).

void Foo(int i, int n, double A[], double B[], double C[])
{
... original function code
}
...
// in your code elsewhere
qt::parallel_for(Foo, 0, n, a, b, c);

Now here is where it gets interesting. Assume function Foo formats the data and writes it to a file. In TBB, tasks that perform I/O can at times be problematic, as they may be part of a nested team. QuickThread has two classes of threads: Compute and I/O. To "fix" this, issue:

qt::parallel_task(qt::IO$, Foo, 0, n, a, b, c); // I/O class thread to perform task

Compute-class threads are affinity-pinned to a logical processor; I/O-class threads are not pinned and run with a boost in priority (IOW, the O/S is free to resume the thread on an idle logical processor).

QuickThread has a large degree of flexibility with regard to control of tasks. If you are on a large multi-socket NUMA system, you can do some really interesting things. For example, you can tile rows across sockets, and columns across HW threads within a socket (assume qt is namespaced):

[cpp]
parallel_for(OneEach_L3$,        // outer level: one thread per L3 cache (socket)
    [](int iBegin, int iEnd) {
        parallel_for(L3$,        // inner level: begun by that one thread per socket,
                                 // teams with all other threads within its L3 cache
            [](int jBegin, int jEnd) {
                // in here each thread sees the objects in the scope of the outer parallel_for,
                // a thread-specific int jBegin, int jEnd,
                // a socket (L3) specific int iBegin, int iEnd
                // ... do work here in a two-way tile arrangement
            },                   // end inner lambda
            0, nColumns);        // slice inner loop by columns (to each thread in same L3)
    },                           // end outer lambda
    0, nRows);                   // slice outer loop by rows (into each socket)
[/cpp]

Or, if you have a floating point loop that works best using one thread per core on a system with HT

parallel_for(OneEach_L1$, ... // any one thread per core all cores (no HT siblings)

TBB is a great product for many/most application requirements.
QuickThread has different capabilities.
The documents and ticklers are on my website.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

Walter,

Could you point out your edits between your last two uploads? I did not see anything in particular.

Because of your problem restatement in your last posting, stating you will permit rounding differences due to order of summation, I took the liberty of adding an additional QuickThread solution that does not have the requirement of preserving the summation order. This technique makes use of qt::parallel_distribute, with an algorithm similar to yours: partition, produce partial sums, use the partial sums to produce the final sum. This technique assumes you will accept a small degree of difference between the serial and parallel implementations. The qt::parallel_distribute method addresses your "for n_proc > cost_set / cost_scan the code will cease to scale well" issue stated earlier.

[cpp]

template<typename Func, typename T>
T QuickThread_distribute_scan_and_set(std::size_t i, const std::size_t n, Func const&func,
        T*f, T*y)
{
    T    ret;
    std::vector<T> StitchTable(qt::qt_get_num_threads());

    // first pass, partition and compute y for each partition
    qt::parallel_distribute(
            qt::AllThreads$,
            [&](intptr_t iThread, intptr_t nThreads)
            {
                qt::qtSlice<intptr_t> Slice( iThread, nThreads, i, n );
                StitchTable[iThread] = Serial_scan_and_set(Slice.iBegin+i, Slice.iEnd+i, func, f, y); // +i in event i != 0
            });

    // second pass, propagate "carry" (sum) from end of last partition to all elements of our partition
    qt::parallel_distribute(
            qt::AllThreads$,
            [&](intptr_t iThread, intptr_t nThreads)
            {
                T sum(0);
                if(iThread == 0)
                {
                    for(int i=0; i < nThreads; ++i)
                        sum += StitchTable[i];
                    ret = sum;
                }
                else
                {
                    for(int i=0; i < iThread; ++i)
                        sum += StitchTable[i];
                    qt::qtSlice<intptr_t>    Slice(iThread, nThreads, i, n);
                    for(int i=Slice.iBegin; i < Slice.iEnd; ++i)
                        y[i] += sum;
                }
            });
    return ret;
}
[/cpp]

To aid in determining the degree of differences I changed the report to produce an error estimation.
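For reference, the estimate is of this general shape (an illustrative sketch, not the exact report code; a signed sum of element-wise differences, serial minus parallel):

[cpp]
#include <cstddef>

// signed accumulation of element-wise differences between two result arrays
template<typename T>
T errorEstimate(const T* serial, const T* parallel, std::size_t n) {
    T e(0);
    for (std::size_t i = 0; i < n; ++i)
        e += serial[i] - parallel[i];   // sign is preserved, so cancellations show up
    return e;
}
[/cpp]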

Also, relating to "for n_proc > cost_set / cost_scan the code will cease to scale well"

The demo program can force the issue, even with the low n_proc I have (8), by reducing the time for cost_set.

I ran tests using loads of 100000, 10000, 1000, 100, 10, 0 clock ticks:

[bash]
*** Caution W=100000 (computational weight of func) long time to run 147s for Serial
N=5000000 W=100000
Serial               147127129 micro seconds result = 653402.1206241812
TBB scan             36794242 micro seconds result = 682202.2004301561
TBB pipeline         20816859 micro seconds result = 653402.1206241812
QuickThread Manifold 18602736 micro seconds result = 653402.1206241812
QuickThread Distribute 18476807 micro seconds result = 653402.1206240761

TBB scan Scale: 3.998645467407645 resultSerial =  653402.1206241812 resultTBBscan = 682202.2004301561
 f error =0.01119778852088225 y error =-0.001588202190593666

TBB pipeline Scale: 7.067691095952564 resultSerial =  653402.1206241812 resultTBBpipeline = 653402.1206241812
 f error =0 y error =0

QuickThread Manifold Scale: 7.908897325640702 resultSerial =  653402.1206241812 resultQuickThread = 653402.1206241812
 f error =0 y error =0

QuickThread Distribute Scale: 7.962800553147521 resultSerial =  653402.1206241812 resultQuickThread2 = 653402.1206241812
 f error =0 y error =2.766447327871797e-14

------------------------------------------------------------
*** lighter computational weight for func, Serial runtime 15 seconds
N=5000000 W=10000
Serial               15187231 micro seconds result = 653402.1206241812
TBB scan             2402293 micro seconds result = 660682.0901275538
TBB pipeline         5120833 micro seconds result = 653402.1206241812
QuickThread Manifold 2115407 micro seconds result = 653402.1206241812
QuickThread Distribute 1919942 micro seconds result = 653402.1206240761

TBB scan Scale: 6.321972798488777 resultSerial =  653402.1206241812 resultTBBscan = 660682.0901275538
 f error =0.002827559893142322 y error =-8.245543248852384e-05

TBB pipeline Scale: 2.965773537235055 resultSerial =  653402.1206241812 resultTBBpipeline = 653402.1206241812
 f error =0 y error =0

QuickThread Manifold Scale: 7.179342320413991 resultSerial =  653402.1206241812 resultQuickThread = 653402.1206241812
 f error =0 y error =0

QuickThread Distribute Scale: 7.910255101456189 resultSerial =  653402.1206241812 resultQuickThread2 = 653402.1206241812
 f error =0 y error =2.766447327871797e-14

--------------------------------------------------------------------
N=5000000 W=1000
Serial               1858912 micro seconds result = 653402.1206241812
TBB scan              318373 micro seconds result = 656961.0825078025
TBB pipeline         2856357 micro seconds result = 653402.1206241812
QuickThread Manifold 2724758 micro seconds result = 653402.1206241812
QuickThread Distribute  259899 micro seconds result = 653402.1206240761

TBB scan Scale: 5.838786580520333 resultSerial =  653402.1206241812 resultTBBscan = 656961.0825078025
 f error =0.00137192822210194 y error =-1.938172739993035e-05

TBB pipeline Scale: 0.6507982020454726 resultSerial =  653402.1206241812 resultTBBpipeline = 653402.1206241812
 f error =0 y error =0

QuickThread Manifold Scale: 0.6822301283269927 resultSerial =  653402.1206241812 resultQuickThread = 653402.1206241812
 f error =0 y error =0

QuickThread Distribute Scale: 7.152439986302372 resultSerial =  653402.1206241812 resultQuickThread2 = 653402.1206241812
 f error =0 y error =2.766447327871797e-14

-------------------------------------------------------------
N=5000000 W=100
Serial                510895 micro seconds result = 653402.1206241812
TBB scan              117841 micro seconds result = 668309.6256249581
TBB pipeline         1836910 micro seconds result = 653402.1206241812
QuickThread Manifold 2310255 micro seconds result = 653402.1206241812
QuickThread Distribute   81975 micro seconds result = 653402.1206240761

TBB scan Scale: 4.335460493376669 resultSerial =  653402.1206241812 resultTBBscan = 668309.6256249581
 f error =0.005840287488286404 y error =-0.0003632632077216882

TBB pipeline Scale: 0.278127398729388 resultSerial =  653402.1206241812 resultTBBpipeline = 653402.1206241812
 f error =0 y error =0

QuickThread Manifold Scale: 0.2211422548593121 resultSerial =  653402.1206241812 resultQuickThread = 653402.1206241812
 f error =0 y error =0

QuickThread Distribute Scale: 6.232326928941751 resultSerial =  653402.1206241812 resultQuickThread2 = 653402.1206241812
 f error =0 y error =2.766447327871797e-14
-------------------------------------
N=5000000 W=10
Serial                500001 micro seconds result = 653402.1206241812
TBB scan              109771 micro seconds result = 660682.0901275421
TBB pipeline         1850477 micro seconds result = 653402.1206241812
QuickThread Manifold 2604695 micro seconds result = 653402.1206241812
QuickThread Distribute   81658 micro seconds result = 653402.1206240761

TBB scan Scale: 4.554946206192892 resultSerial =  653402.1206241812 resultTBBscan = 660682.0901275421
 f error =0.002827559893142322 y error =-8.245543248006592e-05

TBB pipeline Scale: 0.2702011427323874 resultSerial =  653402.1206241812 resultTBBpipeline = 653402.1206241812
 f error =0 y error =0

QuickThread Manifold Scale: 0.1919614388632834 resultSerial =  653402.1206241812 resultQuickThread = 653402.1206241812
 f error =0 y error =0

QuickThread Distribute Scale: 6.12311102402704 resultSerial =  653402.1206241812 resultQuickThread2 = 653402.1206241812
 f error =0 y error =2.766447327871797e-14
-------------------------------------------------
N=5000000 W=0
Serial                498863 micro seconds result = 653402.1206241812
TBB scan              113384 micro seconds result = 668309.6256249611
TBB pipeline         1841656 micro seconds result = 653402.1206241812
QuickThread Manifold 2666495 micro seconds result = 653402.1206241812
QuickThread Distribute   81478 micro seconds result = 653402.1206240761

TBB scan Scale: 4.399765398998095 resultSerial =  653402.1206241812 resultTBBscan = 668309.6256249611
 f error =0.005840287488286404 y error =-0.0003632632077174592

TBB pipeline Scale: 0.2708774059867858 resultSerial =  653402.1206241812 resultTBBpipeline = 653402.1206241812
 f error =0 y error =0

QuickThread Manifold Scale: 0.1870856686399187 resultSerial =  653402.1206241812 resultQuickThread = 653402.1206241812
 f error =0 y error =0

QuickThread Distribute Scale: 6.122671150494612 resultSerial =  653402.1206241812 resultQuickThread2 = 653402.1206241812
 f error =0 y error =2.766447327871797e-14
[/bash]

The qt::parallel_distribute holds up well for scaling, and has an inconsequential degree of error (when you permit such errors).

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

Oops, take the +i's off the Serial_scan_and_set arguments (and the comment). The correct line is:

StitchTable[iThread] = Serial_scan_and_set(Slice.iBegin, Slice.iEnd, func, f, y);

Jim Dempsey
