setting and scanning an array on one go

Walter_D_ · ‎06-18-2013

Hi I'd like to parallelise the following simple serial code:

tmp = Id_x

for(i=0; i<n; ++i) {

y = tmp;

f = func(i); // func(i) is expensive and touching f may be expensive, too

tmp = tmp x f; // x is some associative operation

}

and I want to call func(i) exactly once for each i. I could use parallel_for() to set f and thereafter parallel_scan() to set y. This touches most f three times (in parallel_for(), and in the pre- and final scans of parallel_scan()). But, algorithmically 2 touches suffice. So, I wonder whether it's possible to implement this code in parallel using just parallel_scan()?

I tried, and it's not working. I presume it's impossible, because some i are scanned twice (pre-scan and final scan), but not all. The implementation of the scans (body::operator()) would need to know whether a given final scan has been preceeded by a pre-scan or not. While parallel_scan() must have that information, it appears that it is impossible to obtain it for the implementation of body::operator().

Is my analysis correct, i.e. there is no way to implement my simple code without touching most f thrice (as opposed to twice)?

NOTE: using an array of flags or pre-setting f to a value guaranteed not to occur from func(i) is not an acceptable solution (they body require touching each element of an array).

Walter_D_ · ‎06-18-2013

I just figured that it may work after all. The required information seems available via the body::reverse_join(). A final scan which was not pre-ceeded by its body beeing reverse_join()ed (by another body) has not been preceeded by a pre-scan. Is this correct?

I implemented it (using a boolean member 'joined' of body, which defaults to false and is set to true if body get's reverse_join()ed) and it does work. If verified, perhaps this example could be added to the examples for parallel_scan() in the docu (would have saved me a lot of trouble)

jimdempseyatthecove · ‎06-18-2013

Try something like this:

[cpp]
tmp = Idx
atomic<int> i_y;
atomic<int> i_f;
i_y = 0; // initialize before parallel_invoke
i_x = 0;
parallel_invoke(
[](){
    for(; i_y < n; ++i_y)
    {
      while(i_y == i_f)
        _mm_pause();
      y[i_y] = tmp;
    }},
[](){
    for(; i_f < n; ++i_f)
    {
      f[i_f] = func(i_f); // func(i_f) is expensive and touching f[i_f] may be expensive, too
      while(i_y < i_f)
        _mm_pause();
      tmp = tmp x f[i_f];   // x is some associative operation
    }});
[/cpp]

Jim Dempsey

jimdempseyatthecove · ‎06-18-2013

Alternet method:

[cpp]
tmp = Idx
atomic<int> i_y;
atomic<int> i_f;
i_y = 0; // initialize before parallel_invoke
i_x = 0;
parallel_invoke(
[](){
    y[i_y++] = tmp; // y[0]
    for(; i_y < n; ++i_y)
    {
      while(i_y == i_f)
        _mm_pause();
      tmp = tmp x f;   // x is some associative operation
      y[i_y] = tmp;
    }},
[](){
    for(; i_f < n; ++i_f)
    {
      f[i_f] = func(i_f); // func(i_f) is expensive and touching f[i_f] may be expensive, too
      }});
[/cpp]

Jim Dempsey

RafSchietekat · ‎06-18-2013

Walter D. wrote:

I just figured that it may work after all. The required information seems available via the body::reverse_join(). A final scan which was not pre-ceeded by its body beeing reverse_join()ed (by another body) has not been preceeded by a pre-scan. Is this correct?

I implemented it (using a boolean member 'joined' of body, which defaults to false and is set to true if body get's reverse_join()ed) and it does work. If verified, perhaps this example could be added to the examples for parallel_scan() in the docu (would have saved me a lot of trouble)

The scan operator has a parameter than is either pre_scan_tag or final_scan_tag, so there's no need to guess. If func is expensive, you should consider caching its result between pre-scan and final scan.

jimdempseyatthecove · ‎06-18-2013

On the two suggestions I made, they both require the availability of two threads. To be safe you should add the equivilent of a spinwait, such that the active thread can fall back to single thread mode.

Jim Dempsey

Walter_D_ · ‎06-18-2013

Jim: I want to avoid an atomic index and run efficiently on more than 2 threads. tbb::parallel_scan() obtains a scaling of 50% for large n_proc, i.e. takes 2/n_proc as long as serial code.

Raf: you completely missed the whole point. Perhaps it would be useful if you read posts more carefully.

Of course I'm caching f=func(i) as in the serial code. But this is more complicated than you thought: it cannot simply by done in the pre-scan. As explained in my original post, the reason is that for some i no pre-scan is made. These are those at the very begin and very end of the total range. For these, the call f=func(i) must be done in the final (and only) scan. So, obviously, the information pre-scan vs. final scan is not sufficient. What is required in addition is to know whether a final scan was preceeded by a pre-scan or not. I found a way to obtain that information. See also the attached code.

RafSchietekat · ‎06-18-2013

I absolutely did miss the point. To avoid offending again, I promise never to answer to your questions anymore.

jimdempseyatthecove · ‎06-18-2013

Actually you do not need to make these variables atomic. Just declare them with volatile.
atomic was used due to the fact that people on this forum are adverse to using volatile.

Your original post had

f = func(i); // func(i) is expensive and touching f may be expensive, too

An (2) atomic++ should be irrelivant. The volatile might take a couple clock ticks extra per iteration.

Jim Dempsey

RafSchietekat · ‎06-19-2013

For clarity, the flags should be kept separate, probably as an array of bytes, one for every 100 or 1000 elements or so.

About volatile not being more efficient than atomic: that could also be read as atomics not being more expensive than volatile (with the right precautions about avoiding unnecessary read-modify-write). I think there should be further semantics "notrans" to avoid the cost of read-modify-write atomicity where it is not needed, without having to resort to spelling it all out.

(Correction) Well, that would be orthogonal to memory semantics, you might still want to release.

RafSchietekat · ‎06-19-2013

After a private message, I've had another look. Both workarounds I would think of have been addressed. The one with the flags could be enhanced for less overhead by grouping a block of elements under one flag, and enforcing alignment by iterating over blocks instead of over individual elements; flags can be initialised in bulk using memset() (even atomics, if needed, preferably only before initiating concurrent access).

I'm not confident that the proposed rule would be generally applicable: what if a Body does a reverse join, a final scan of a range that had a prescan, and then proceeds off to the right? One would have to prove that this is fundamentally impossible, or ask Intel never to change the current algorithm if it happens to behave like that, because I don't remember seeing any guarantee about that, and if it were that reliable, why wouldn't there be an only_scan_tag already? This may have been discussed before, perhaps in a thread with the participation of TBB's original architect, but I don't remember all the details, and I can only hope that I might have remembered if a solid solution was suggested; still, you might want to try searching for that thread, which at least resulted in the addition of some nice graphics for better understanding of the algorithm (both its mechanism and its range applicability).

I have to disagree about using volatile where an atomic should be used: there's no gain in performance (if you don't need atomic ++ you can split up the operation yourself), and it relies on implementation details that are not technically guaranteed (even if the increment does not have to be atomic, at least the load and store operations should be). But I tend to be pedantic about such matters.

Walter_D_ · ‎06-20-2013

I'm not confident that the proposed rule would be generally applicable: what if a Body does a reverse join, a final scan of a range that had a prescan, and then proceeds off to the right?

No problem, since the body has not been reverse_join()ed by another.

why wouldn't there be an only_scan_tag already?

Because the need for such a tag was not recognised. For a typical scan operation, there is no need to distinguish final scans preceeded or not by a pre-scan. In fact, it is not possible, with the current implementation of the algorithm (using compile time information passing), to provide the only_scan information in such a way that code which doesn't use that information would continue to work.

RafSchietekat · ‎06-20-2013

I don't know the facts well enough to do more than argue about plausibility. If you want to use this, you'll have to prove it, and know whether it depends on essential properties or just on current implementation. If you can communicate that proof, others may be able to benefit, and perhaps it can be incorporated in the documentation. I'm not going to compete to be the first to provide such proof, but maybe if you get stuck I might be tempted to have a go at it myself.

jimdempseyatthecove · ‎06-20-2013

Walter,

Have you put together a working benchmark with dummy type (T) and functor (Func) representative of your array sizes and computational loads? The code should have the serial version and parallel version using the same input data but generating separate output data which could then have an untimed verification pass. The Type you supply should be a mock-up of, but also similar complexity of, and size of the objects you intend to use. This can be dummy code but functional enough to support the operators used in your struct body (=, +=, [], ...).

If so would you mind uploading the program?

RafSchietekat · ‎06-20-2013

While looking at the pretty picture for how parallel_scan operates, I was wondering whether it would also be possible to provide better affinity between the two phases, because clearly they are mostly if not always executed by separate Body instances, and presumably a Body instance now stays with the same thread during its lifetime. The parallel-scan implementation reminds me of how complexity can be added to separately predict carry information out of a group of binary digits during an addition, which improves latency for arriving at the result. In this case the circuitry is predetermined and an algorithm has to be devised to make the best use of what already exists. Maybe affinity is just mutually exclusive with that goal, maybe it isn't, but this should be explored.

It would also be interesting to have a survey of all the possible use cases. Here we have one where a guarantee of either only_scan_tag or pre_scan_tag+final_scan_tag is rather crucial, and even one of just always pre_scan_tag+final_scan_tag would be preferable to final_scan_tag or pre_scan_tag+final_scan_tag (does parallel_scan at least provide that guarantee, i.e., no element to be visited more often than during one possible pre-scan and one mandatory final scan?). Perhaps this information can be extracted out of the current implementation, as Walter surmised, or otherwise another trick can be used behind the scenes to implement a new operation with just that guarantee, so that there is no need to mix those concerns in an application.

Any ideas about this?

jimdempseyatthecove · ‎06-21-2013

I have rewritten your struct body, had to make some edits to get it to work. The test program shows a descrepancy in the results.

Walter, Raf, can you please check the code for errors.

Changes:

T is provide as type double
arrays f and y are now fTTB and and yTBB, and fSerial and ySerial
The argument to the supplied function changed from int to double*
The two calls functor args changed from int to &f
The "expensive and touching f" function is a delay loop returning a simple modification of f
(you can alter the contents for delay time or computational overhead)

The test program allocates arrays, initializes the TBB and Serial arrays to same value,
Then performs (assumed) same process.
Timing is made, output arrays are verified, output sum is shown

Tests indicate errors along the line of what Raf expressed as his concernes.

I believe at issue here is the partitions TBB starts with all begin with sum=0.0, rather than sum from prior partition.
The output arrays generated for each TBB partition will start with this sum=0.0 as opposed to the resultant sum from the equivilent of the prior partition in the serial run. Note, although the TBB reduction operation accumulates each partition's sum, the sum's are incorrect.

Jim Dempsey

jimdempseyatthecove · ‎06-22-2013

While we are waiting for Walter (or Raf) to comment/fix the failing TBB implementation of the derivative of Walter's program, I thought this would be an interesting problem to address with the QuickThread Parallel Manifold. The program is attached. You will be able to see the coding, it is a little less intrusive that tbb::pipeline and qt::parallel_pipeline (neither used in test program). You won't be able to run it without downloading the QuickThread code from my website.

Command window output:

[plain]
N=5000000

Serial test
TBB test
QuickThread test

TBB Scale: 4.909084
resultSerial = 653402.120624, resultTBB = 668309.625625
ySerial != yTBB
fSerial != fTBB

QuickThread Scale: 6.786223
resultSerial = 653402.120624, resultQuickThread = 653402.120624
[/plain]

The test was run on Core i7 2600K, 4 core with HT

Note, the TBB test failed to produce the same results as the serial code (waiting for fix from Walter or Raf)
The runtime for the TBB test would be representative of what the fixed code may perform using this program construct in TBB. TBB pipeline should be written using a similar technique as I did with two-stage Parallel Manifold. The TBB pipeline may then produce correct output.

The QuickThread Parallel Manifold produces the same output as the serial run and is 6.786x faster than serial and ~38% faster than TBB.
This was one run with the same synthetic load function for all tests. The actual performance differences will vary with this function (the synthetic load function stalls for 10,000 _rdtc() clock ticks).

Jim Dempsey

RafSchietekat · ‎06-22-2013

I would have preferred that program not to have been so very needlessly Windows-specific, because I had to first make it portable before being able to try it out for myself (I didn't see anything wrong with the TBB side of it from just looking at the code, well, no problem that would make it fail to compute the running sum, anyway). I had no interest in timing, so I just took out those statements.

From the Reference Manual: "Operations such as floating-point addition that are somewhat associative can be used, with the understanding that the results may be rounded differently depending upon the association used by parallel_scan. The reassociation may differ between runs even on the same machine. However, if there are no worker threads available, execution associates identically to the serial form shown at the beginning of this section."

I didn't analyse how unstable the calculation really is, but "resultSerial = 430605.730309, resultTBB = 426995.560837" still seems to be in the same ballpark, so TBB appears to be working as it should. Still, those excessively demanding comparisons were even failing when I ran the program with one thread and got "resultSerial = 430605.730309, resultTBB = 430605.730309", but I didn't investigate why.

Walter_D_ · ‎06-23-2013

okay, I transcribed Jim's hack into a C++ program, see attached. I works correctly immediately (apart from round-off in the 14th significant digit; Jim: you confused the arguments y and f to the body for the parallel_scan()).

When issuing a wait in the function call (2nd argument to main), the scale factor is near to the number of processors, i.e. near ideal (using gcc 4.8.1 or clang 3.2) on a i7 (4 cores) intel chip.

Raf, I thought a little about prooving that my implementation works, but with no result yet (though I have never yet found an error either).

jimdempseyatthecove · ‎06-23-2013

Raf,

I added a TBB pipeline.

*** I tried to construct it for two stages, had issues with trying to get initialized void* comming into the input stage (not part of TBB design). Maybe you will have better luck. two stage should operate faster (as indicated by my two-stage manifold runs).

[plain]
N=5000000

Serial test
TBB test
TBB pipeline test
QuickThread test

TBB Scale: 4.991204
resultSerial = 653402.120624, resultTBB = 668309.625625
ySerial != yTBB
fSerial != fTBB

TBB pipeline Scale: 2.923964
resultSerial = 653402.120624, resultTBBpipeline = 653402.120624

QuickThread Scale: 6.653138
resultSerial = 653402.120624, resultQuickThread = 653402.120624
[/plain]

Jim Dempsey

RafSchietekat · ‎06-23-2013

How do you get a pipeline... that must be one seriously expensive map!