Hi everybody!
I have a simple algorithm to print the very first unique ASCII character from a stream. In the worst case the algorithm scans through the stream twice, doing some comparisons and increments. So, roughly, my algorithm is O(n).
The algorithm[0] does two main things:
- It scans through all the characters in the stream, counting the number of occurrences of each with the help of an array. The character itself is used as the index into this array, since ASCII characters are numbers from 0 to 127 (or 255, if we consider the extended version);
- After having filled the array that relates each character to how many times it has appeared, the algorithm scans through the stream again, but this time it checks whether the count for the current character is equal to 1: if it is, then this is the first unique character; if not, it continues the search until the end of the stream is reached. (A minimal sketch of this two-pass approach follows the list.)
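For reference, here is a minimal sketch of those two passes, assuming the stream is a file read with stdio and int counters (my attachment is the actual code and may differ in details such as the table's element type):

#include <stdio.h>

#define ASCII_SIZE 128

// Pass 1: count occurrences of each character.
// Pass 2: rescan and report the first character whose count is 1.
int firstUnique(const char *fileName)
{
    int asciiCountTable[ASCII_SIZE] = {0};
    FILE *fp = fopen(fileName, "rb");
    if (!fp) return EOF;

    int c;
    while ((c = fgetc(fp)) != EOF)   // first pass: count occurrences
        ++asciiCountTable[c];

    rewind(fp);                      // second pass: find first count == 1
    while ((c = fgetc(fp)) != EOF)
        if (asciiCountTable[c] == 1)
            break;

    fclose(fp);
    return c;                        // EOF if no unique character exists
}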
The first task runs[1] in linear time: a1n + b1. So does the second task: a2n + b2. Roughly, the whole algorithm runs in a3n + b3, where a3 = a1 + a2 and b3 = b1 + b2.
My idea[2] is to try to improve performance by splitting the first task in two, using the parallelism provided by Cilk Plus: I would have one half of my stream read by one function[3], f1(), and the other half read by f2() at the same time. With this approach I would expect[4] the first task to execute, roughly, in (a1n + b1)/2, so that I could improve the overall performance of my algorithm, even if only a little.
I would like to know if you think it is possible to accomplish that using Cilk Plus. If it is, I would like to try it even if the speedup turned out to be negligible.
Cheers,
Roní.
===========================================================================================================
[0] I have attached my original code just in case I have not been clear enough in the explanation.
[1] I'm just a student and I'm not really experienced with calculating running times... However, I think I am superficially right about those claims.
[2] Again, the idea is simple and superficial. It has no rigor, for now.
[3] These two functions would operate on two different arrays to avoid contention, and in the end the arrays would be joined. I hope the time consumed by joining the two arrays is small enough to make the splitting feasible (a rough sketch of this idea follows these notes).
[4] I know the improvement would be less than 2x because of many other factors. It is just loose reasoning, I think.
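To make footnote [3] concrete, here is a rough, untested sketch of the splitting I have in mind, assuming the stream is a file whose size n is known (function and variable names are just placeholders):

#include <cilk/cilk.h>
#include <stdio.h>

#define ASCII_SIZE 128

// Count the characters in fileName from index "begin" (inclusive)
// to "end" (exclusive) into "count".
void countRange(const char *fileName, long begin, long end, int count[ASCII_SIZE])
{
    FILE *fp = fopen(fileName, "rb");
    fseek(fp, begin, SEEK_SET);
    int c;
    for (long i = begin; i < end && (c = fgetc(fp)) != EOF; ++i)
        ++count[c];
    fclose(fp);
}

void countParallel(const char *fileName, long n, int total[ASCII_SIZE])
{
    int half1[ASCII_SIZE] = {0}, half2[ASCII_SIZE] = {0};

    // f1() and f2() from the text are the same routine applied
    // to the two halves of the stream, running concurrently.
    cilk_spawn countRange(fileName, 0, n / 2, half1);
    countRange(fileName, n / 2, n, half2);
    cilk_sync;

    // Joining the two arrays: only ASCII_SIZE additions, hopefully cheap.
    for (int c = 0; c < ASCII_SIZE; ++c)
        total[c] = half1[c] + half2[c];
}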
===========================================================================================================
Hi,
If the problem size is known, as you mentioned ("half" of the stream), it should be easy enough to make two Cilk workers (or N workers) each process one half (chunk) of the stream. Another Cilk feature you may want to use is the reducer: a handy object for handling a reduction such as the one in your main computation (the counter increments); with it, you don't have to use two separate arrays and join them at the end. A naive example code excerpt follows.
#include <cilk/cilk.h>
#include <cilk/reducer_opadd.h>
...
// Uses an array of additive reducer objects
cilk::reducer< cilk::op_add<int> > counter[ASCII_SIZE];
...
// Process a part of the character stream defined by "offset" and "size"
void count_parallel(char *fileName, int offset, int size)
{
    FILE *fp = fopen(fileName, "rb");   // Open the file...
    fseek(fp, offset, SEEK_SET);        // ...and seek to "offset"
    int c;
    for (int i = 0; i < size && (c = fgetc(fp)) != EOF; ++i)
        *counter[c] += 1;               // Increment the counter indexed by "c"
    fclose(fp);                         // Close the file
}

int main(...)
{
    ...
    // Assume the problem size "N" was detected somehow (big assumption!)
    // CHUNK: size of character stream processed by each worker at a time
    cilk_for (int k = 0; k < N; k += CHUNK) {
        count_parallel(fileName, k, CHUNK);
    }
    ...
    // Accessing a reducer's value
    if (counter[c].get_value() == 1) ...
    ...
}
As commented in the code, it is a big assumption that the problem size is known, since finding it would seem to require another round of scanning (there might be more efficient ways to do this; one is sketched below). More details about reducer objects are available in the tutorial section of https://cilkplus.org. I did not measure the performance of this code, but the problem size would have to be big enough to compensate for the parallelization overhead.
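Regarding that big assumption: if the stream is a regular file, N can be obtained without any scan by asking for the file size. A minimal sketch with stdio (fstat or std::filesystem::file_size would do as well):

#include <stdio.h>

// Returns the size in bytes of a regular file, or -1 on error.
long file_size(const char *fileName)
{
    FILE *fp = fopen(fileName, "rb");
    if (!fp) return -1;
    fseek(fp, 0, SEEK_END);   // jump to the end of the file...
    long n = ftell(fp);       // ...and read off the position
    fclose(fp);
    return n;
}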
===========================================================================================================
Just so you don't get your hopes up :-), I suspect that it's going to be tough to get parallel speedup using any technique, because reading the file will dominate the time. My experience has been that reading sections of a file in parallel is slower than reading the file serially, probably because of the expense of random seeks versus serial streaming. Of course with an SSD or parallel file system it might be different.
The elements of asciiCountTable can count up to only 127 (or maybe 255, depending on whether "char" is unsigned) before wrapping around. For files of any significant length, the elements need to be some other type, such as int, or the counting needs to be done differently. You really only need three values in the counter: 0, 1, and "big" (see the snippet below).
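To illustrate that last point, a saturating increment would keep even a char-sized table correct, since every count of 2 or more means the same thing. A sketch, reusing the asciiCountTable name:

// Clamp counts at 2: 0 = unseen, 1 = unique so far, 2 = "big".
// The table then never wraps around, whatever its element type,
// and the second pass's "== 1" test still works unchanged.
if (asciiCountTable[c] < 2)
    ++asciiCountTable[c];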
But since we're discussing parallelism, not necessarily practical speedup, here's a better way to attack the problem, one that takes only a single pass and is parallelizable. The trick is to think about reduction abstractly. Somehow we need to break the file into chunks, summarize each chunk, and then reduce the summaries to a single summary.
The summary to use is a table indexed by character; the entries, however, are not counts but instead denote one of three cases:
- The character has not been seen yet. I'll denote this as 0.
- The character has been seen once at index i. I'll denote this as "i".
- The character has been seen more than once. I'll denote this as "inf".
Now suppose we split the file into two chunks and analyze each chunk separately, resulting in a table for each chunk. Then we can merge the two tables into a single table by applying the following "x" operator to corresponding elements:
- 0 x 0 -> 0 (did not see the character)
- 0 x i -> i (saw the character once, and we know its position)
- i x 0 -> i (ditto)
- any other case -> inf (saw character at least twice)
Then at the end, you can poke through the summary to find the minimum i entry.
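To make the three states concrete before any bit tricks, here is a hedged sketch of one table entry and the "x" operator as a plain function (the type name Entry is invented for illustration):

// One table entry: seen == 0 (never), seen == 1 (once, at index pos),
// seen == 2 stands for "inf" (seen more than once; pos is meaningless).
struct Entry {
    int seen;   // 0, 1, or 2 ("inf")
    long pos;   // index of the single sighting when seen == 1
};

// The "x" operator from the list above, applied to one pair of entries.
Entry merge(Entry a, Entry b)
{
    if (a.seen == 0) return b;   // 0 x 0 -> 0, 0 x i -> i
    if (b.seen == 0) return a;   // i x 0 -> i (ditto)
    Entry r = { 2, 0 };          // any other case -> inf
    return r;
}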
The scans through each block need to be efficient. Suppose you have a 64-bit machine. You can declare the table as "unsigned long table[ASCII_SIZE]" and steal the upper two bits to mark the special cases.
- 0 can denote the 0 case.
- i | (1UL<<62) can denote the i case.
- Any word with its high bit (bit 63) set denotes "inf".
Then the block scan can update the table after reading a character c at stream index i using straight-line code like this:

table[c] = table[c] << 1 | (1UL << 62) | i;
The upper two bits are essentially acting as a unary counter that sticks instead of overflowing. The rest of the bits will be trash once it sticks, but by then we don't care. The earlier "x" operation almost reduces to a bitwise OR; the one subtlety is that OR-ing two "seen once" entries yields bit 62 set but not bit 63, so the merge still needs a check for that case. All of this can be done by writing a custom reducer in Cilk Plus (which, for time constraints, I'll leave to someone else; a sketch follows).
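For what it's worth, here is a minimal, untested sketch of such a reducer (the names Summary and FirstUniqueMonoid are invented, and the Cilk Plus reducer-library details are from memory). The merge spells out the "x" table with explicit comparisons, handling the once-times-once case noted above:

#include <cilk/reducer.h>
#include <cstring>

#define ASCII_SIZE 128

const unsigned long ONCE = 1UL << 62;   // marks "seen exactly once"
const unsigned long INF  = 1UL << 63;   // marks "seen more than once"

// Per-chunk summary: one 64-bit word per character, encoded as above.
struct Summary {
    unsigned long table[ASCII_SIZE];
    Summary() { std::memset(table, 0, sizeof table); }

    // Block scan: character c was read at stream index i.
    void note(int c, unsigned long i) { table[c] = table[c] << 1 | ONCE | i; }
};

// The monoid's reduce() applies the "x" operator elementwise.
struct FirstUniqueMonoid : public cilk::monoid_base<Summary> {
    static void reduce(Summary *left, Summary *right) {
        for (int c = 0; c < ASCII_SIZE; ++c) {
            unsigned long l = left->table[c], r = right->table[c];
            if (l == 0)                     // 0 x r -> r
                left->table[c] = r;
            else if (r != 0)                // seen in both chunks -> inf
                left->table[c] = INF;
            // else: l x 0 -> l, nothing to do
        }
    }
};

// Usage sketch: declare cilk::reducer<FirstUniqueMonoid> summary;
// each strand calls summary.view().note(c, i) for its part of the
// stream; at the end, scan summary.view().table for words with ONCE
// set and INF clear, and report the one with the smallest index bits.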
- Arch
