Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Parallel Error Diffusion

Tudor
New Contributor I
481 Views
Hello there,

My name is Tudor, I am from Romania, and I am new to these forums.
I am also quite new to the paradigm of parallel and multi-core programming (started studying about it one year ago).
In order to get a good start, I purchased "Multi-Core Programming" from Intel Press (an excellent book, btw) and found the first implementation challenge: an error diffusion algorithm (page 42 for whoever owns the book; for whoever doesn't, Intel has conveniently released an excerpt with that part at http://www.intel.com/intelpress/excerpts/mcp1.htm).
I have attempted to implement the parallel algorithm as the book suggests, but have come across a problem: if thread A is processing the line directly above thread B, thread B waits until thread A has processed the first 2 pixels of its line. After this, thread B can start to process its line, then thread C, and so on. The problem is that, since threads get preempted, nobody guarantees that thread B won't get ahead of thread A at some point and process pixels it shouldn't (i.e. it doesn't lag behind A as it should). So I have tried to implement a system to work around the preemption, by preventing a thread from advancing if its progress (number of pixels already processed) is less by 1 than the progress of the previous thread. However, the algorithm still fails at some point and I get different results compared to the sequential version, even though my console output says that the threads work correctly.

Sorry for the long rant. Here is the method I have implemented (C#):

[c-sharp]private static int[][] ParallelErrorDiffusion(int[][] input, int height, int width)
{
    int k;
    int[] progress = new int[height];

    Thread[] threads = new Thread[cpus];
    AutoResetEvent[] events = new AutoResetEvent[height];

    for (k = 0; k < height; k++)
    {
        events[k] = new AutoResetEvent(false);
    }

    int[][] output = new int[height][];
    for (k = 0; k < height; k++)
    {
        output[k] = new int[width];
    }

    for (k = 0; k < cpus; k++)
    {
        threads[k] = new Thread(
            param =>
            {
                int index = (int)param;

                for (int i = index; i < height; i += cpus)
                {
                    if (i > 0)
                    {
                        events[i].WaitOne();
                    }

                    for (int j = 0; j < width; j++)
                    {
                        if (i > 0 && j + 1 < width)
                        {
                            while (progress[i] + 1 == progress[i - 1])
                            {
                                Thread.Sleep(0);
                            }
                        }

                        if (input[i][j] < 128)
                        {
                            output[i][j] = 0;
                        }
                        else
                        {
                            output[i][j] = 1;
                        }

                        int error = input[i][j] - 255 * output[i][j];

                        if (j + 1 < width)
                        {
                            input[i][j + 1] += error * 7 / 16;
                        }

                        if (i + 1 < height && j - 1 > -1)
                        {
                            input[i + 1][j - 1] += error * 3 / 16;
                        }

                        if (i + 1 < height)
                        {
                            input[i + 1][j] += error * 5 / 16;
                        }

                        if (i + 1 < height && j + 1 < width)
                        {
                            input[i + 1][j + 1] += error * 1 / 16;
                        }

                        progress[i] = j + 1;

                        if (j == 2 && i + 1 < height)
                        {
                            events[i + 1].Set();
                        }
                    }
                }
            });

        threads[k].Start(k);
    }

    for (k = 0; k < cpus; k++)
    {
        threads[k].Join();
    }

    return output;
}
[/c-sharp]

Basically, I create a number of threads equal to twice the number of available CPUs (the variable name cpus is misleading, as it is actually equal to 2 * Environment.ProcessorCount). Each thread method gets as its parameter the starting line i and will process lines i, i + cpus, i + 2*cpus and so on.
Sorry for the long post, hope you guys understand what I wanted to say. Hope someone with more experience than me can figure out what I have done wrong. Thanks a lot.

Tudor
10 Replies
ClayB
New Contributor I
Tudor -

I don't share your evaluation of the book, and this example is one of the reasons. The description given for the parallelization of the error diffusion code is an example of a wavefront algorithm. There's a lot of hand-waving going on with some nice pictures, IMO, but no real details. As you've pointed out, without extreme coordination and synchronization (like lockstep execution), there is no way to ensure that a "wave" behind the current thread won't overtake the point of computation and generate erroneous results.

An example of how to set up the synchronization to do a correct wavefront with threads is given in Example 8-2 (along with two and a half pages of algorithmic analysis) in The Art of Concurrency. The algorithm is for Bubblesort, with threads proceeding in waves down the list of elements to be sorted and comparing keys of adjacent elements. To make sure no thread overtakes a thread ahead of it in the list, "zones" of computation are set up and entry is done through a mutex object. When a thread comes to the next zone, it must acquire that zone's mutex before it is allowed to enter. Once it has acquired that mutex, it releases the mutex of the current zone (to allow the trailing thread to enter the just-traversed zone). You could try something like that to make sure your threads don't "lap" each other and ruin the results.
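To make that concrete, here's a minimal sketch of the hand-over-hand zone locking (in Java rather than your C#; `ZoneWavefront` and its zone/thread counts are made-up names for illustration, not the code from the book). Each thread holds the lock of the zone it is working in and grabs the next zone's lock before releasing the current one, so a trailing wave can never lap the one ahead of it:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

public class ZoneWavefront {
    static final int ZONES = 8;      // computation zones along the data
    static final int THREADS = 4;    // waves sweeping through the zones

    // one mutex per zone
    static final ReentrantLock[] zone = new ReentrantLock[ZONES];
    // per-zone log of which thread entered (guarded by that zone's lock)
    static final List<List<Integer>> entries = new ArrayList<>();

    public static void main(String[] args) throws InterruptedException {
        entries.clear();
        for (int z = 0; z < ZONES; z++) {
            zone[z] = new ReentrantLock();
            entries.add(new ArrayList<>());
        }

        Thread[] threads = new Thread[THREADS];
        for (int k = 0; k < THREADS; k++) {
            final int id = k;
            threads[k] = new Thread(() -> {
                zone[0].lock();                 // enter the first zone
                for (int z = 0; z < ZONES; z++) {
                    entries.get(z).add(id);     // "work" done inside zone z
                    if (z + 1 < ZONES) {
                        zone[z + 1].lock();     // acquire the next zone first...
                    }
                    zone[z].unlock();           // ...then open this one to the trailing thread
                }
            });
            threads[k].start();
            Thread.sleep(10);  // stagger the starts so the waves enter in order
        }
        for (Thread t : threads) {
            t.join();
        }

        // No wave ever overtook another: every zone saw the threads in the same order.
        for (int z = 1; z < ZONES; z++) {
            if (!entries.get(z).equals(entries.get(0))) {
                throw new AssertionError("a thread lapped another at zone " + z);
            }
        }
        System.out.println("zone entry order everywhere: " + entries.get(0));
    }
}
```

With real work inside each zone, the zone size controls the overhead: bigger zones mean fewer lock handoffs, but a longer wait before the trailing thread may enter.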

Of course, with all that synchronization overhead going on, you may not get much in the way of performance.

I was very disappointed that the authors of Multi-Core Programming used an extremely complicated example without, apparently, implementing the solution or warning of the problems that are inherent in such a decomposition.

--clay
Tudor
New Contributor I
Thanks for the reply, Dr. Clay. Your idea is very interesting and I will try to implement it as soon as I have some time.
In the meantime, I have fixed my own implementation. :D
Line 36 should now read:

[c-sharp]if (i > 0 && j + 2 < width)
{
    while (progress[i] + 2 == progress[i - 1])
    {
        Thread.Sleep(0);
    }
}
[/c-sharp]

So, the next thread should wait 2 pixels behind the one above it (duh! that was the initial release condition anyway!).
Now the program produces correct results and is about 30% faster than the sequential version for a 5000x5000 picture.
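To convince myself the 2-pixel lag really is the right release condition, I took the threads out of the picture: pixel (i, j) only depends on (i, j - 1) in its own row and on (i - 1, j - 1..j + 1) in the row above, so any schedule in which row i trails row i - 1 by two pixels must reproduce the sequential result exactly. Here is a small single-threaded sketch of that schedule (in Java so it stands alone; the class and helper names are made up):

```java
import java.util.Arrays;

public class WavefrontOrderCheck {
    static final int H = 16, W = 20;

    // A deterministic fake greyscale image to diffuse.
    static int[][] makeImage() {
        int[][] img = new int[H][W];
        for (int i = 0; i < H; i++)
            for (int j = 0; j < W; j++)
                img[i][j] = (i * 37 + j * 91) % 256;
        return img;
    }

    // One Floyd-Steinberg step on pixel (i, j): threshold, then push the
    // error to the usual 7/16, 3/16, 5/16, 1/16 neighbours.
    static void step(int[][] in, int[][] out, int i, int j) {
        out[i][j] = in[i][j] < 128 ? 0 : 1;
        int error = in[i][j] - 255 * out[i][j];
        if (j + 1 < W)               in[i][j + 1]     += error * 7 / 16;
        if (i + 1 < H && j - 1 > -1) in[i + 1][j - 1] += error * 3 / 16;
        if (i + 1 < H)               in[i + 1][j]     += error * 5 / 16;
        if (i + 1 < H && j + 1 < W)  in[i + 1][j + 1] += error * 1 / 16;
    }

    // True if the two-pixel-lag wavefront schedule reproduces the
    // sequential row-major result.
    static boolean schedulesAgree() {
        // Sequential reference: plain row-major order.
        int[][] a = makeImage();
        int[][] seq = new int[H][W];
        for (int i = 0; i < H; i++)
            for (int j = 0; j < W; j++)
                step(a, seq, i, j);

        // Wavefront order: pixel (i, j) runs at time t = j + 2 * i, so each
        // row trails the row above by exactly two pixels -- the release
        // condition the threads enforce.
        int[][] b = makeImage();
        int[][] wave = new int[H][W];
        for (int t = 0; t <= (W - 1) + 2 * (H - 1); t++)
            for (int i = 0; i < H; i++) {
                int j = t - 2 * i;
                if (j >= 0 && j < W)
                    step(b, wave, i, j);
            }

        return Arrays.deepEquals(seq, wave);
    }

    public static void main(String[] args) {
        if (!schedulesAgree())
            throw new AssertionError("wavefront order changed the result");
        System.out.println("wavefront schedule matches the sequential result");
    }
}
```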
jimdempseyatthecove
Honored Contributor III

You might want to consider using >= in place of ==, as you may have an initial-pass issue as well as a falling-behind issue.

Jim Dempsey
Tudor
New Contributor I

You might want to consider using >= in place of ==, as you may have an initial-pass issue as well as a falling-behind issue.

Jim Dempsey

An initial pass cannot happen, because thread i waits on an event until thread i - 1 signals it to start. Or did I misunderstand what you said?
jimdempseyatthecove
Honored Contributor III

Tudor,

What happens to performance when you set the number of threads to the number of available hardware threads?
With nThreads = 2*nHWthreads, you have more software threads than hardware threads, so every line you advance (in lock step, by i += cpus) will incur a thread context switch (twice).

This may be bad or it may be good (it depends on the environment and the length of time it takes to run through a line). In a system that is loaded with other applications, the additional threads may (emphasis on may) be able to do some work when one thread gets jammed up waiting for another thread to advance (one that was preempted so some other app's thread could run). But this comes at the cost of unnecessary thread context switching (rather expensive when you figure in all those registers in a context now).

Jim Dempsey


Tudor
New Contributor I

Tudor,

What happens to performance when you set the number of threads to the number of available hardware threads?

Jim Dempsey



According to my tests, performance stays roughly the same.
I am running the program on a Core 2 Duo T5250 at 1.5 GHz.
I assume that, given the fact that besides my 2 or 4 threads there are also about 870 other threads running in the system (Vista Business 64-bit), it doesn't make much difference?

Edit: I've tested the program on a Pentium 4 at 3 GHz with HT (Win XP SP3) and the parallel version runs about 10% slower than the sequential version there. Any ideas why this is happening?
jimdempseyatthecove
Honored Contributor III

Under the best of conditions, a one-core HT system (P4) executing an FPU-intensive program might yield a 10%-20% improvement. This is due to both threads in the HT set sharing a single floating-point unit (and a single SSE FP unit). Integer execution units are separate on HT siblings. When one thread in the HT set is stalled on a memory fetch/store, the other HT sibling might get better access to the FPU/SSE.

This 10%-20% improvement is with long-running parallel loops. When your parallel loops are shorter, the overhead to start/stop the threads will quickly eat into that 10%-20%.

Jim Dempsey
Tudor
New Contributor I

Under the best of conditions, a one-core HT system (P4) executing an FPU-intensive program might yield a 10%-20% improvement. This is due to both threads in the HT set sharing a single floating-point unit (and a single SSE FP unit). Integer execution units are separate on HT siblings. When one thread in the HT set is stalled on a memory fetch/store, the other HT sibling might get better access to the FPU/SSE.

This 10%-20% improvement is with long-running parallel loops. When your parallel loops are shorter, the overhead to start/stop the threads will quickly eat into that 10%-20%.

Jim Dempsey

Thank you for the info. This forum is very helpful. :)
I was able to squeeze an additional 10% improvement on my dual core by setting process priority to real time and enabling priority boost (properties of the Process class in C#).

A few tests with the algorithm I have posted above:

3000 x 2000
sequential: 1120 ms
parallel: 780 ms

5000 x 5000
sequential: 4446 ms
parallel: 3229 ms

5000 x 7000
sequential: 6224 ms
parallel: 4539 ms

7000 x 7000
sequential: 8860 ms
parallel: 6333 ms

I cannot run tests on larger arrays because of insufficient RAM to store all the arrays. I am considering purchasing a quad-core desktop system with 4 GB of RAM for the sole purpose of testing my multicore apps. Is it worth it? I am really interested in learning more about multicore programming and the first step will be to do my bachelor project on this topic.
jimdempseyatthecove
Honored Contributor III

I would suggest you get a Core i7 4-core with HT. If you are tight on budget, consider the Core i7 920 as opposed to the Q6600. I have a Q6600 (4 cores, no HT); I've had it for some time and it is good, but the Core i7 920, which is available now, is internally faster and has HT, which makes it much better. The benefit of having HT is not so much for you to get performance, but as a learning tool for how best to program systems with and without HT. A Q6600 would not provide the HT experience. I am not sure where you are located; in the U.S. you can mail-order a Q6600 for $200 and a Core i7 920 for $280 (motherboard and RAM are a separate issue).
The $80 difference gets you a lot of bang for your buck (good return of value for little incremental cost).

Jim Dempsey
Tudor
New Contributor I
Tudor -

An example of how to set up the synchronization to do a correct wavefront with threads is given in Example 8-2 (along with two and a half pages of algorithmic analysis) in The Art of Concurrency. The algorithm is for Bubblesort, with threads proceeding in waves down the list of elements to be sorted and comparing keys of adjacent elements. To make sure no thread overtakes a thread ahead of it in the list, "zones" of computation are set up and entry is done through a mutex object. When a thread comes to the next zone, it must acquire that zone's mutex before it is allowed to enter. Once it has acquired that mutex, it releases the mutex of the current zone (to allow the trailing thread to enter the just-traversed zone). You could try something like that to make sure your threads don't "lap" each other and ruin the results.

--clay

Dr. Clay, I have to admit, this idea is brilliant. I have just implemented it on the HT machine and it blows my version away by 20%. It also runs faster than the sequential version on the HT machine, whereas mine runs slower. When I get home I'm also going to test it on the Core 2 Duo machine.

Edit: 60% - 65% improvement on the core 2 duo compared to the sequential version! This is fantastic!

With large enough computation zones, intense context switching is avoided, so performance is dramatically increased.