
## best way of convolution

Beginner
944 Views
Hi Folks,

I've read Intel's design patterns documentation for TBB, because I'm looking for the best way to implement a function that computes convolution very efficiently. Unfortunately, what is written there is not really clear to me, so I would like to ask you to help me out with more code and examples.
Section 3, "Elementwise", of that documentation shows the following code:
```[cpp]tbb::parallel_for(
    tbb::blocked_range<int>(0, xlen+clen-1, 1000),
    [=]( tbb::blocked_range<int> r ) {
        int end = r.end();
        for( int i=r.begin(); i!=end; ++i ) {
            float tmp = 0;
            for( int j=0; j<clen; ++j )
                tmp += c[j]*x[i-j];
            y[i] = tmp;
        }
    }
);[/cpp]```

My problem is simple: I just don't know how to match the above idea with convolution and TBB's conventions. This is what I've written:
```[cpp]// Type of the data is IplImage*; it is simplified here for the example.
//
void tbb_Convolution::operator()( tbb::blocked_range2d<int> &R ) const
{
    int weight[3][3] = {{-1, -1, -1}, {-1, 8, -1}, {-1, -1, -1}};
    for( int x = R.rows().begin(); x < R.rows().end()-1; x++ )
    {
        for( int y = R.cols().begin(); y < R.cols().end()-1; y++ )
        {
            int pixel_value = 0;
            for( int j = 0; j <= 2; j++ ) {
                for( int i = 0; i <= 2; i++ ) {
                    pixel_value += weight[j][i] * data[x+j-1][y+i-1];
                }
            }
            data[x][y] = pixel_value;
        }
    }
}[/cpp]```

Any help is appreciated; I think the code shows my problem better than anything I could write. Thanks in advance!
11 Replies
Valued Contributor III
```[cpp]int pixel_value = 0;
for( int i = -1; i <= 1; ++i ) {
    for( int j = -1; j <= 1; ++j ) {
        pixel_value += weight[x-i][y-j] * data[i+1][j+1];
    }
}
newdata[x][y] = pixel_value;
[/cpp]```

Key ideas are that the destination must be different from the source in this naive implementation. I made some pedantic changes to make it more like a convolution perhaps, but with a symmetric pattern that shouldn't matter of course. Make sure to add protection for not reading outside the boundaries of the input.
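To make the two key ideas concrete (separate destination buffer, plus boundary protection), here is a serial sketch of the whole loop nest. It is not code from the thread; `convolve3x3`, `W`, and `H` are illustrative names, and clamp-to-edge is just one possible boundary policy:

```cpp
#include <algorithm>
#include <cassert>

// Out-of-place 3x3 convolution with clamp-to-edge boundary handling.
// data and newdata are row-major W x H images stored in separate buffers.
void convolve3x3(const int* data, int* newdata, int W, int H, const int weight[3][3]) {
    for (int x = 0; x < H; ++x) {
        for (int y = 0; y < W; ++y) {
            int pixel_value = 0;
            for (int i = -1; i <= 1; ++i) {
                for (int j = -1; j <= 1; ++j) {
                    int xs = std::min(std::max(x + i, 0), H - 1);  // clamp row index
                    int ys = std::min(std::max(y + j, 0), W - 1);  // clamp column index
                    pixel_value += data[xs * W + ys] * weight[i + 1][j + 1];
                }
            }
            newdata[x * W + y] = pixel_value;  // never overwrites the input
        }
    }
}
```

Because no output pixel is read as an input, the iterations are independent and the outer loops can later be handed to parallel_for without further changes.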

A next step could be to selectively overwrite the input, keeping only boundary points between subrange areas until postprocessing (which could also be scheduled with a parallel_reduce join, I suppose).

Does that make sense?

Beginner

It makes sense, of course. First let's use this naive implementation...
The goal of my question is to get an answer about how to put this design pattern to use in practice. What is the purpose of the above-mentioned paragraph in the documentation? It is meant to be really helpful, but unfortunately I can't turn it into practice. Or is there nothing in it worth reading for practical use?

In the meantime an idea occurred to me: what if I have a picture in memory, calculate which interval of the picture each core has to process, and let the cores work on those parts independently? A separate result buffer is also allocated in memory. Then start as many threads as we have cores. So every core gets the same input pointer, some space where it can write its results, and of course the offsets that determine its part of the picture.
This idea could realize a different design pattern, in which I could independently swap the algorithm applied to the picture: I could change the convolution to any kind of filter algorithm and still have the same parallelization pattern. I really would like to see a pattern like this!
Now, what is in your mind?
I might not be the greatest writer or presenter, so please ask when something is not clear!
Valued Contributor III
"I might not be the gratest writer or presenter, please ask when something is not clear!"
I'm afraid very little of it is. :-)

But betting against TBB's task-oriented approach is not generally a good idea, an exception being work like this that may benefit from the power of a GPU instead, with OpenCL a recommendable tool (or so I hear).
Employee
Quoting makos999
The goal of my question is to get an answer about how to put this design pattern to use in practice.

"Elementwise" is probably the simplest parallel pattern. Basically, it shows how to convert a serial loop with all independent iterations into TBB's parallel_for or parallel_do.

Honestly, I do not understand what kind of concerns you have. The code you wrote seems reasonable to me, except for using the same array for input and output, as Raf already noticed. Of course there should be a complete class definition and a call to parallel_for; is this what you have trouble with, or what?
Beginner
Alexey, you are right! I need some ideas and useful proof-of-concept examples; practically, what is the good way of thinking here?
I implemented the algorithm I mentioned before, and the running time is at least double that of the sequential version.
Could you give me design patterns to implement it effectively?
Employee
Quoting makos999
I implemented the algorithm I mentioned before, and the running time is at least double that of the sequential version.

Oh, you have a performance issue. Is it with or without Raf's suggestion to use a different array for the new data? Can you post more code, ideally a full reproducing example?
I believe the pattern is right here. Most likely, the bad performance has something to do with compiler optimizations not being applied to the parallel code. The forum has quite a few topics where similar issues were discussed.

Valued Contributor III
Note that in #1 I forgot to name the other of the "key ideas" (using x and y in one of the arrays), and that I accidentally interchanged the arrays (it should be "pixel_value += data[x-i][y-j] * weight[i+1][j+1];" instead at line 4). :-)
Beginner
Quoting makos999
I implemented the algorithm I mentioned before, and the running time is at least double that of the sequential version.

I hope this topic is still alive!

In the meantime I was reading and thinking about the problem and did some coding... So better to keep this topic going, hoping it gets some useful answers!
Now I don't use parallel_for() to perform the convolution; instead my program calculates ROIs on the picture, then starts threads (using tbb::empty_task) on the same picture, one per ROI.

I found the following discussion: http://software.intel.com/en-us/forums/showthread.php?t=64623&wapkw=(Architecture+Software+Developer)

Could that issue be related to ours? The picture in memory really is shared data.
The picture is read by OpenCV and split up using ROIs. In spite of using ROIs, the picture is still stored only once in memory. Thus it must be shared data.

My conclusion from the above discussion and from the measurements is that I should store the pictures redundantly. Is that desirable?

I don't mind using much more memory, but this is only possible for the input pictures. I have to keep storing the output picture in one place. I would not like to interleave the results, because moving memory around is expensive, isn't it?

Sorry for not giving code; in the next round I will post more if the above is still not enough to give an answer!
Employee

The general rule for shared-memory programming is that if the data is shared but read-only, keep only one copy of it, and do not use locks for it. If the data is modified, it is best to split it up into separate copies, let each task or thread modify its copy, and then merge the copies. For example, tbb::parallel_reduce and tbb::combinable offer two different ways to do this sort of split/merge. Which is best depends on context.

Whether moving data is cheap or expensive is relative to the cost of modifying the data. Sometimes moving or reorganizing the data enables performance improvements that easily pay for the cost of moving it.

Without more details on the algorithm, it's difficult to offer more specific advice.

Valued Contributor III
Just to be sure, because I didn't find any mention of "grain" in this thread except its use in the quoted example: try a few values of about 100x100 (even with auto_partitioner, to avoid too much parallel overhead if you have many cores), or, if you write to a different buffer for output and don't need to minimise the border length, consider 1x10000 instead to optimise cache locality, assuming row-major data layout (or just use a 1-dimensional blocked_range).
Beginner
The working data is only stored once, of course. The actual implementation now follows the pattern of the LargeTaskInMain example. Do you know this sample? It would help if you do...
The entire source is short and simple; you can find it here: http://pastebin.com/62KCwmns
I changed TwiddleThumbs() to my own implementation (convolution).

I made some measurements and found that the effect of parallelization is the same for a small (512x512) picture and a big (20Mp) one: the parallel version runs almost twice as long as the sequential one. Is there some thread overhead or what, even in the case of a 20Mp picture? This is when the convolution (both parallel and sequential) runs once, without any repetition.

The other test case puts the convolution (both versions) into a loop and executes it 5000 or 1000 times (the exact count does not matter, but at least 20 repetitions gave efficiency). Measuring the execution time of the loops and calculating average execution times, voila, we see a 92-95% parallel speed improvement. That would be perfect, but in real usage I would not execute the algorithm so many times in my program.
Hope, it is clear enough.

In my opinion, this is really what we need to talk about in this topic. And the following...

Furthermore, the design pattern that I would like to use does not allow modifying the original convolution class. This is the main problem. Consequently I just can't add the operator()( blocked_range2d<> ) and simply use parallel_{for, reduce, etc.}.

I would like to ask you for some ideas about the pattern and implementation, and about the reason for my measurements mentioned above.