I'm writing a function for efficiently blending two images with premultiplied alpha...
I've optimized it for SSE and tried to make it cache efficient... and now I'm trying to make it even faster by splitting the work between cores with TBB...
However, my code barely scales... it gets only 10-20% faster with 2 cores and scales even worse the more cores you add...
We do process several images at the same time... but there is a limit on how much we can do that, which is why I'd like to make the actual "processing" scale better...
I'd appreciate any suggestions on how I can improve this... I'm just using a parallel_for and an affinity_partitioner...
[cpp]#include <emmintrin.h>              // SSE2 intrinsics (also pulls in _mm_prefetch)
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

void TestPreMLerpSSETBB(unsigned char* pDest, unsigned char* pSrc, int count)
{
	//
	// D = D + S - (D*S)/255
	//
	class Blend
	{
	public:
		Blend(__m128i* dest, __m128i* source) : dest_(dest), source_(source)
		{}
		void operator()(const tbb::blocked_range<size_t>& r) const
		{
			// TODO: optimize prefetch scheduling distance
			static const size_t PSD = 256;

			// byte masks for the low/high half of each 16-bit lane
			const __m128i lomask = _mm_set1_epi32(0x00FF00FF);
			const __m128i himask = _mm_set1_epi32(0xFF00FF00);

			__m128i d = _mm_setzero_si128();
			__m128i s = _mm_setzero_si128();
			__m128i a = _mm_setzero_si128();
			__m128i rb = _mm_setzero_si128();
			__m128i ag = _mm_setzero_si128();

			// each range element covers four __m128i (loop unrolled 4x)
			for (size_t i = r.begin(); i != r.end(); ++i)
			{
				const size_t n = i*4;

				// UNROLL 1 //
				d = _mm_load_si128(&dest_[n]);
				s = _mm_load_si128(&source_[n]);
				// set alpha to lo16 from dest_
				rb = _mm_srli_epi32(d, 24);
				a = _mm_slli_epi32(rb, 16);
				a = _mm_or_si128(rb, a);
				// fix alpha a = a > 127 ? a+1 : a
				rb = _mm_srli_epi16(a, 7);
				a = _mm_add_epi16(a, rb);
				rb = _mm_and_si128(lomask, s);		// unpack
				rb = _mm_mullo_epi16(rb, a);		// mul (D*S)
				rb = _mm_srli_epi16(rb, 8);		// prepack and div [(D*S)]/255
				// TODO: check assembly for prefetch
				_mm_prefetch(reinterpret_cast<const char*>(&dest_[n+PSD]), _MM_HINT_NTA);	// prefetch between instructions
				ag = _mm_srli_epi16(s, 8);		// unpack
				ag = _mm_mullo_epi16(ag, a);		// mul (D*S)
				ag = _mm_and_si128(ag, himask);		// prepack and div [(D*S)]/255
				rb = _mm_or_si128(rb, ag);		// pack
				rb = _mm_sub_epi8(s, rb);		// sub S-[(D*S)/255]
				d = _mm_add_epi8(d, rb);		// add D+[S-(D*S)/255]
				_mm_store_si128(&dest_[n], d);

				// UNROLL 2 //
				d = _mm_load_si128(&dest_[n+1]);
				s = _mm_load_si128(&source_[n+1]);
				// set alpha to lo16 from dest_
				rb = _mm_srli_epi32(d, 24);
				a = _mm_slli_epi32(rb, 16);
				a = _mm_or_si128(rb, a);
				// fix alpha a = a > 127 ? a+1 : a
				rb = _mm_srli_epi16(a, 7);
				a = _mm_add_epi16(a, rb);
				rb = _mm_and_si128(lomask, s);		// unpack
				rb = _mm_mullo_epi16(rb, a);		// mul (D*S)
				rb = _mm_srli_epi16(rb, 8);		// prepack and div [(D*S)]/255
				ag = _mm_srli_epi16(s, 8);		// unpack
				ag = _mm_mullo_epi16(ag, a);		// mul (D*S)
				ag = _mm_and_si128(ag, himask);		// prepack and div [(D*S)]/255
				rb = _mm_or_si128(rb, ag);		// pack
				rb = _mm_sub_epi8(s, rb);		// sub S-[(D*S)/255]
				d = _mm_add_epi8(d, rb);		// add D+[S-(D*S)/255]
				_mm_store_si128(&dest_[n+1], d);

				// UNROLL 3 //
				d = _mm_load_si128(&dest_[n+2]);
				s = _mm_load_si128(&source_[n+2]);
				// set alpha to lo16 from dest_
				rb = _mm_srli_epi32(d, 24);
				a = _mm_slli_epi32(rb, 16);
				a = _mm_or_si128(rb, a);
				// fix alpha a = a > 127 ? a+1 : a
				rb = _mm_srli_epi16(a, 7);
				a = _mm_add_epi16(a, rb);
				rb = _mm_and_si128(lomask, s);		// unpack
				rb = _mm_mullo_epi16(rb, a);		// mul (D*S)
				rb = _mm_srli_epi16(rb, 8);		// prepack and div [(D*S)]/255
				_mm_prefetch(reinterpret_cast<const char*>(&source_[n+PSD]), _MM_HINT_NTA);	// prefetch between instructions
				ag = _mm_srli_epi16(s, 8);		// unpack
				ag = _mm_mullo_epi16(ag, a);		// mul (D*S)
				ag = _mm_and_si128(ag, himask);		// prepack and div [(D*S)]/255
				rb = _mm_or_si128(rb, ag);		// pack
				rb = _mm_sub_epi8(s, rb);		// sub S-[(D*S)/255]
				d = _mm_add_epi8(d, rb);		// add D+[S-(D*S)/255]
				_mm_store_si128(&dest_[n+2], d);

				// UNROLL 4 //
				d = _mm_load_si128(&dest_[n+3]);
				s = _mm_load_si128(&source_[n+3]);
				// set alpha to lo16 from dest_
				rb = _mm_srli_epi32(d, 24);
				a = _mm_slli_epi32(rb, 16);
				a = _mm_or_si128(rb, a);
				// fix alpha a = a > 127 ? a+1 : a
				rb = _mm_srli_epi16(a, 7);
				a = _mm_add_epi16(a, rb);
				rb = _mm_and_si128(lomask, s);		// unpack
				rb = _mm_mullo_epi16(rb, a);		// mul (D*S)
				rb = _mm_srli_epi16(rb, 8);		// prepack and div [(D*S)]/255
				ag = _mm_srli_epi16(s, 8);		// unpack
				ag = _mm_mullo_epi16(ag, a);		// mul (D*S)
				ag = _mm_and_si128(ag, himask);		// prepack and div [(D*S)]/255
				rb = _mm_or_si128(rb, ag);		// pack
				rb = _mm_sub_epi8(s, rb);		// sub S-[(D*S)/255]
				d = _mm_add_epi8(d, rb);		// add D+[S-(D*S)/255]
				_mm_store_si128(&dest_[n+3], d);
			}
		}
	private:
		__m128i* dest_;
		__m128i* source_;
	};

	__m128i* dest = reinterpret_cast<__m128i*>(pDest);
	__m128i* source = reinterpret_cast<__m128i*>(pSrc);

	static tbb::affinity_partitioner ap;
	tbb::parallel_for(tbb::blocked_range<size_t>(0, (count/sizeof(__m128i))/4), Blend(dest, source), ap);
}
[/cpp]
As a side note that doesn't have anything to do with TBB: I'd like to have my prefetches between the calculations and not right after the load instructions... but the compiler-generated assembly puts the prefetch right after the loads... is there any way I can tell the compiler not to do this?
It seems like this should scale perfectly if access to memory can keep up (you haven't exhausted anything with outer-loop parallelism yet, have you?); and I'm assuming that this is all the processing there is to do on the images (otherwise the different stages should be moved inside the loop). I haven't done any SSE coding, so I'm ignoring that part of it, but I think I picked up somewhere that the compiler would gain more optimisation confidence with something like "for(int k = r.begin(), k_end = r.end(); k != k_end; ++k)" instead; I also would use a macro or perhaps template metaprogramming for the unrolling, because I have a phobia about explicit redundancy (so much to read, so many opportunities to make a mistake), and because it makes for easier experimentation about the level of unrolling (perhaps unrolling is superfluous with the rewritten loop header?). But principally I would use auto_partitioner instead, because this code does not seem to fit what affinity_partitioner is supposed to do (hmm, how does it do that?), and perhaps even simple_partitioner (to really have a handle on what's going on). Does that do anything for you?
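In code, the rewritten loop header plus an explicit partitioner might look roughly like this (a toy sketch only, not the blend code above; the names Scale and ScaleAll are made up for illustration):
[cpp]#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

// Toy body just to show the shape: cache r.end() once and
// hand an explicit partitioner to parallel_for.
struct Scale
{
	float* data;
	void operator()(const tbb::blocked_range<size_t>& r) const
	{
		for (size_t k = r.begin(), k_end = r.end(); k != k_end; ++k)
			data[k] *= 0.5f;
	}
};

void ScaleAll(float* data, size_t n)
{
	Scale body = { data };
	// auto_partitioner lets the library pick chunk sizes;
	// simple_partitioner would instead split all the way down to the range's grainsize.
	tbb::parallel_for(tbb::blocked_range<size_t>(0, n), body, tbb::auto_partitioner());
}
[/cpp]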
Quoting - Raf Schietekat
It seems like this should scale perfectly if access to memory can keep up (you haven't exhausted anything with outer-loop parallelism yet, have you?); and I'm assuming that this is all the processing there is to do on the images (otherwise the different stages should be moved inside the loop). I haven't done any SSE coding, so I'm ignoring that part of it, but I think I picked up somewhere that the compiler would gain more optimisation confidence with something like "for(int k = r.begin(), k_end = r.end(); k != k_end; ++k)" instead; I also would use a macro or perhaps template metaprogramming for the unrolling, because I have a phobia about explicit redundancy (so much to read, so many opportunities to make a mistake), and because it makes for easier experimentation about the level of unrolling (perhaps unrolling is superfluous with the rewritten loop header?). But principally I would use auto_partitioner instead, because this code does not seem to fit what affinity_partitioner is supposed to do (hmm, how does it do that?), and perhaps even simple_partitioner (to really have a handle on what's going on). Does that do anything for you?
That did help a little... but not much... I'm not doing anything outside of the function when testing.
I'm doing the testing like this:
[cpp]double time = 0.0;
_asm emms;
for(int i = 0; i < TestCount; ++i)
{
InitData(pDest, pSrc);
for(int n = 0; n < dataSize()/64; ++n) // flush cache
{
_mm_clflush(pDest+n*64);
_mm_clflush(pSrc+n*64);
}
global_pTimer->GetTimespan();
(*pFn)(pDest, pSrc, count);
time += global_pTimer->GetTimespan();
}
_asm emms;
return time/(double)TestCount;[/cpp]
I believe you are right in suspecting that the memory might not be keeping up... I'm wondering if it would be possible to have the different cores start their tasks at different times (some delay before beginning)... maybe this would make the memory accesses interleaved between the cores, so that while one core is doing calculations the other accesses memory... is there any way to express this in TBB?
Template metaprogramming for the unrolling sounds like a great idea for future reference... in this case I want the manual unrolling so that I can place the prefetching more optimally.
"that did help a little... but not much..."
:-) But did you also try a different partitioner (the rest of what I wrote was mainly filler material)?
"i believe you are right when suspecting that the memory might not be keeping up..."
That would be if you were combining inter-image concurrency with intra-image concurrency (I have no idea how much bandwidth you are using or what would be the limit), as would seem appropriate, but I see no reason for saturation only inside an image, as in the test code, if it does not occur across images (or does it?).
"im wondering if it would be possible to have the different cores start the tasks at different times? (some delay before beginning)... this maybe would make the memory access interleaved between the cores? so that while one core is doing calculations the other access memory... is there anyway to express this in tbb?"
If you really think that it might be beneficial to stagger tasks like this, I think you'll need to bring your own tricks.
"metatemplates for unrolling sounds like a great idea for future reference... in this case i want the manual unlooping so that i can place the prefetching more optimal"
As I wrote, I'm not familiar with SSE programming, but it wouldn't worry me too much if the prefetches are right after the loads, because it would seem that prefetch instructions have to get whatever lead time they can to be effective.
So, what about auto_partitioner?
":-) But did you also try a different partitioner (the rest of what I wrote was mainly filler material)?"
I've tried auto_partitioner and it seems to work a bit better; I'm getting around a 30% performance increase vs. single-threaded.
I read the documentation for the partitioners and it makes sense that auto_partitioner is what should be used...
Although my test results vary somewhat between runs... how can I improve the testing?
The InitData doesn't do anything more than set the first few values in the frames to specific values so that I can make sure it does what it should...
I'm flushing the cache between each run so that nothing is preloaded in the cache...
"That would be if you were combining inter-image concurrency with intra-image concurrency (I have no idea how much bandwidth you are using or what would be the limit), as would seem appropriate, but I see no reason for saturation only inside an image, as in the test code, if it does not occur across images (or does it?)."
You kinda lost me here... it sounds very interesting but I don't understand... my English isn't that good... could you please explain in more detail?
"If you really think that it might be beneficial to stagger tasks like this, I think you'll need to bring your own tricks."
I have no idea if it would be beneficial... I'm just throwing around ideas...
"As I wrote, I'm not familiar with SSE programming, but it wouldn't worry me too much if the prefetches are right after the loads, because it would seem that prefetch instructions have to get whatever lead time they can to be effective."
You've convinced me... changed to template metaprogramming for the unrolling.
I'm still a bit surprised that this doesn't scale better... it seems to me like something that should scale very close to the actual number of physical threads (cores).
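For reference, the compile-time unrolling being discussed could look roughly like this (just a sketch; blendOneBlock is a made-up stand-in for whatever functor blends a single __m128i):
[cpp]// Sketch: recursive template that the compiler expands into N back-to-back calls.
template<int N>
struct Unroll
{
	template<class Body>
	static void step(const Body& body, size_t n)
	{
		Unroll<N-1>::step(body, n);	// copies 1..N-1
		body(n + N - 1);		// N-th copy of the blend body
	}
};

template<>
struct Unroll<0>
{
	template<class Body>
	static void step(const Body&, size_t) {}
};

// inside operator(), processing four __m128i per range element:
// Unroll<4>::step(blendOneBlock, k*4);
[/cpp]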
"ive tried auto_partitioner and it seems to work a bit better"
Sorry, I was hoping that it would work a lot better. I'm out of ideas now.
"u kinda lost me here... it sounds very interesting but i dont understand... my english isnt that good... could you please explain in more detail?"
Let's start with: how much scalability do you get from running multiple single-threaded processes (it should be comparable)?
Have you profiled to see where the time is really being spent?
AJ
Quoting - AJ
Have you profiled to see where the time is really being spent?
AJ
I did profile it... but it just shows how much time is spent in the actual blending function... no details...
I've also run Thread Profiler, which shows full utilization of both cores... since it doesn't scale more than 20-30%, the only explanation I can think of is that there is an overhead somewhere or the memory can't keep up...
I haven't tried running multiple single-threaded processes yet...
I guess my only question is whether TBB has any possibilities to tune memory usage...
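TBB doesn't tune memory traffic directly, but the amount of data each task touches can be pinned down by giving the blocked_range an explicit grainsize and using simple_partitioner; a minimal sketch against the code above (the grainsize of 64 is an arbitrary example value):
[cpp]// Sketch: pinning the chunk size explicitly. With simple_partitioner, each task gets
// at most 'grain' range elements (here 64, i.e. 64*4 __m128i per task).
static const size_t grain = 64;
tbb::parallel_for(tbb::blocked_range<size_t>(0, (count/sizeof(__m128i))/4, grain),
                  Blend(dest, source),
                  tbb::simple_partitioner());
[/cpp]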
Is it possible for you to give a testcase that we can tinker with to see if we can make it faster?
AJ
Quoting - AJ
Is it possible for you to give a testcase that we can tinker with to see if we can make it faster?
AJ
sure...
Quoting - nagy
why is that incredible?
Sorry for going off-track, but it's the first time I can remember someone actually getting the link to work on the first attempt; 99% of the time people (as in everyone) fail to get their file linked on the first try. So it's just kind of unexpected that you got it to work right away :)