How to speed up this code?

Alexander_L_1 · ‎01-17-2017

Hello everyone,

Many thanks to everyone who contributed to my previous question.

Crazy things happen: two years ago I was moved internally to UI & communication development to speed things up there :) So my knowledge is, ehm, not very current.

Today the algorithmic requirements and the volume of data to process have grown massively, so I have moved back to "hard stuff" development for a few days. And it seems I need a lot of help.

First of all, the target production machine is either a Core i5, a Core i7 or a Xeon, and we need heavy optimization for the i7 or the Xeon. The i7 has dual-channel memory, the Xeon has quad-channel memory. Currently I am trying to optimize for the Xeon. The specification is attached as a file.

What the algorithms do:

There were originally two algorithms. The first algorithm took an ARGB image and a special map as input and produced an ARGB image as output, with a specified width, height and stride. This algorithm simply does a 3x3 average blur with one special feature: the output has the blurred value where the map has 0xFF and the original value where it does not. The second algorithm took the output of the first algorithm as its input and produces four planar output channels: red, green and blue, plus a fourth "gray" channel calculated from R, G and B with some math (not a simple average).
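As a point of reference for the split stage described above, here is a minimal scalar sketch, not the production SIMD code. It assumes B, G, R, A byte order in memory (as in the shuffle comments of the posted code) and weights in 1/256 units that are non-negative and sum to at most 256. The `Planes` struct and the name `splitToSignedPlanes` are hypothetical and do not appear in the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for the four planar outputs (not part of the posted code).
struct Planes {
    std::vector<std::int8_t> gray, red, green, blue;
};

// Scalar reference for the split stage: interleaved BGRA in, four signed planes out.
// Weights are fixed-point with 256 == 1.0 (assumption: non-negative, sum <= 256,
// so a 16-bit accumulator as used by the SIMD version cannot overflow).
Planes splitToSignedPlanes(const std::uint8_t* bgra, std::size_t pixelCount,
                           int redWeight256, int greenWeight256, int blueWeight256)
{
    assert(redWeight256 >= 0 && greenWeight256 >= 0 && blueWeight256 >= 0);
    assert(redWeight256 + greenWeight256 + blueWeight256 <= 256);

    Planes out;
    out.gray.resize(pixelCount);
    out.red.resize(pixelCount);
    out.green.resize(pixelCount);
    out.blue.resize(pixelCount);

    for (std::size_t i = 0; i < pixelCount; ++i) {
        const int b = bgra[4 * i + 0];
        const int g = bgra[4 * i + 1];
        const int r = bgra[4 * i + 2];
        // Byte 3 is alpha and is not used by the split.

        // Weighted gray, divided by 256 with a shift.
        const int gray = (redWeight256 * r + greenWeight256 * g + blueWeight256 * b) >> 8;

        // Adding 128 modulo 256 and reinterpreting the byte as signed
        // is the same as subtracting 128: 0..255 maps to -128..127.
        out.gray[i]  = static_cast<std::int8_t>(gray - 128);
        out.red[i]   = static_cast<std::int8_t>(r - 128);
        out.green[i] = static_cast<std::int8_t>(g - 128);
        out.blue[i]  = static_cast<std::int8_t>(b - 128);
    }
    return out;
}
```

A reference like this is mainly useful for checking the SIMD output on small test images before timing anything.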

Both algorithms were parallelized 2x: the first feed processes the upper half of the data (in memory), the second feed processes the other half. That is approximately 2x faster than a single feed.
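The split into feeds can be written as a small helper. The sketch below mirrors the arithmetic of the two-feed case in the posted dispatcher (top line of each band wrapped with `rowMask`, byte offset = rows × `step`). The `Band` struct, the function name `splitIntoBands` and the rounding of band sizes down to an even row count are assumptions added for illustration; the rounding is there because the kernel consumes two rows per outer iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of the work handed to one feed.
struct Band {
    std::uint64_t topLine;   // first source row inside the circular buffer
    int rows;                // number of rows in this band
    std::size_t byteOffset;  // offset of the band's first row in the linear buffers
};

// Splits `height` rows into `feeds` bands. Every band except the last gets an
// even number of rows; the last band takes whatever remains.
std::vector<Band> splitIntoBands(int height, int step, std::uint64_t topLine,
                                 std::uint64_t rowMask, int feeds)
{
    assert(feeds > 0 && height >= 0 && step >= 0);

    std::vector<Band> bands;
    const int rowsPerBand = (height / feeds) & ~1;  // round down to an even count
    int firstRow = 0;
    for (int i = 0; i < feeds; ++i) {
        const int rows = (i == feeds - 1) ? height - firstRow : rowsPerBand;
        bands.push_back({(topLine + static_cast<std::uint64_t>(firstRow)) & rowMask,
                         rows,
                         static_cast<std::size_t>(firstRow) * static_cast<std::size_t>(step)});
        firstRow += rows;
    }
    return bands;
}
```

Each band can then be passed to the single-core kernel from its own task, with the source pointer left unchanged and only the top line shifted, as the posted code does for circular buffers.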

What is already done:

1. I've moved from SSE (128 bit) to AVX2 (256 bit) in the hope of using the full bandwidth of quad-channel memory. The speedup was extremely small and not measurable.
2. I've tried 4x parallelization. The same thing happens: no speedup, and sometimes it takes more time than 2x parallelization.
3. I've combined both algorithms in order to save one memory read from an area that is potentially no longer cached. This gives a speedup of only 1-2%, mostly not measurable.
4. I've experimented with a lot of other things: cached read/uncached write (slower execution) and vice versa, some prefetching, "time-deferred write" (see the switch/case in the code) and some other gimmicks; writing 32-bit values, or packing them together until a 128-bit register is full and then writing it out, etc. None of that helped.
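The "pack until a 128-bit register is full, then write" idea from item 4 can be modeled in plain C++ to make the bookkeeping explicit. This is only a scalar sketch: the class `BlockWriter` and its methods are hypothetical, a 16-byte array stands in for the `__m128i` register, and `std::memcpy` stands in for the 128-bit store.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: gathers 4-byte groups and writes them out 16 bytes at a time.
class BlockWriter {
public:
    explicit BlockWriter(std::uint8_t* dst) : dst_(dst) { assert(dst != nullptr); }

    // Appends four bytes; once four groups are collected, the full block is written.
    void push4(const std::uint8_t* four) {
        std::memcpy(block_ + fill_, four, 4);
        fill_ += 4;
        if (fill_ == 16) flush();
    }

    // Writes whatever is pending. Calling this at the end of each row keeps a
    // partial or just-completed block from being carried over into the next row.
    void flush() {
        std::memcpy(dst_, block_, fill_);
        dst_ += fill_;
        fill_ = 0;
    }

private:
    std::uint8_t* dst_;
    std::uint8_t block_[16] = {};
    std::size_t fill_ = 0;
};
```

One writer per planar channel and per output row gives the same store pattern as the register-based version, with an explicit flush at row boundaries.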

To summarize, I spent a lot of time and did not achieve a significant improvement. But I found an interesting thing. Originally the two algorithms took approx. 550 µs for the first (blur & map) algorithm and approx. 180 µs for the second (split channels) algorithm. The combined algorithm takes approx. 1080 µs, which is surprisingly a lot slower (inside a program which does more and has more than one thread!). Most of the time is spent writing the planar data to memory. If no planar data is written, the time drops to approx. 560 µs, and it grows by about 120 µs for each channel written. If this penalty could be avoided, that would help a lot.

The main part (blur) of the algorithm processes two consecutive lines at once in order to reuse the added values and save a read from memory that may be out of cache; accordingly, the second part (split) also writes two lines out for each planar channel.
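The reuse of sums across two output lines can be sketched in scalar form. The example below is a simplification under stated assumptions: a single 8-bit channel instead of BGRA, no map, an even image height, and border pixels copied unchanged. The names `div9Rounded`, `hsum` and `blurTwoRowsAtOnce` are hypothetical. The division rounds to nearest, which is also what the `_mm_mulhrs_epi16` multiply by 3641/32768 in the posted code approximates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Division by 9 with rounding to nearest.
inline int div9Rounded(int sum) { return (sum + 4) / 9; }

// Horizontal 3-tap sum around interior column x.
inline int hsum(const std::uint8_t* row, int x) { return row[x - 1] + row[x] + row[x + 1]; }

// 3x3 box blur of the image interior, producing two output rows per pass.
// The sum of the two middle rows is computed once and shared by both outputs.
std::vector<std::uint8_t> blurTwoRowsAtOnce(const std::vector<std::uint8_t>& img,
                                            int width, int height)
{
    assert(width >= 3 && height >= 4 && height % 2 == 0);
    assert(img.size() == static_cast<std::size_t>(width) * static_cast<std::size_t>(height));

    std::vector<std::uint8_t> out(img);  // borders keep their original values
    for (int y = 1; y + 2 < height; y += 2) {
        const std::uint8_t* r0 = img.data() + (y - 1) * width;
        const std::uint8_t* r1 = img.data() + y * width;
        const std::uint8_t* r2 = img.data() + (y + 1) * width;
        const std::uint8_t* r3 = img.data() + (y + 2) * width;
        for (int x = 1; x < width - 1; ++x) {
            const int shared = hsum(r1, x) + hsum(r2, x);  // used by both output rows
            out[y * width + x] =
                static_cast<std::uint8_t>(div9Rounded(hsum(r0, x) + shared));
            out[(y + 1) * width + x] =
                static_cast<std::uint8_t>(div9Rounded(shared + hsum(r3, x)));
        }
    }
    return out;
}
```

Per pair of output rows this reads four source rows instead of six, which is the saving the two-line scheme is after.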

Here is the code:

	//AL: (03.02.2014) ==> ...
	void fast_blur3x3_BGRA_Snapshot_horiz2d_with_andmap_To_Gr_R_G_B_signed_byte_1core_SSE4(int width, int height, int step, const int topLine, const int xReg,
		const unsigned long long rowMask, const uint8* const map, const uint8* const in, uint8* const blurred,
		// Splitting
		const void* const gray, const void* const red, const void* const green,
		const void* const blue, const short redWeight256, const short greenWeight256, const short blueWeight256
	)
	{
		// step is stride in bytes!
		// xReg is in bytes (not pixels)!

		// Bluring and mapping
		int pixStride = step >> 2; // convert int8* to __int32* stride
		int x, y;
		__m128i *dst0, *dst1;
		__m128i *src0map, *src1map;

		register __m128i one_ninth = _mm_set1_epi16(1821 * 2 - 1); // 3641 ~= 2^15 / 9, used with _mm_mulhrs_epi16 for the division by 9
		register __m128i zero = _mm_setzero_si128();
		register __m128i s00, s01, center1, center2;
		register __m128i s00h, s01h;
		register __m128i r00l, r01l, r02l;
		register __m128i r00h, r01h, r02h;
		register __m128i mapv;

		__int32* const src = (__int32* const)(in + xReg);

		__int32* dst0Y = (__int32*)blurred;
		__int32* dst1Y = (__int32*)(blurred + step);

		__int32* src0Ymap = (__int32*)(map + xReg);
		__int32* src1Ymap = (__int32*)(map + step + xReg);

		register int pixStride2 = pixStride << 1;

		// Splitting
		//uint8* argbSrc = (uint8*)src;
		uint8* grayDst = (uint8*)gray;
		uint8* redDst = (uint8*)red;
		uint8* greenDst = (uint8*)green;
		uint8* blueDst = (uint8*)blue;

		//const int sourceImageSizeXDiv16Mul4 = step >> 4;
		//const int destinationImageSizeXDiv16 = step >> 6;
		const int destinationSectionWidthDiv16 = width >> 4; // 16 bytes at once

		 //perform calculation

		//3,7,11,15 ==>A channel
		//2,6,10,14 ==>R channel
		//1,5,9,13  ==>G channel
		//0,4,8,12  ==>B channel
		//                                                     R-ch.     G-ch.     B-ch.     A-ch.
		register const __m128i patternSelect = _mm_set_epi8(14, 10, 6, 2, 13, 9, 5, 1, 12, 8, 4, 0, 15, 11, 7, 3);
		register const __m128i nNull = _mm_setzero_si128();
		register const __m128i mulB = _mm_set1_epi16(blueWeight256);
		register const __m128i mulG = _mm_set1_epi16(greenWeight256);
		register const __m128i mulR = _mm_set1_epi16(redWeight256);

#pragma warning(disable:4309) // disable fully bogus warning for bad definition of intrinsic _mm_set1_epi8 method produced by completely paranoidal warning level 4.
		register const __m128i subMask = _mm_set1_epi8(128);
#pragma warning(default:4309)

		register __m128i values1;
		register __m128i values2;
		register __m128i values3;
		//register __m128i values4;


		register __m128i values5cacheRed_012 = _mm_setzero_si128();
		register __m128i values5cacheGreen_012 = _mm_setzero_si128();
		register __m128i values5cacheBlue_012 = _mm_setzero_si128();
		register __m128i values5cacheGray_012 = _mm_setzero_si128();

		register __m128i values5cacheRed_123 = _mm_setzero_si128();
		register __m128i values5cacheGreen_123 = _mm_setzero_si128();
		register __m128i values5cacheBlue_123 = _mm_setzero_si128();
		register __m128i values5cacheGray_123 = _mm_setzero_si128();


		register __m128i values5cacheRed_012s = _mm_setzero_si128();
		register __m128i values5cacheGreen_012s = _mm_setzero_si128();
		register __m128i values5cacheBlue_012s = _mm_setzero_si128();
		register __m128i values5cacheGray_012s = _mm_setzero_si128();

		register __m128i values5cacheRed_123s = _mm_setzero_si128();
		register __m128i values5cacheGreen_123s = _mm_setzero_si128();
		register __m128i values5cacheBlue_123s = _mm_setzero_si128();
		register __m128i values5cacheGray_123s = _mm_setzero_si128();


		register __m128i word0selector = _mm_set_epi32((int)0xFFFFFFFF, 0, 0, 0);

		for (y = 0; y < height; y += 2) // AL: (15.10.2013) ==> don't skip the first and last rows, because a circular snapshot always has these lines!
		{
			// Bluring and mapping
			__int32* row1 = (__int32*)(src + ((y + topLine) & rowMask)*pixStride);
			__int32* row0 = (__int32*)(src + ((y + topLine - 1) & rowMask)*pixStride);
			__int32* row2 = (__int32*)(src + ((y + topLine + 1) & rowMask)*pixStride);
			__int32* row3 = (__int32*)(src + ((y + topLine + 2) & rowMask)*pixStride);

			dst0 = (__m128i *)dst0Y;
			dst1 = (__m128i *)dst1Y;

			src0map = (__m128i *)src0Ymap;
			src1map = (__m128i *)src1Ymap;

			// Splitting: each inner iteration produces 4 bytes per channel; they are collected in a 128-bit register and stored once 16 bytes are complete, therefore __m128i* pointers.
			__m128i* pDestinationImageRedCurrentPosition012 = (__m128i*)(redDst + y*step);
			__m128i* pDestinationImageGreenCurrentPosition012 = (__m128i*)(greenDst + y*step);
			__m128i* pDestinationImageBlueCurrentPosition012 = (__m128i*)(blueDst + y*step);
			__m128i* pDestinationImageGrayCurrentPosition012 = (__m128i*)(grayDst + y*step);

			__m128i* pDestinationImageRedCurrentPosition123 = (__m128i*)((uint8*)pDestinationImageRedCurrentPosition012 + step);
			__m128i* pDestinationImageGreenCurrentPosition123 = (__m128i*)((uint8*)pDestinationImageGreenCurrentPosition012 + step);
			__m128i* pDestinationImageBlueCurrentPosition123 = (__m128i*)((uint8*)pDestinationImageBlueCurrentPosition012 + step);
			__m128i* pDestinationImageGrayCurrentPosition123 = (__m128i*)((uint8*)pDestinationImageGrayCurrentPosition012 + step);


			register __int64 dec = 4;
			register __int64 inc = 0;
			//register __int64 inc = 0;
			//register __int64 inc32b = -4;
			register __int64 inc32b = -1;
			for (x = 1; x < width - 1; x += 4) // skip first and last column! 4 pixels (16 bytes = 128 bits) at once
			{
				//inc32b += 4;
				++inc32b;
				//++inc;
				--dec;
				//const int ipos = 3 - dec;
				// horizontal sum for line 0
				s00 = _mm_loadu_si128((__m128i*)(row0 - 1)); //Loads 128-bit value. Unaligned.
				s01 = _mm_loadu_si128((__m128i*)(row0 + 1)); // Loads 128-bit value. Unaligned.
				center1 = _mm_loadu_si128((__m128i*)(row0)); // center1=[row0], here it's correct to overwrite the variable value
				s00h = _mm_unpackhi_epi8(s00, zero); // s00h = hi[row0-1]
				s01h = _mm_unpackhi_epi8(s01, zero); // s01h = hi[row0+1]
				r00h = _mm_unpackhi_epi8(center1, zero); // r00h = hi[row0]
				s00h = _mm_add_epi16(s00h, s01h); // s00h = hi[row0-1]+hi[row0+1]
				s00 = _mm_unpacklo_epi8(s00, zero); // s00 = lo[row0-1]
				s01 = _mm_unpacklo_epi8(s01, zero); // s01 = lo[row0+1]
				s01h = _mm_unpacklo_epi8(center1, zero); // s01h = lo[row0]
				r00h = _mm_add_epi16(s00h, r00h); // r00h = hi[row0-1]+hi[row0+1]+hi[row0]
				s00 = _mm_add_epi16(s00, s01); // s00 = lo[row0-1]+lo[row0+1]
				r00l = _mm_add_epi16(s00, s01h); // r00l = lo[row0-1]+lo[row0+1]+lo[row0]

				// horizontal sum for line 1
				s00 = _mm_loadu_si128((__m128i*)(row1 - 1)); // s00 = [row1-1]
				s01 = _mm_loadu_si128((__m128i*)(row1 + 1)); // s01 = [row1+1]
				center1 = _mm_loadu_si128((__m128i*)(row1)); // center1=[row1], here it's correct to overwrite the variable value
				s00h = _mm_unpackhi_epi8(s00, zero); // s00h = hi[row1-1]
				s01h = _mm_unpackhi_epi8(s01, zero); // s01h = hi[row1+1]
				r01h = _mm_unpackhi_epi8(center1, zero); // r01h = hi[row1]
				s00h = _mm_add_epi16(s00h, s01h); // s00h = hi[row1-1]+hi[row1+1]
				s00 = _mm_unpacklo_epi8(s00, zero); // s00 = lo[row1-1]
				s01 = _mm_unpacklo_epi8(s01, zero); // s01 = lo[row1+1]
				s01h = _mm_unpacklo_epi8(center1, zero); // s01h = lo[row1]
				r01h = _mm_add_epi16(s00h, r01h); // r01h = hi[row1-1]+hi[row1+1]+hi[row1]
				s00 = _mm_add_epi16(s00, s01); // s00 = lo[row1-1]+lo[row1+1]
				r01l = _mm_add_epi16(s00, s01h); // r01l = lo[row1-1]+lo[row1+1]+lo[row1]

				// horizontal sum for line 2
				s00 = _mm_loadu_si128((__m128i*)(row2 - 1)); // s00 = [row2-1]
				s01 = _mm_loadu_si128((__m128i*)(row2 + 1)); // s01 = [row2+1]
				center2 = _mm_loadu_si128((__m128i*)(row2)); // center2 = [row2]
				s00h = _mm_unpackhi_epi8(s00, zero); // s00h = hi[row2-1]
				s01h = _mm_unpackhi_epi8(s01, zero); // s01h = hi[row2+1]
				r02h = _mm_unpackhi_epi8(center2, zero); // r02h = hi[row2]
				s00h = _mm_add_epi16(s00h, s01h); // s00h = hi[row2-1]+hi[row2+1]
				s00 = _mm_unpacklo_epi8(s00, zero); // s00 = lo[row2-1]
				s01 = _mm_unpacklo_epi8(s01, zero); // s01 = lo[row2+1]
				s01h = _mm_unpacklo_epi8(center2, zero); // s01h = lo[row2]
				r02h = _mm_add_epi16(s00h, r02h); // r02h = hi[row2-1]+hi[row2+1]+hi[row2]
				s00 = _mm_add_epi16(s00, s01); // s00 = lo[row2-1]+lo[row2+1]
				r02l = _mm_add_epi16(s00, s01h); // r02l = lo[row2-1]+lo[row2+1]+lo[row2]

				// Summarize over lines with index 1,2 ==> result in r02l, r02h
				r02h = _mm_add_epi16(r02h, r01h); // PADDW r02h = hi[line1]+hi[line2]
				r02l = _mm_add_epi16(r02l, r01l); // PADDW r02l = lo[line1]+lo[line2]

				// Load map for lines 0,1,2
				mapv = _mm_loadu_si128(src0map); // 0xFFFFFFFF is foreground, else 0x00000000

				// Summarize over lines with index 0,1,2 ==> result in r00l, r00h
				r00l = _mm_add_epi16(r00l, r02l); // r00l = lo[line0]+lo[line1]+lo[line2]
				r00h = _mm_add_epi16(r00h, r02h); // r00h = hi[line0]+hi[line1]+hi[line2]
				// -- Division by 9 ---
				// low part
				r00l = _mm_mulhrs_epi16(r00l, one_ninth);
				// high part
				r00h = _mm_mulhrs_epi16(r00h, one_ninth);
				// Pack with unsigned saturation ==> result in r00l
				r00l = _mm_packus_epi16(r00l, r00h);
				// Calculate foreground/background with center1
				center1 = _mm_and_si128(center1, mapv); // now center1 is center value if mapv=0xFFFFFFFF or zero if mapv=0x00000000
				r00l = _mm_andnot_si128(mapv, r00l); // now r00l is blurred value if mapv=0x00000000 or zero if mapv=0xFFFFFFFF
				r00l = _mm_or_si128(center1, r00l); // now r00l is center value if mapv=0xFFFFFFFF or blurred value if mapv=0x00000000
				// Store blurred result of lines with index 0,1,2. center1 can now be reused
				_mm_store_si128(dst0++, r00l);

				// Splitting begin for lines 0,1,2

#define values0 r00l
				values0 = _mm_shuffle_epi8(values0, patternSelect); // values0 = {R3_R2_R1_R0 G3_G2_G1_G0 B3_B2_B1_B0 A3_A2_A1_A0}
				values2 = _mm_unpackhi_epi32(values0, nNull); // values2 = {00_00_00_00 R3_R2_R1_R0 00_00_00_00 G3_G2_G1_G0}
				values1 = _mm_unpacklo_epi32(values0, nNull); // values1 = {00_00_00_00 B3_B2_B1_B0 00_00_00_00 A3_A2_A1_A0}

				// green values = {00_00_00_00 00_00_00_00 00_00_00_00 G3_G2_G1_G0}
				values0 = _mm_unpacklo_epi64(values2, nNull);
				// red values = {00_00_00_00 00_00_00_00 00_00_00_00 R3_R2_R1_R0}
				values2 = _mm_unpackhi_epi64(values2, nNull);
				// blue values= {00_00_00_00 00_00_00_00 00_00_00_00 B3_B2_B1_B0}
				values1 = _mm_unpackhi_epi64(values1, nNull);

#define values01Gr values3
//#define values23Gr nNull

				// calculate <01> gray = (redWeight256*<01R> + greenWeight256*<01G> + blueWeight256*<01B>)/256
				values01Gr = _mm_mullo_epi16(_mm_unpacklo_epi8(values2, nNull), mulR); // values01R
				values01Gr = _mm_add_epi16(values01Gr, _mm_mullo_epi16(_mm_unpacklo_epi8(values0, nNull), mulG)); // + values01G
				values01Gr = _mm_add_epi16(values01Gr, _mm_mullo_epi16(_mm_unpacklo_epi8(values1, nNull), mulB)); // + values01B
				values01Gr = _mm_srli_epi16(values01Gr, 8);
				// calculate <0123> gray in values3
				values3 = _mm_packus_epi16(values01Gr, nNull); // now values3 holds the byte-packed gray values of pixels 0..3

				// convert result to signed bytes
				values2 = _mm_add_epi8(values2, subMask); // red values (-128..+127)
				values0 = _mm_add_epi8(values0, subMask); // green values (-128..+127)
				values1 = _mm_add_epi8(values1, subMask); // blue values (-128..+127)
				values3 = _mm_add_epi8(values3, subMask); // gray values (-128..+127)

				values5cacheRed_012 = _mm_srli_si128(values5cacheRed_012, 4); // Shift right by 4 bytes;
				values5cacheGreen_012 = _mm_srli_si128(values5cacheGreen_012, 4); // Shift right by 4 bytes;
				values5cacheBlue_012 = _mm_srli_si128(values5cacheBlue_012, 4); // Shift right by 4 bytes;
				values5cacheGray_012 = _mm_srli_si128(values5cacheGray_012, 4); // Shift right by 4 bytes;

				values5cacheRed_012 = _mm_or_si128(values5cacheRed_012, _mm_slli_si128(values2, 12));
				values5cacheGreen_012 = _mm_or_si128(values5cacheGreen_012, _mm_slli_si128(values0, 12));
				values5cacheBlue_012 = _mm_or_si128(values5cacheBlue_012, _mm_slli_si128(values1, 12));
				values5cacheGray_012 = _mm_or_si128(values5cacheGray_012, _mm_slli_si128(values3, 12));

				/*
				if (dec == 0)
				{
					_mm_stream_si128(pDestinationImageRedCurrentPosition012 + inc, values5cacheRed_012);
					_mm_stream_si128(pDestinationImageGreenCurrentPosition012 + inc, values5cacheGreen_012);
					_mm_stream_si128(pDestinationImageBlueCurrentPosition012 + inc, values5cacheBlue_012);
					_mm_stream_si128(pDestinationImageGrayCurrentPosition012 + inc, values5cacheGray_012);
					//ATTENTION - will be done in a second if-block!
					//dec = 4;
				}
				*/
				switch (dec)
				{
				case 0:
					_mm_store_si128(pDestinationImageRedCurrentPosition012 + inc, values5cacheRed_012);
					values5cacheGreen_012s = values5cacheGreen_012;
					values5cacheBlue_012s = values5cacheBlue_012;
					values5cacheGray_012s = values5cacheGray_012;
					//ATTENTION - will be done in a second if-block!
					//dec = 4;
					break;
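				case 3:
					// Deferred store of the block completed in the previous "case 0" pass (index inc-1).
					// NOTE: in the first pass of a row inc is still 0, so inc-1 addresses the 16 bytes before the row start.
					_mm_store_si128(pDestinationImageGreenCurrentPosition012 + inc-1, values5cacheGreen_012s);
					break;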
				case 2:
					_mm_store_si128(pDestinationImageBlueCurrentPosition012 + inc-1, values5cacheBlue_012s);
					break;
				case 1:
					_mm_store_si128(pDestinationImageGrayCurrentPosition012 + inc-1, values5cacheGray_012s);
					break;
					/*
					*/
				}
				// Splitting end for lines 0,1,2

				// horizontal sum for line 3
				s00 = _mm_loadu_si128((__m128i*)(row3 - 1));
				s01 = _mm_loadu_si128((__m128i*)(row3 + 1));
				center1 = _mm_loadu_si128((__m128i*)(row3)); // center1=[row3], here it's correct to overwrite the variable value
				s00h = _mm_unpackhi_epi8(s00, zero); // s00h = hi[row3-1]
				s01h = _mm_unpackhi_epi8(s01, zero); // s01h = hi[row3+1]
				r00h = _mm_unpackhi_epi8(center1, zero); // r00h = hi[row3]
				s00h = _mm_add_epi16(s00h, s01h); // s00h = hi[row3-1]+hi[row3+1]
				s00 = _mm_unpacklo_epi8(s00, zero); // s00 = lo[row3-1]
				s01 = _mm_unpacklo_epi8(s01, zero); // s01 = lo[row3+1]
				s01h = _mm_unpacklo_epi8(center1, zero); // s01h = lo[row3]
				r00h = _mm_add_epi16(s00h, r00h); // r00h = hi[row3-1]+hi[row3+1]+hi[row3]
				s00 = _mm_add_epi16(s00, s01); // s00 = lo[row3-1]+lo[row3+1]
				r00l = _mm_add_epi16(s00, s01h); // r00l = lo[row3-1]+lo[row3+1]+lo[row3]

				// Load map for lines 1,2,3
				mapv = _mm_loadu_si128(src1map); // 0xFFFFFFFF is foreground, else 0x00000000

				// Summarize over lines with index 3,1,2 ==> result in r00l, r00h
				r00l = _mm_add_epi16(r00l, r02l); // r00l = lo[line3]+lo[line1]+lo[line2]
				r00h = _mm_add_epi16(r00h, r02h); // r00h = hi[line3]+hi[line1]+hi[line2]
				// -- Division by 9 ---
				// low part
				r00l = _mm_mulhrs_epi16(r00l, one_ninth);
				// high part
				r00h = _mm_mulhrs_epi16(r00h, one_ninth);
				// Pack with unsigned saturation ==> result in r00l
				r00l = _mm_packus_epi16(r00l, r00h);
				// Calculate foreground/background with center2
				center2 = _mm_and_si128(center2, mapv); // now center2 is center value if mapv=0xFFFFFFFF or zero if mapv=0x00000000
				r00l = _mm_andnot_si128(mapv, r00l); // now r00l is blurred value if mapv=0x00000000 or zero if mapv=0xFFFFFFFF
				r00l = _mm_or_si128(center2, r00l); // now r00l is center value if mapv=0xFFFFFFFF or blurred value if mapv=0x00000000
				// Store blurred result of lines with index 1,2,3
				_mm_store_si128(dst1++, r00l);

				// Splitting begin for lines 1,2,3

#define values0 r00l
				values0 = _mm_shuffle_epi8(values0, patternSelect); // values0 = {R3_R2_R1_R0 G3_G2_G1_G0 B3_B2_B1_B0 A3_A2_A1_A0}
				values3 = _mm_unpackhi_epi32(values0, nNull); // values3 = {00_00_00_00 R3_R2_R1_R0 00_00_00_00 G3_G2_G1_G0}
				values1 = _mm_unpacklo_epi32(values0, nNull); // values1 = {00_00_00_00 B3_B2_B1_B0 00_00_00_00 A3_A2_A1_A0}

				// blue values= {00_00_00_00 00_00_00_00 00_00_00_00 B3_B2_B1_B0}
				values2 = _mm_unpackhi_epi64(values1, nNull);
				// red values = {00_00_00_00 00_00_00_00 00_00_00_00 R3_R2_R1_R0}
				values0 = _mm_unpackhi_epi64(values3, nNull);
				// green values = {00_00_00_00 00_00_00_00 00_00_00_00 G3_G2_G1_G0}
				values1 = _mm_unpacklo_epi64(values3, nNull);

#define values01Gr values3

				// calculate <01> gray = (redWeight256*<01R> + greenWeight256*<01G> + blueWeight256*<01B>)/256
				values01Gr = _mm_mullo_epi16(_mm_unpacklo_epi8(values0, nNull), mulR); // values01R
				values01Gr = _mm_add_epi16(values01Gr, _mm_mullo_epi16(_mm_unpacklo_epi8(values1, nNull), mulG)); // + values01G
				values01Gr = _mm_add_epi16(values01Gr, _mm_mullo_epi16(_mm_unpacklo_epi8(values2, nNull), mulB)); // + values01B
				values01Gr = _mm_srli_epi16(values01Gr, 8);

				// calculate <0123> gray in values3
				values3 = _mm_packus_epi16(values01Gr, nNull); // now values3 holds the byte-packed gray values of pixels 0..3

				// convert result to signed bytes
				values0 = _mm_add_epi8(values0, subMask); // red values (-128..+127)
				values1 = _mm_add_epi8(values1, subMask); // green values (-128..+127)
				values2 = _mm_add_epi8(values2, subMask); // blue values (-128..+127)
				values3 = _mm_add_epi8(values3, subMask); // gray values (-128..+127)

				values5cacheRed_123 = _mm_srli_si128(values5cacheRed_123, 4); // Shift right by 4 bytes;
				values5cacheGreen_123 = _mm_srli_si128(values5cacheGreen_123, 4); // Shift right by 4 bytes;
				values5cacheBlue_123 = _mm_srli_si128(values5cacheBlue_123, 4); // Shift right by 4 bytes;
				values5cacheGray_123 = _mm_srli_si128(values5cacheGray_123, 4); // Shift right by 4 bytes;

				values5cacheRed_123 = _mm_or_si128(values5cacheRed_123, _mm_slli_si128(values0, 12));
				values5cacheGreen_123 = _mm_or_si128(values5cacheGreen_123, _mm_slli_si128(values1, 12));
				values5cacheBlue_123 = _mm_or_si128(values5cacheBlue_123, _mm_slli_si128(values2, 12));
				values5cacheGray_123 = _mm_or_si128(values5cacheGray_123, _mm_slli_si128(values3, 12));

				/*
				if (dec == 0)
				{
					_mm_store_si128(pDestinationImageRedCurrentPosition123 + inc, values5cacheRed_123);
					_mm_stream_si128(pDestinationImageGreenCurrentPosition123 + inc, values5cacheGreen_123);
					_mm_stream_si128(pDestinationImageBlueCurrentPosition123 + inc, values5cacheBlue_123);
					_mm_stream_si128(pDestinationImageGrayCurrentPosition123 + inc, values5cacheGray_123);
					dec = 4;
					++inc;
				}
				*/
				switch (dec)
				{
				case 0:
					_mm_store_si128(pDestinationImageRedCurrentPosition123 + inc, values5cacheRed_123);
					values5cacheGreen_123s = values5cacheGreen_123;
					values5cacheBlue_123s = values5cacheBlue_123;
					values5cacheGray_123s = values5cacheGray_123;
					dec = 4;
					++inc;
					break;
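				case 3:
					// Deferred store of the block completed in the previous "case 0" pass (index inc-1).
					// NOTE: in the first pass of a row inc is still 0, so inc-1 addresses the 16 bytes before the row start.
					_mm_store_si128(pDestinationImageGreenCurrentPosition123 + inc-1, values5cacheGreen_123s);
					break;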
				case 2:
					_mm_store_si128(pDestinationImageBlueCurrentPosition123 + inc-1, values5cacheBlue_123s);
					break;
				case 1:
					_mm_store_si128(pDestinationImageGrayCurrentPosition123 + inc-1, values5cacheGray_123s);
					break;
					/*
					*/
				}
				// Splitting end for lines 1,2,3

				row0 += 4; // move to next 4 pixels (16 bytes = 128 bits)
				row1 += 4; // move to next 4 pixels (16 bytes = 128 bits)
				row2 += 4; // move to next 4 pixels (16 bytes = 128 bits)
				row3 += 4; // move to next 4 pixels (16 bytes = 128 bits)

				++src0map; // move to next 4 pixels (16 bytes = 128 bits)
				++src1map; // move to next 4 pixels (16 bytes = 128 bits)
			}
			dst0Y += pixStride2; // move to next 2 lines
			dst1Y += pixStride2; // move to next 2 lines

			src0Ymap += pixStride2;
			src1Ymap += pixStride2;
		}
	} // END OF METHOD


	void fast_blur3x3_BGRA_Snapshot_horiz2d_with_andmap_To_Gr_R_G_B_signed_byte_1core(int width, int height, int step, const int topLine, const int xReg,
		const unsigned long long rowMask, const uint8* const map, const uint8* const in, uint8* const blurred,
		// Splitting
		const void* const gray, const void* const red, const void* const green,
		const void* const blue, const short redWeight256, const short greenWeight256, const short blueWeight256
	)
	{
		if (!mUseAvx2)
			fast_blur3x3_BGRA_Snapshot_horiz2d_with_andmap_To_Gr_R_G_B_signed_byte_1core_SSE4(width, height, step, topLine, xReg,
				rowMask, map, in, blurred,
				// Splitting
				gray, red, green, blue, redWeight256, greenWeight256, blueWeight256);
		else
			//fast_blur3x3_BGRA_Snapshot_horiz2d_with_andmap_1core_SSE4( width,  height,  step, topLine, xReg, 
			fast_blur3x3_BGRA_Snapshot_horiz2d_with_andmap_To_Gr_R_G_B_signed_byte_1core_AVX2(width, height, step, topLine, xReg,
				rowMask, map, in, blurred,
				// Splitting
				gray, red, green, blue, redWeight256, greenWeight256, blueWeight256);
	} // END OF METHOD


	void fast_blur3x3_BGRA_Snapshot_horiz2d_with_andmap_To_Gr_R_G_B_signed_byte(int width, int height, int step, const int topLine, const int xReg,
		const unsigned long long rowMask, const uint8* const map, const uint8* const in, uint8* const blurred,
		// Splitting
		const void* const gray, const void* const red, const void* const green,
		const void* const blue, const short redWeight256, const short greenWeight256, const short blueWeight256
	)
	{
		//ATTENTION: for circular buffers we should only shift a top line, but not a start pointer!
		switch (mNumberOfFeeds)
		{
		case 1:
			fast_blur3x3_BGRA_Snapshot_horiz2d_with_andmap_To_Gr_R_G_B_signed_byte_1core(width, height, step, topLine, xReg, rowMask, map, in, blurred,
				// Splitting
				gray, red, green, blue, redWeight256, greenWeight256, blueWeight256);
			break;
		case 2:
		{
			int rowsInUseFirstHalf = height / 2;
			int rowsInUseSecondHalf = height - rowsInUseFirstHalf;
			unsigned long long secondHalfTopLine = (topLine + rowsInUseFirstHalf) & rowMask;
			unsigned long long secondHalfDirectOffset = rowsInUseFirstHalf * step;
			unsigned long long secondHalfOffset = rowsInUseFirstHalf * step; // For splitting
			//unsigned long long secondHalfCircularOffset=secondHalfTopLine * step;
			parallel_invoke(
				[=]()->void
			{
				fast_blur3x3_BGRA_Snapshot_horiz2d_with_andmap_To_Gr_R_G_B_signed_byte_1core(width, rowsInUseFirstHalf, step, topLine, xReg, rowMask, map, in, blurred,
					// Splitting
					gray, red, green, blue, redWeight256, greenWeight256, blueWeight256);
			},
				[=]()->void
			{
				fast_blur3x3_BGRA_Snapshot_horiz2d_with_andmap_To_Gr_R_G_B_signed_byte_1core(width, rowsInUseSecondHalf, step, secondHalfTopLine, xReg, rowMask, map + secondHalfDirectOffset, in, blurred + secondHalfDirectOffset,
					// Splitting
					ADD_OFFSET(gray, secondHalfOffset), ADD_OFFSET(red, secondHalfOffset), ADD_OFFSET(green, secondHalfOffset), ADD_OFFSET(blue, secondHalfOffset),
					redWeight256, greenWeight256, blueWeight256);
			}
			);
			break;
		}
		default:
			throw std::exception("Unsupported number of parallel feeds");
		}
		return;
	} // END OF METHOD

Many thanks in advance.

Alex