Intel® ISA Extensions

Parallel dependence in bitmap scaling code

DLake1

This method copies a portion of a large bitmap (RGB8) into a new image, flipping it top to bottom, and also downscales it by an integer multiplier.

It's for real-time rendering of images too big for most applications, and I need to figure out where the parallel dependence is and how I can optimize it.

// src: source pixels;  sst: source stride (bytes per source row)
// dst: destination;    vst: destination stride (bytes per output row)
// count: output rows;  zmul: integer downscale factor
__declspec(dllexport) void Copy(unsigned char* __restrict src, const long long sst, unsigned char* __restrict dst, const long long vst, const long long count, const long long zmul){
	if (zmul <= 1){
		// no scaling: copy row by row, stepping src backwards to flip vertically
		for (auto i = 0; i < count; ++i){
			memcpy(dst + i*vst, src - i*sst, vst);
		}
	} else{
		const auto st = (sst + vst)*zmul;	// rewinds the j-loop advance (vst*zmul) and steps back zmul source rows (sst*zmul)
		const auto zmsq = zmul*zmul;		// pixels averaged per output pixel
		const auto zmul3 = zmul * 3;		// source bytes per output pixel, horizontally
		for (auto i = 0; i < count; ++i, src -= st){
			for (auto j = 0; j < vst; j += 3, src += zmul3){
				// average one zmul x zmul block of source pixels per channel
				unsigned int r = 0, g = 0, b = 0;
				for (auto k = 0; k < zmul; ++k) {
					for (auto l = 0; l < zmul; ++l) {
						r += src[k*3 + l*sst];
						g += src[k*3 + 1 + l*sst];
						b += src[k*3 + 2 + l*sst];
					}
				}
				dst[i*vst + j] = r / zmsq;
				dst[i*vst + j + 1] = g / zmsq;
				dst[i*vst + j + 2] = b / zmsq;
			}
		}
	}
}

 

MGRAV

Hi,

Personally, I don't see any strong dependence. You can parallelize wherever you like.

I would parallelize (with static scheduling) at the highest level, the outer loop over i, and check whether vectorization happens in the innermost loops over the source. (If it doesn't, you can probably force it with an omp simd reduction clause; see the sketch below.)
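Something like this is what I have in mind for the else branch, as a rough, untested sketch (it assumes OpenMP is enabled, e.g. -fopenmp or /Qopenmp, and replaces the pointer bumps on src with index arithmetic so that each outer iteration is independent):

#pragma omp parallel for schedule(static)
for (long long i = 0; i < count; ++i) {
	const unsigned char* row = src - i * sst * zmul;	// source block base for output row i
	for (long long j = 0; j < vst; j += 3) {
		const unsigned char* px = row + (j / 3) * zmul * 3;	// top-left of the zmul x zmul block
		unsigned int r = 0, g = 0, b = 0;
		#pragma omp simd reduction(+:r, g, b)
		for (long long l = 0; l < zmul; ++l)
			for (long long k = 0; k < zmul; ++k) {
				r += px[l * sst + k * 3];
				g += px[l * sst + k * 3 + 1];
				b += px[l * sst + k * 3 + 2];
			}
		dst[i * vst + j] = r / (zmul * zmul);
		dst[i * vst + j + 1] = g / (zmul * zmul);
		dst[i * vst + j + 2] = b / (zmul * zmul);
	}
}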

I am surprised that you didn't find code in a library that does this well: IPP, OpenCV...

When you say a big image, how big is it? If they are very, very... very big, I suppose the best solution is to divide the work into blocks, as for GPUs.

DLake1

I've attached an optimization report.

If I parallelize either of the outer 2 loops, the image I get is garbled into zmsq+zmul/2 strips across the display, and if I vectorize either of the inner 2 loops, the result is slower.

I prefer not to use libraries so I can learn how to write and optimize code like this myself.

I discovered heic1502a, the sharpest-ever view of the Andromeda Galaxy, and was unable to open the 4.31GB PSB (planar, uncompressed) file because I don't use Photoshop, so I took the opportunity to write my own program to open it.

I have just started experimenting with CUDA, but if I can't parallelize it on a CPU I won't have a hope in hell of getting it to work with CUDA.

MGRAV

OK, I understand better what you are trying to do; the source image is constant. First, with your approach you could work with 4 values per pixel instead of 3, which will probably improve performance (especially for vectorization). You do the conversion once at the origin, as an initialization step; that increases memory usage by 1/3. At the same moment you can flip the lines of the image, so that is done once and for all. A rough sketch of that one-time pass is below.
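Something like this (the function and buffer names are mine): expand RGB8 to RGBA8 and flip vertically in the same loop; rgba must hold height * width * 4 bytes.

void ToFlippedRGBA(const unsigned char* rgb, const long long width, const long long height, unsigned char* rgba){
	for (long long y = 0; y < height; ++y){
		const unsigned char* s = rgb + (height - 1 - y) * width * 3;	// read the source bottom-up to flip
		unsigned char* d = rgba + y * width * 4;
		for (long long x = 0; x < width; ++x){
			d[x*4]     = s[x*3];
			d[x*4 + 1] = s[x*3 + 1];
			d[x*4 + 2] = s[x*3 + 2];
			d[x*4 + 3] = 0;	// padding byte, never displayed
		}
	}
}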

If you analyze the situation, you see that the part that takes time is when you need to average lots of pixels; to solve this specific issue one usually pre-computes a pyramid of images at different scales.

If it's just for visualization, I suppose you can just do point sampling; the result usually "looks" better, if less smooth, and it's definitely much, much faster. You then have only as many memory reads as you have pixels to display. Probably you can even handle that with a single thread. For example:
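A rough sketch of the sampling version (same arguments as your Copy, picking the top-left pixel of each zmul x zmul block instead of averaging):

for (long long i = 0; i < count; ++i){
	const unsigned char* row = src - i * sst * zmul;	// source block base for output row i
	for (long long j = 0; j < vst; j += 3){
		const unsigned char* px = row + (j / 3) * zmul * 3;	// one read per output pixel
		dst[i*vst + j]     = px[0];
		dst[i*vst + j + 1] = px[1];
		dst[i*vst + j + 2] = px[2];
	}
}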

DLake1

You've got the idea, Mathieu, but with such massive images I need to conserve memory, so I wouldn't want to introduce a 4th subpixel or precompute different sizes. I could pre-flip the image, but it doesn't seem worth the effort, as it doesn't appear to affect performance or optimization.

jimdempseyatthecove

Commander,

The outermost loop should parallelize fine, as long as src is private to the parallel region.

The innermost loop will likely not vectorize well, due to the size difference of the vector elements (unsigned char to unsigned int) and because the number of pixels being averaged is unknown at compile time. You might want to consider specialized variants for 2, 3, 4, ... x factors, as well as using unsigned short for r, g, b when the factors are less than 256.

Jim Dempsey

DLake1

I managed to increase performance 5x by hard-coding zmul and replacing "src += zmul3" and "src -= st" with index arithmetic!

const auto zmul = 2;		// downscale factor fixed at compile time
const auto st = sst*zmul;	// source bytes stepped back per output row
const auto zmsq = zmul*zmul;	// pixels averaged per output pixel
const auto zmul3 = zmul * 3;	// source bytes per output pixel, horizontally
for (auto i = 0; i < count; ++i) {
	for (auto j = 0; j < vst; j += 3) {
		unsigned int r = 0, g = 0, b = 0;
		for (auto k = 0; k < zmul; ++k) {
			for (auto l = 0; l < zmul; ++l) {
				// pure index arithmetic: no loop-carried pointer updates
				r += src[k*3 + l*sst + j/3*zmul3 - i*st];
				g += src[k*3 + 1 + l*sst + j/3*zmul3 - i*st];
				b += src[k*3 + 2 + l*sst + j/3*zmul3 - i*st];
			}
		}
		dst[i*vst + j] = r / zmsq;
		dst[i*vst + j + 1] = g / zmsq;
		dst[i*vst + j + 2] = b / zmsq;
	}
}

So do I have to create a version with every possible zmul value or is there something else I can do to get the same effect?

jimdempseyatthecove (Accepted Solution)

>>So do I have to create a version with every possible zmul value or is there something else I can do to get the same effect?

No. Most likely a handful of versions for the smaller values (constant iteration count), then the generic version as you have it for larger counts. A sketch of what I mean is below.
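Something along these lines (a rough sketch; the names CopyScaled and CopyDispatch are mine). With the factor a compile-time constant the compiler can fully unroll the reduction, and unsigned short accumulators are safe for the whole block sum while ZMUL*ZMUL*255 < 65536, i.e. ZMUL <= 16:

template <int ZMUL>
static void CopyScaled(const unsigned char* __restrict src, const long long sst, unsigned char* __restrict dst, const long long vst, const long long count){
	for (long long i = 0; i < count; ++i){
		const unsigned char* row = src - i * sst * ZMUL;	// flip: walk the source backwards
		for (long long j = 0; j < vst; j += 3){
			const unsigned char* px = row + (j / 3) * ZMUL * 3;
			unsigned short r = 0, g = 0, b = 0;	// enough for ZMUL <= 16
			for (int l = 0; l < ZMUL; ++l)		// constant trip counts:
				for (int k = 0; k < ZMUL; ++k){	// the compiler can fully unroll
					r += px[l*sst + k*3];
					g += px[l*sst + k*3 + 1];
					b += px[l*sst + k*3 + 2];
				}
			dst[i*vst + j]     = (unsigned char)(r / (ZMUL*ZMUL));
			dst[i*vst + j + 1] = (unsigned char)(g / (ZMUL*ZMUL));
			dst[i*vst + j + 2] = (unsigned char)(b / (ZMUL*ZMUL));
		}
	}
}

void CopyDispatch(const unsigned char* src, const long long sst, unsigned char* dst, const long long vst, const long long count, const long long zmul){
	switch (zmul){
	case 2:  CopyScaled<2>(src, sst, dst, vst, count); break;
	case 3:  CopyScaled<3>(src, sst, dst, vst, count); break;
	case 4:  CopyScaled<4>(src, sst, dst, vst, count); break;
	case 8:  CopyScaled<8>(src, sst, dst, vst, count); break;
	default: break;	// fall back to the runtime-zmul version for larger factors
	}
}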

An additional optimization would be to treat the vertical summation differently from the horizontal summation.

For example, construct 3 variants, one each for SSE, AVX, and AVX-512, using the intrinsic functions for each architecture.

An SSE vector holds 16 bytes / 8 shorts; sizing for 3 registers (r, g, b) can handle a reduction of up to 8 x nn pixels.
An AVX/AVX2 vector holds 32 bytes / 16 shorts; 3 registers can handle a reduction of up to 16 x nn pixels.
An AVX-512 vector holds 64 bytes / 32 shorts; 3 registers can handle a reduction of up to 32 x nn pixels.

The vertical summations, over multiples of 8, 16, or 32 pixels, can be performed as SIMD vectors. Lesser widths can be handled with the masked variants of the instructions, or by explicitly masking with AND instructions where masked forms are not available. The final horizontal summation can be done initially with SIMD, with and without mask/AND, until you are down to 3 registers. Then you will have to use the intrinsics that swizzle/shuffle the lanes about... or, if you are lazy, you can dump the 3 vectors to RAM and sum them in C code.
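To make the vertical part concrete, here is a minimal SSE2 sketch (the function name is mine): it widens 8 source bytes per row to 16-bit lanes and accumulates them over "rows" source rows; the lanes cannot overflow while rows <= 257 (257 * 255 = 65535). The interleaved-RGB horizontal reduction can then start from the stored sums, i.e. the "dump to RAM and sum in C" fallback described above.

#include <emmintrin.h>	// SSE2

static inline void sum_rows_sse2(const unsigned char* src, const long long sst, const int rows, unsigned short out[8]){
	const __m128i zero = _mm_setzero_si128();
	__m128i acc = zero;	// 8 unsigned 16-bit accumulators
	for (int l = 0; l < rows; ++l){
		// load 8 bytes of one source row and zero-extend them to 16 bits
		const __m128i v = _mm_loadl_epi64((const __m128i*)(src + l * sst));
		acc = _mm_add_epi16(acc, _mm_unpacklo_epi8(v, zero));
	}
	_mm_storeu_si128((__m128i*)out, acc);	// per-byte-column vertical sums
}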

You will have to weigh the amount of work you invest against the returns gained.

Jim Dempsey
