Consider this code:
__m128i bg_color = _mm_shuffle_epi8(_mm_cvtsi32_si128(*(int*)b), shuffle_mask);
__m128i fg_color = _mm_shuffle_epi8(_mm_cvtsi32_si128(*(int*)f), shuffle_mask);
Then, after a loop which uses bg_color and fg_color, there is this residual loop:
int val = *ptmp;
for (int i = 0; i < rs; i++) {
    if (val & (1 << (7 - i))) {
        *(pdst + 0) = _mm_extract_epi8(fg_color, 0);
        *(pdst + 1) = _mm_extract_epi8(fg_color, 1);
        *(pdst + 2) = _mm_extract_epi8(fg_color, 2);
    } else {
        *(pdst + 0) = _mm_extract_epi8(bg_color, 0);
        *(pdst + 1) = _mm_extract_epi8(bg_color, 1);
        *(pdst + 2) = _mm_extract_epi8(bg_color, 2);
    }
    pdst += 3;
}
ptmp += 1;
What the compiler does is move those pextrb instructions outside the loop and store their results on the stack.
Then it unrolls the loop, and instead of using pextrb where I told it to, it does this:
mov edx, DWORD PTR [60+esp] ;56.6
$LN121:
mov BYTE PTR [eax+ebx*2], dl ;56.6
$LN122:
mov edx, DWORD PTR [64+esp] ;57.6
$LN123:
mov BYTE PTR [1+eax+ebx*2], dl ;57.6
$LN124:
mov edx, DWORD PTR [44+esp] ;58.6
$LN125:
mov BYTE PTR [2+eax+ebx*2], dl ;58.6
I mean, optimization is one thing (I know that it can and should hoist constant expressions out of the loop, for example), but what the heck is this? The computational result is the same, all right, but if I wanted to do a bunch of memory-to-memory copies I would do them manually.
Can someone knowledgeable enough (an Intel Engineer for example) explain why the compiler is doing this?
Is it really more cost-effective in terms of performance to unroll a short-running loop (rs <= 7) and use a ton of branches and memory-to-memory copies (through a register) than to emit the code as it was written, using intrinsics?
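For reference, the residual loop above boils down to the following scalar logic (a minimal sketch; `expand_bits` is a hypothetical name, and the fg/bg byte arrays stand in for the bytes held in fg_color and bg_color):

```c
#include <stdint.h>

/* Scalar equivalent of the residual loop: expand the top `rs` bits of
 * one source byte into 3-byte triplets, picking the foreground color
 * for 1 bits and the background color for 0 bits. */
static void expand_bits(uint8_t val, int rs,
                        const uint8_t fg[3], const uint8_t bg[3],
                        uint8_t *pdst)
{
    for (int i = 0; i < rs; i++) {
        const uint8_t *c = (val & (1 << (7 - i))) ? fg : bg;
        pdst[0] = c[0];
        pdst[1] = c[1];
        pdst[2] = c[2];
        pdst += 3;
    }
}
```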
11 Replies
Igor,
Why not allocate your pdst buffer one byte larger than necessary, then use
int val = *ptmp;
for (int i = 0; i < rs; i++) {
    if (val & (1 << (7 - i))) {
        _mm_store_ss((float*)pdst, _mm_castsi128_ps(fg_color));
    } else {
        _mm_store_ss((float*)pdst, _mm_castsi128_ps(bg_color));
    }
    pdst += 3;
}
ptmp += 1;
(or save/restore the one byte after the buffer)
In this case, I think using GP registers may be more efficient than using SSE --
something that will compile to a conditional move instruction (eliminating a branch).
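This overlapping-store idea can be sketched in scalar form (a hedged sketch, little-endian assumed, with a 4-byte memcpy standing in for the SSE store and `expand_bits_overlap` a hypothetical name). Each iteration writes the triplet plus one junk byte, then advances by 3 so the junk byte is overwritten by the next store -- hence the one extra byte of buffer:

```c
#include <stdint.h>
#include <string.h>

/* Overlapping-store sketch (little-endian): store a 32-bit color
 * (triplet plus one junk byte), then advance by 3 so the junk byte
 * is overwritten by the next iteration.  The destination must be
 * one byte larger than 3 * pixel count. */
static void expand_bits_overlap(uint8_t val, int rs,
                                uint32_t fg, uint32_t bg, uint8_t *pdst)
{
    for (int i = 0; i < rs; i++) {
        uint32_t c = (val & (1 << (7 - i))) ? fg : bg;
        memcpy(pdst, &c, 4);   /* 4-byte store, 1 byte of overrun */
        pdst += 3;
    }
}
```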
Jim Dempsey
Hi Igor,
I had a code generator developer read your post. He said it's too hard to see what's going on without a test case. I tried to mock one up, but I'm not getting code anywhere close to what you're describing. Can you submit something that compiles to what you describe, and (maybe more importantly) tell us what compiler options you're using?
Jim,
Writing one more byte is not an option here -- I am writing out RGB triplets directly to a texture buffer which is later uploaded to GPU memory -- the output format must remain the same.
Furthermore, the compiler seems unable to use conditional moves here (my guess is due to high register pressure).
Brandon,
Sure, here you go -- I have attached a self-contained test case which generates the mentioned code. Compile options are included in the file. You can also add /QxSSE4.2, but it doesn't change the output much given that the code is already written with intrinsics.
The function in the code converts a duotone (1-bit) BMP file into BGR using SIMD. If someone has a better idea of how to perform such a task, I am open to suggestions.
If you compile with the same options you will see the code I am talking about in test_code.asm at label .B1.15:.
I didn't test the impact of the compiler using GPRs and memory-to-memory copying vs. keeping pextrb in the loop, because that would require writing the whole function in assembler and I am kind of busy at the moment. I would really like to know whether the cost-modeling heuristic is doing the right thing, though.
Igor,
>>Writing one more byte is not an option here -- I am writing out RGB triplets directly to a texture buffer which is later uploaded to a GPU memory -- output format must remain the same.
If the texture buffer is in bulk RAM (as opposed to the video frame buffer), my suggestion still works. You get 3-byte packed triplets...
except that you write (clobber) one byte past the end of the buffer. Therefore
char RGBbuffer[pixels*3+1];
translateColors(bitArray, RGBbuffer, pixels);
copyToGPU(RGBbuffer, pixels);
The suggested code writes 4 bytes, three bytes containing the designated colors and one byte with junk, then walks down the buffer by 3 bytes for the next triplet. IOW, each subsequent write occurs 3 bytes after the preceding one.
Jim Dempsey
Igor,
Make this change to your test code, try it and report performance results.
*** remember that the dst buffer must be allocated at least 1 byte larger than necessary for usage as a texture buffer.
[cpp]
if (rs != 0) {
#if 0
    int val = *ptmp;
    for (int i = 0; i < rs; i++) {
        if (val & (1 << (7 - i))) {
            *(pdst + 0) = _mm_extract_epi8(fg_color, 0);
            *(pdst + 1) = _mm_extract_epi8(fg_color, 1);
            *(pdst + 2) = _mm_extract_epi8(fg_color, 2);
        } else {
            *(pdst + 0) = _mm_extract_epi8(bg_color, 0);
            *(pdst + 1) = _mm_extract_epi8(bg_color, 1);
            *(pdst + 2) = _mm_extract_epi8(bg_color, 2);
        }
        pdst += 3;
    }
#else
    u8 val = *ptmp;
    u8 valMask = 0x80;
    for (int i = 0; i < rs; i++) {
        if (val & valMask) {
            _mm_store_ss((float*)pdst, _mm_castsi128_ps(fg_color));
        } else {
            _mm_store_ss((float*)pdst, _mm_castsi128_ps(bg_color));
        }
        val <<= 1;
        pdst += 3;
    }
#endif
    ptmp += 1;
}
[/cpp]
Jim
Jim,
Thanks for the suggestion, I may try it out.
However, I still don't like the idea of allocating one byte more, because it complicates things when you have padded rows in the destination or have to divide the destination into tiles.
>>However, I still don't like the idea of allocating 1 byte more because that complicates things when you have padded rows in the destination or if you have to divide destination into tiles.
Then, when doing tiles, I would suggest investigating tiles 48 u8's wide (by however long you want). Assemble the output into 3 SSE registers, then write the 48 bytes using three writes. Handle the residual tiles that are not 48 u8's wide with different code.
The 1-byte overrun would be problematic if multi-threading; however, the 48-u8-wide method would be thread safe. Then, if you quadruple the tile width again, you would end up on a cache-line alignment: a 192-byte-wide tile is 64 RGB pixels. If you use a rolling-register technique, you do not need 12 registers for the accumulation of the pixels; IOW, roll in 3 or 6 registers.
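The tile arithmetic can be sanity-checked in a small sketch (hypothetical names; memcpy stands in for the three SSE register stores): 48 bytes is exactly 16 BGR pixels, so a tile row can be written with three 16-byte stores and no overrun.

```c
#include <stdint.h>
#include <string.h>

/* 48 bytes = 16 pixels * 3 bytes, so one tile row is exactly three
 * 16-byte (SSE-register-sized) stores with no byte of overrun. */
#define TILE_BYTES 48

static void store_tile48(const uint8_t piece[3][16], uint8_t *dst)
{
    for (int k = 0; k < 3; k++)
        memcpy(dst + 16 * k, piece[k], 16);  /* one 16-byte store each */
}
```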
Jim Dempsey
I was already considering that :)
Anyway, that still doesn't answer my question about the compiler, so let's wait for Brandon to get back to us.
Brandon, any news on this?
Hi, I wasn't catching updates on this thread. Thanks, Jennifer, for alerting me.
The compiler is pulling these loop invariants:
_mm_extract_epi8(fg_color, 0);
_mm_extract_epi8(fg_color, 1);
_mm_extract_epi8(fg_color, 2);
_mm_extract_epi8(bg_color, 0);
_mm_extract_epi8(bg_color, 1);
_mm_extract_epi8(bg_color, 2);
out of the loop so that it only has to compute them once. It spills them to memory due to register pressure, but the loads from memory should still be cheaper on the whole than executing pextrb every time.
The unrolling is a little less clear to me -- let me follow up on that.
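In scalar terms, the hoisting described here amounts to roughly the following (a sketch; `fill_rows` and its parameters are hypothetical, with the spilled pextrb results modeled as plain locals read once before the loop):

```c
/* After hoisting, the loop body contains no extracts at all -- only
 * byte moves from the precomputed (spilled) channel values, which is
 * what the mov sequence in the original post is doing. */
static void fill_rows(int n, const unsigned char fg[3], unsigned char *dst)
{
    unsigned char b = fg[0], g = fg[1], r = fg[2]; /* hoisted "extracts" */
    for (int i = 0; i < n; i++) {
        dst[3 * i + 0] = b;  /* plain mov, not pextrb */
        dst[3 * i + 1] = g;
        dst[3 * i + 2] = r;
    }
}
```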
Hmm...
I know about invariant expression elimination, but there is no computation involved -- pextrb should be pretty efficient on Nehalem and newer CPUs, right?
What is more interesting is that if the compiler kept fg_color and bg_color in two SIMD registers as I intended, it wouldn't have to spill the values to memory, because it wouldn't need GPRs for data movement at all -- it would just write out directly from a SIMD register each time.
The good thing is that the loop trip count here is low, so it probably doesn't make much difference in this particular case, but if the compiler applied the same logic to longer-running loops I am not so sure that would be desirable behavior.
What also bothers me a bit is that when I am writing code with intrinsics I expect it to be as close to verbatim as possible to what I wrote, because each intrinsic usually maps to a single assembly instruction -- I must admit that I really do not like the idea of the compiler messing with my original intent in this way.
The reason I don't like it is that I believe developers using intrinsics are usually advanced enough to know what they want to accomplish -- this kind of optimization can prevent them from doing what they want and, even worse, force them to rewrite the offending code in assembler. What I am saying is that if I really wanted to eliminate the expression from the loop and use GPRs, I would do it myself. You should be very careful with such overzealous optimizations.
