IppiMulC_8u_C4RSfs kinda slow

gol · ‎05-01-2009

Last time I reported that a function was slow, it was a function that I didn't understand, but here's one that's simpler and more obvious.

It's typically used to lighten or darken 8-bit RGBA bitmaps. I already had my own function and tried to replace it with this one, but the IPP version ended up around 2.5x slower (or maybe 2x slower when the scale factor is zero, but for pretty much any use you need the scale factor with this function).

My function isn't exactly the same: it has no scale factor parameter, the scale factor is assumed to be 8. However, since the function lightens by 8-bit values, a scale factor other than 8 looks unnecessary to me, except for compatibility. And in case it is necessary, there could still be a special case for the (usual?) scale factor of 8 here.

Ok so it's all MMX on a Q6600.
I assume that IPP uses SSE2 instead of the obsolete MMX, so maybe this is a case of theory failing against practice? I don't know, but my version is definitely faster for pretty much the same thing/same results.
Note that with 16-byte aligned source & dest, IppiMulC_8u_C4RSfs is still slower, but only by 1.5x or so (with aligned buffers and the scale factor set to zero, it's still slower).

I also don't know if the IPP function rounds or truncates; I would assume it truncates here (because of the scale factor). A handy instruction that I used to use in this code (for rounding) was PMULHRW, but it's a 3DNow one.

(pascal convention) Source buffer in EAX, dest buffer in EDX, length in ECX, Alpha as a parameter.
Alpha is a single value, but it's pretty easy to adapt the first line for a packed RGBA. However, this function can use alpha above 255.

MOVD mm7,Alpha
PSHUFW mm7,mm7,00000000b // broadcast Alpha to 4*16bit words

PUSH ECX
SHR ECX,1
TEST ECX,ECX
JZ @Remainder

@Loop:
MOVD mm4,[EAX ]
MOVD mm5,[EAX+4]

PXOR mm0,mm0
PXOR mm1,mm1

PUNPCKLBW mm0,mm4
PUNPCKLBW mm1,mm5

PMULHUW mm0,mm7
PMULHUW mm1,mm7

PACKUSWB mm0,mm0
PACKUSWB mm1,mm1

MOVD [EDX ],mm0
MOVD [EDX+4],mm1

ADD EAX,8
ADD EDX,8
DEC ECX
JNZ @Loop

@Remainder:
POP ECX
AND ECX,1
JECXZ @TheEnd

// remainder
MOVD mm4,[EAX]
PXOR mm0,mm0
PUNPCKLBW mm0,mm4 // 4*16bit RGB Color in high bytes
PMULHUW mm0,mm7
PACKUSWB mm0,mm0
MOVD [EDX],mm0

@TheEnd:
EMMS
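
For readers who don't read MMX, here is a rough scalar C model of what the routine above computes per byte. It is only a sketch: the names mul_alpha_ref and pack_us are made up for illustration, and length is taken to be a pixel count (4 bytes each), which is how the loop above consumes it.

```c
#include <stdint.h>
#include <stddef.h>

/* Models PACKUSWB for one lane: the 16-bit word is read as SIGNED,
   then clamped to 0..255. */
static uint8_t pack_us(uint16_t w)
{
    int16_t s = (int16_t)w;
    if (s < 0)   return 0;
    if (s > 255) return 255;
    return (uint8_t)s;
}

/* Scalar model of the MMX loop (hypothetical name).
   len   : number of RGBA pixels, 4 bytes each
   alpha : 16-bit multiplier applied to all 4 channels, 256 = unchanged */
static void mul_alpha_ref(const uint8_t *src, uint8_t *dst,
                          size_t len, uint16_t alpha)
{
    for (size_t i = 0; i < len * 4; ++i) {
        /* PUNPCKLBW into a zeroed register: byte lands in the high half */
        uint32_t word = (uint32_t)src[i] << 8;
        /* PMULHUW: keep the high 16 bits, i.e. (src * alpha) >> 8, truncated */
        uint16_t hi = (uint16_t)((word * alpha) >> 16);
        dst[i] = pack_us(hi);
    }
}
```

So the result is (src * alpha) >> 8, truncated and saturated to 255. Because PACKUSWB reads its input as signed words, the saturation only behaves for products below 32768, which is far beyond the 0..256 alpha range discussed here.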

Vladimir_Dudnik · ‎05-06-2009

Hello,

here is a comment provided by our expert:

1. We would suggest an improvement in your code - instead of:

PACKUSWB mm0,mm0
PACKUSWB mm1,mm1

MOVD [EDX ],mm0
MOVD [EDX+4],mm1

you can use:

PACKUSWB mm0,mm1
MOVQ [EDX ],mm0

2. It is not clear how you can use our ippiMulC_8u_C4RSfs function if your alpha value is Ipp16u.

3. This sample is a classical case of a special Mul function (with scaleFactor = 16) and truncation of the Mul operation's low part (by using the PMULHUW instruction). For comparison you can see our IPP optimized core for scaleFactor = 1..16:

do {

t1 = _mm_load_si128((__m128i*)pSrc);
t0 = _mm_unpacklo_epi8(t1, emmZero);
t1 = _mm_unpackhi_epi8(t1, emmZero);
t0 = _mm_mullo_epi16(t0, emmValue);
t1 = _mm_mullo_epi16(t1, emmValue);
t2 = _mm_srli_epi16(t0, 1);
t0 = _mm_and_si128(t0, emm_1);
t3 = _mm_srl_epi16(t2, emmScaleFactor);
t0 = _mm_add_epi16(t0, emmConst);
t3 = _mm_and_si128(t3, emm_1);
t0 = _mm_add_epi16(t0, t3);
t0 = _mm_srli_epi16(t0, 1);
t0 = _mm_add_epi16(t0, t2);
t0 = _mm_srl_epi16(t0, emmScaleFactor);
t2 = _mm_srli_epi16(t1, 1);
t1 = _mm_and_si128(t1, emm_1);
t3 = _mm_srl_epi16(t2, emmScaleFactor);
t1 = _mm_add_epi16(t1, emmConst);
t3 = _mm_and_si128(t3, emm_1);
t1 = _mm_add_epi16(t1, t3);
t1 = _mm_srli_epi16(t1, 1);
t1 = _mm_add_epi16(t1, t2);
t1 = _mm_srl_epi16(t1, emmScaleFactor);
t0 = _mm_packus_epi16(t0, t1);
_mm_store_si128((__m128i*)pDst, t0);

pSrc += 16;
pDst += 16;

} while( --tmp );

Regards,
Vladimir

gol · ‎05-06-2009

1. We would suggest improvement in your code - instead of:

Oops indeed, I didn't think of that.
I also haven't compared against loading the first 2 pixels with a MOVQ & unpacking, which may well be faster. I also can't align my code (Delphi here), which may cost some cycles as well.

It is not clear how you can use our ippiMulC_8u_C4RSfs function if your alpha value is Ipp16u.

I more or less need an alpha in 0..256, but I can live without 256; after all, the function wouldn't do anything in this case.

I see you indeed use SSE instead of MMX.

If I'm not mistaken, considering that for the 8bit version of ippiMulC the alpha is also 8bit, that leaves us with valid scale factors up to 15 (if truncating) or 16 (if rounding, although it wouldn't be very useful).
My code would normally cover full precision for scale factors up to 8. But the thing is, while you may need scale factors below 8 (to lighten UP), scale factors above 8 would only be to darken with a bit more precision (if that even makes sense for 8bit bitmaps), so a special case for 8 would be faster for the most common needs.
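
The limits above can be checked with a bit of arithmetic: the largest 8bit x 8bit product is 255 * 255 = 65025, just under 2^16. A minimal sketch (the function names are made up, and "rounding" here just means adding half before the shift, which may not be what IPP does):

```c
#include <stdint.h>

/* Largest possible result of an 8-bit x 8-bit product (255 * 255)
   after a right shift by sf, with truncation. */
static unsigned max_truncated(int sf)
{
    return (255u * 255u) >> sf;
}

/* Same, but rounding to nearest by adding half before shifting.
   Requires sf >= 1. */
static unsigned max_rounded(int sf)
{
    return (255u * 255u + (1u << (sf - 1))) >> sf;
}
```

With truncation, a scale factor of 15 still yields 1 at most and 16 always yields 0; with rounding, 16 still yields 1 and 17 always yields 0. It also shows that with a scale factor of 8 and truncation, multiplying white by 255 gives 254, so an 8bit alpha can't express "unchanged".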

But anyway, I suppose this ippiMulC function comes in this format because it matches similar functions for signals & other parts of the library. That's logical, however I think it's weird for practical use, and also that some key functions are missing, while I see a lot of math functions on bitmaps for which I can't imagine any use (like bit-shifting & xor-ing.. bitmaps?).

What I'd have liked to see in ippi are functions to:

-mul & add by constants, typically to fill an area with a color, with blending. This is common, and right now you'd have to use mul then add, losing precision

-add by a signed value, instead of AddC/SubC, in order to fill an area with additive blending (right now you can use ippiAdd & ippiSub without losing precision, but well it's 2 calls)

-fill an area using a mask, but what I don't understand with masks used in ippi, is that they're booleans, not alpha values, that's a lot less useful IMHO.
I tried to see what I could use to mask-fill with a color, something you'd need for a font blitter for example (although with Cleartype you'd rather need an RGB mask but anyway), but I couldn't find anything that wouldn't involve many steps & temp buffers.
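
To make the wish list concrete, here is a scalar sketch of the first and third items: a fused mul & add fill with a constant alpha, and a fill through an 8bit alpha mask. None of these functions exist in IPP as far as this thread shows; the names blend8, fill_blend and fill_mask are invented for illustration, and the alpha range 0..256 is an assumption.

```c
#include <stdint.h>
#include <stddef.h>

/* Blend one 8-bit channel toward `color` in a single rounded step.
   a is 0..256: 0 keeps dst, 256 replaces it with color. */
static uint8_t blend8(uint8_t dst, uint8_t color, unsigned a)
{
    return (uint8_t)((dst * (256u - a) + color * a + 128u) >> 8);
}

/* Fill n_pixels RGBA pixels with `rgba`, blended with constant alpha a. */
static void fill_blend(uint8_t *px, size_t n_pixels,
                       const uint8_t rgba[4], unsigned a)
{
    for (size_t i = 0; i < n_pixels; ++i)
        for (int c = 0; c < 4; ++c)
            px[i * 4 + c] = blend8(px[i * 4 + c], rgba[c], a);
}

/* Same fill, but alpha comes from a per-pixel 8-bit mask (not a boolean). */
static void fill_mask(uint8_t *px, const uint8_t *mask, size_t n_pixels,
                      const uint8_t rgba[4])
{
    for (size_t i = 0; i < n_pixels; ++i) {
        /* map 0..255 to 0..256 so that 255 means fully opaque */
        unsigned a = mask[i] + (mask[i] >> 7);
        for (int c = 0; c < 4; ++c)
            px[i * 4 + c] = blend8(px[i * 4 + c], rgba[c], a);
    }
}
```

Doing the multiply and the add in one expression means there is a single rounding at the end, instead of one loss in the mul and another in the add.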

Now I understand the library can never be complete, it's just that I find it weird to see functions that seem to be totally useless on bitmaps, while key ones are missing.

Btw, I can imagine Ippi being useful for image editing/analysis, not-so-much for games because you'd need speed only a 3D card can provide, but certainly for GUI's, because for a GUI you'd rather avoid having to rely on risky 3D hardware if possible.
This is why I was talking about these basic filling functions that are often used for GUI elements.
However, something that's more & more useful is vectorial stuff. So, functions to accelerate rasterizing would be welcome in IPP. There's the existing (all-software, slow but very versatile) Antigrain, the upcoming OpenVG, the upcoming M$ Direct2D. I don't know who will win the war, but I think it really is the CPU's job to do vectorial stuff, so I'd love to see something like this in a future IPP.
I've already used block-square root processing from ipps, in order to speed up antialiased disk & line drawing, it worked quite well.

gol · ‎05-06-2009

PACKUSWB mm0,mm1
MOVQ [EDX ],mm0

Btw, this didn't make the code faster, in fact it's slightly slower. I never really try to understand why, there are so many rules behind it..

Ultimately the loading too can be simplified, so it becomes as small as:

@Loop:
MOVQ mm4,[EAX]

PXOR mm0,mm0
PXOR mm1,mm1

PUNPCKLBW mm0,mm4
PUNPCKHBW mm1,mm4

PMULHUW mm0,mm7
PMULHUW mm1,mm7

PACKUSWB mm0,mm1
MOVQ [EDX],mm0

ADD EAX,8
ADD EDX,8
DEC ECX
JNZ @Loop

..but this is slightly slower too. Could be the MOVQ on unaligned data.
I unrolled the loop from 2 to 4x, and it does get faster, but ultimately not faster than the original, longer not-so-well-thought version.

MOVQ mm4,[EAX]
MOVQ mm5,[EAX+8]

PXOR mm0,mm0
PXOR mm1,mm1
PXOR mm2,mm2
PXOR mm3,mm3

PUNPCKLBW mm0,mm4
PUNPCKHBW mm1,mm4
PUNPCKLBW mm2,mm5
PUNPCKHBW mm3,mm5

PMULHUW mm0,mm7
PMULHUW mm1,mm7
PMULHUW mm2,mm7
PMULHUW mm3,mm7

PACKUSWB mm0,mm1
PACKUSWB mm2,mm3

MOVQ [EDX],mm0
MOVQ [EDX+8],mm2

ADD EAX,16
ADD EDX,16
DEC ECX

(side note: I've tried an SSE2 version, using pretty much the same instructions, but xmm-style. It's more or less 1/3 faster on aligned data, and 1/3 slower on unaligned data. And it's really linked to alignment, something I've noticed in general is that it's better to load unaligned data as separate parts, rather than using the unaligned packed movs (2 MOVQ's+interleaving end up faster than one MOVDQU))
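
For reference, the xmm-style version mentioned in the side note could look roughly like this with SSE2 intrinsics. This is a sketch, not the code that was actually timed: the name mul_alpha_sse2 is made up, it uses unaligned loads and stores throughout, and it assumes the byte count is a multiple of 16.

```c
#include <stdint.h>
#include <stddef.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Multiply every byte by alpha/256 with truncation and saturation.
   n_bytes must be a multiple of 16 in this sketch (4 RGBA pixels per step). */
static void mul_alpha_sse2(const uint8_t *src, uint8_t *dst,
                           size_t n_bytes, uint16_t alpha)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i a    = _mm_set1_epi16((short)alpha);  /* alpha in all 8 words */

    for (size_t i = 0; i < n_bytes; i += 16) {
        __m128i s  = _mm_loadu_si128((const __m128i *)(src + i));
        /* zero goes in the low byte, so each source byte lands in the
           high half of its word (same trick as PUNPCKLBW mm0,mm4) */
        __m128i lo = _mm_unpacklo_epi8(zero, s);
        __m128i hi = _mm_unpackhi_epi8(zero, s);
        lo = _mm_mulhi_epu16(lo, a);   /* (src * alpha) >> 8 */
        hi = _mm_mulhi_epu16(hi, a);
        _mm_storeu_si128((__m128i *)(dst + i), _mm_packus_epi16(lo, hi));
    }
}
```

When the buffers are known to be 16-byte aligned, _mm_load_si128 and _mm_store_si128 can replace the unaligned variants.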

Finally, I don't much understand what's after the mul in your code from IPP. Is it for rounding? Well I can't figure out why it's needed, is it to support scale factors above 8?
t2 = _mm_srli_epi16(t0, 1);
t0 = _mm_and_si128(t0, emm_1);
t3 = _mm_srl_epi16(t2, emmScaleFactor);
t0 = _mm_add_epi16(t0, emmConst);
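
One possible reading of those steps, written out in scalar C. The constants are not given in the thread, so this is a guess and not a statement about IPP: it assumes emm_1 holds 1 in every lane, emmScaleFactor holds scaleFactor - 1, and emmConst holds (1 << (scaleFactor - 1)) - 1. Under those assumptions the sequence is a round-half-to-even shift by scaleFactor that never needs more than 16 bits, which would also explain how it supports scale factors up to 16.

```c
#include <stdint.h>

/* Lane-by-lane translation of the quoted steps (hypothetical name).
   ASSUMPTIONS, not confirmed by the thread:
     emm_1          = 1
     emmScaleFactor = sf - 1
     emmConst       = (1 << (sf - 1)) - 1
   Valid for sf in 1..16. */
static uint16_t scale_steps(uint16_t p, int sf)
{
    const uint16_t shift = (uint16_t)(sf - 1);
    const uint16_t konst = (uint16_t)((1u << (sf - 1)) - 1u);
    uint16_t t0 = p, t2, t3;

    t2 = (uint16_t)(t0 >> 1);      /* p / 2                          */
    t0 = (uint16_t)(t0 & 1);       /* the bit lost by that shift     */
    t3 = (uint16_t)(t2 >> shift);  /* p >> sf                        */
    t0 = (uint16_t)(t0 + konst);   /* lost bit + (half - 1)          */
    t3 = (uint16_t)(t3 & 1);       /* is the truncated result odd?   */
    t0 = (uint16_t)(t0 + t3);      /* ties go up only when it is odd */
    t0 = (uint16_t)(t0 >> 1);
    t0 = (uint16_t)(t0 + t2);      /* equals (p + rounding term) / 2 */
    t0 = (uint16_t)(t0 >> shift);  /* total shift is now sf          */
    return t0;
}

/* Round-half-to-even shift done in 32 bits, for comparison. */
static uint32_t scale_wide(uint32_t p, int sf)
{
    uint32_t half_minus_1 = (1u << (sf - 1)) - 1u;
    return (p + half_minus_1 + ((p >> sf) & 1u)) >> sf;
}
```

The halving up front is what keeps p plus the rounding term from overflowing a 16bit lane, so if this reading is right, the extra instructions are the price of rounding plus the wider scale factor range.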