SSE2 signed 32bit integer multiplication

hoditohod · ‎07-07-2010

Hi All,
I have an integer-only application that I want to speed up a bit with SSE2. The tight loop uses add/sub/shift on int32_t, so I could easily convert it with intrinsics.
Before the tight loop the program performs an expensive setup step, which is also done on vectors but involves some multiplication and division. Having the multiplication SSE'd could help gain some more performance (I'll leave the division as it is, since there's no integer division in SSE).

I looked up the intrinsics and found that the PMULUDQ instruction is used for 2 things:
- multiply a signed int by a signed int (32bitx32bit -> 64bit) _mm_mul_su32()
- multiply 2 unsigned ints by 2 unsigned ints (32bitx32bit -> 64bit again) _mm_mul_epu32()

Now I need signed int 32bitx32bit -> 32bit (4 int vector), but this is only available in SSE4, or at least signed int 32bitx32bit -> 64bit (2 int vector). At first glance it didn't seem to be possible with SSE2, but Googling around, I found that people do use _mm_mul_epu32() for _signed_ integer multiplications. I've created a small function with intrinsics (based on asm found with Google) and it really works, but I don't know why...

static inline __m128i muly(const __m128i &a, const __m128i &b)
{
    __m128i tmp1 = _mm_mul_epu32(a, b); /* mul 2,0 */
    __m128i tmp2 = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4)); /* mul 3,1 */
    /* shuffle results to [63..0] and pack */
    return _mm_unpacklo_epi32(_mm_shuffle_epi32(tmp1, _MM_SHUFFLE(0, 0, 2, 0)),
                              _mm_shuffle_epi32(tmp2, _MM_SHUFFLE(0, 0, 2, 0)));
}
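A minimal sketch for checking a function like this against plain scalar multiplication. The names mullo_epi32_sse2, mul4, matches_scalar and self_check_muly are made up for this example; the intrinsic sequence is the same as in muly() above. The scalar reference multiplies as unsigned and then reinterprets the result, to avoid signed-overflow undefined behaviour in the reference itself.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>
#include <cassert>

// Same instruction sequence as muly(), under a hypothetical name.
inline __m128i mullo_epi32_sse2(__m128i a, __m128i b)
{
    __m128i even = _mm_mul_epu32(a, b);                  // 64-bit products of lanes 0 and 2
    __m128i odd  = _mm_mul_epu32(_mm_srli_si128(a, 4),
                                 _mm_srli_si128(b, 4));  // 64-bit products of lanes 1 and 3
    // keep only the low dword of each product and interleave them
    return _mm_unpacklo_epi32(_mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0)),
                              _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0)));
}

// Multiply four int32_t pairs and write the four wrapped products to out.
inline void mul4(const int32_t x[4], const int32_t y[4], int32_t out[4])
{
    __m128i r = mullo_epi32_sse2(
        _mm_loadu_si128(reinterpret_cast<const __m128i *>(x)),
        _mm_loadu_si128(reinterpret_cast<const __m128i *>(y)));
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out), r);
}

// True if every lane equals the scalar product reduced modulo 2^32.
inline bool matches_scalar(const int32_t x[4], const int32_t y[4])
{
    int32_t out[4];
    mul4(x, y, out);
    for (int i = 0; i < 4; ++i) {
        uint32_t wrapped = static_cast<uint32_t>(x[i]) * static_cast<uint32_t>(y[i]);
        if (out[i] != static_cast<int32_t>(wrapped)) return false;
    }
    return true;
}

// A few mixed-sign cases, including the INT32_MIN corner.
inline void self_check_muly()
{
    const int32_t x[4] = {1, -2, 3, INT32_MIN};
    const int32_t y[4] = {5, 6, -7, -1};
    assert(matches_scalar(x, y));
}
```

Calling self_check_muly() from main() is enough to exercise it; more inputs can be fed through matches_scalar() the same way.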

Can someone please explain this to me?
In two's complement, negative integers have their most significant bits set to 1, and this holds for the lower 4 bytes of the result (that's why I get a proper 32-bit signed int), but in the upper 4 bytes the bit pattern is 1011 (decimal 11).
:|
How do the sign bits in the upper 4 bytes get magically fixed when I use _mm_mul_su32()? It maps to the same instruction, doesn't it?

Regards,
Gyorgy Szekely
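
One way to look at the question above, as a scalar sketch rather than a definitive answer: reading a negative int32_t a as unsigned gives a + 2^32, so the unsigned product is a*b + 2^32*b. The extra term only touches bits 32 and up, which is why the low 32 bits come out right while the upper 32 bits do not look sign-extended. As far as the intrinsics documentation goes, _mm_mul_su32() is also an unsigned multiply (the __m64 form of PMULUDQ), so it would not fix the upper half either. The helper names below are made up for this example.

```cpp
#include <cstdint>
#include <cassert>

// Full 64-bit product with both inputs read as unsigned (what PMULUDQ computes per lane).
inline uint64_t product_unsigned(int32_t a, int32_t b)
{
    return static_cast<uint64_t>(static_cast<uint32_t>(a)) * static_cast<uint32_t>(b);
}

// Full 64-bit product with both inputs read as signed; cannot overflow in 64 bits.
inline uint64_t product_signed(int32_t a, int32_t b)
{
    return static_cast<uint64_t>(static_cast<int64_t>(a) * b);
}

inline uint32_t low32(uint64_t p)  { return static_cast<uint32_t>(p); }
inline uint32_t high32(uint64_t p) { return static_cast<uint32_t>(p >> 32); }

// -3 * 5: the low halves agree, the high halves do not.
inline void self_check_low_bits()
{
    assert(low32(product_unsigned(-3, 5)) == low32(product_signed(-3, 5)));
    assert(high32(product_unsigned(-3, 5)) != high32(product_signed(-3, 5)));
}
```

For -3 * 5 the unsigned product is 0x00000004FFFFFFF1 and the signed one is 0xFFFFFFFFFFFFFFF1: the same low dword (-15 as an int32_t), different high dwords.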

Aubrey_W_ · ‎07-15-2010

I'm moving this from the Intel AVX and CPU instructions forum. The compiler forums are a better place for general assembler questions (or SSE2), as the Intel compilers support inline assembly, and you will be more likely to get an answer here.

Thanks for your question.

==

Aubrey W.

Intel Software Network Support

JenniferJ · ‎07-19-2010

Instead of using SSE2 intrinsics directly, you can use the Intel C++ Compiler to generate the SSE2 code automatically, so you don't have to maintain the intrinsics but only specify a compiler option. The compiler will use SSE2 throughout the program as much as possible. The compiler can also generate multiple code paths for SSE2, SSE3, SSE4, etc.

The Intel Compiler comes with documentation about the SSE intrinsics as well. You could try out the 30-day eval of the Intel C++ Compiler. If that's not long enough, you can get one more eval using the same email address.

Thanks,
Jennifer

emmanuel_attia · ‎07-26-2010

Hello,

Correct me if I'm wrong, but I strongly doubt that the Intel compiler is able to emit this kind of SSE multiplication automatically (here, mullo between unsigned long).

It does so only when it is really straightforward (like mulsd and mulss).

Plus, I haven't measured, but you can do this in fewer instructions using SHUFPS:

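static inline __m128i muly(const __m128i &a, const __m128i &b)
{
    __m128 tmp1 = _mm_castsi128_ps(_mm_mul_epu32(a, b)); /* mul 2,0 */
    __m128 tmp2 = _mm_castsi128_ps(_mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4))); /* mul 3,1 */
    /* SHUFPS picks the low dwords, but in lane order 0,2,1,3 */
    __m128i tmp3 = _mm_castps_si128(_mm_shuffle_ps(tmp1, tmp2, _MM_SHUFFLE(2, 0, 2, 0)));
    return _mm_shuffle_epi32(tmp3, _MM_SHUFFLE(3, 1, 2, 0)); /* restore lane order 0,1,2,3 */
}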
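
A small sketch for checking the lane order of a SHUFPS-based variant. The names mullo_shufps, mul4_shufps and self_check_shufps are hypothetical. It relies on _mm_shuffle_ps taking its two low result lanes from the first operand and its two high lanes from the second, so selecting dwords 0 and 2 of each input yields products in the order 0,2,1,3, and this sketch adds one PSHUFD to put them back in order.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in the SSE float shuffles)
#include <cstdint>
#include <cassert>

// 32x32 -> low 32 multiply using one SHUFPS plus one PSHUFD for the packing step.
inline __m128i mullo_shufps(__m128i a, __m128i b)
{
    __m128 even = _mm_castsi128_ps(_mm_mul_epu32(a, b));                 // products 0 and 2
    __m128 odd  = _mm_castsi128_ps(_mm_mul_epu32(_mm_srli_si128(a, 4),
                                                 _mm_srli_si128(b, 4))); // products 1 and 3
    // result lanes: even[0], even[2], odd[0], odd[2]  ->  p0, p2, p1, p3
    __m128i mixed = _mm_castps_si128(_mm_shuffle_ps(even, odd, _MM_SHUFFLE(2, 0, 2, 0)));
    // swap the two middle lanes  ->  p0, p1, p2, p3
    return _mm_shuffle_epi32(mixed, _MM_SHUFFLE(3, 1, 2, 0));
}

// Multiply four int32_t pairs and write the four wrapped products to out.
inline void mul4_shufps(const int32_t x[4], const int32_t y[4], int32_t out[4])
{
    __m128i r = mullo_shufps(
        _mm_loadu_si128(reinterpret_cast<const __m128i *>(x)),
        _mm_loadu_si128(reinterpret_cast<const __m128i *>(y)));
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out), r);
}

// Distinct products per lane, so any lane permutation would be caught.
inline void self_check_shufps()
{
    const int32_t x[4] = {1, -2, 3, -4};
    const int32_t y[4] = {5, 6, -7, -8};
    int32_t out[4];
    mul4_shufps(x, y, out);
    assert(out[0] == 5 && out[1] == -12 && out[2] == -21 && out[3] == 32);
}
```

Using inputs whose four products are all different matters here: with equal lanes, a swapped order would go unnoticed.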

Regards,

Emmanuel

Dale_S_Intel · ‎07-29-2010

Could you perhaps illustrate with an example of what your original (non-sse) code would look like? As you have seen, it's not quite simple with SSE2. Of course, I could suggest that you try using a processor that supports SSE4.1, and thus the PMULLD instruction, as I can assure you the compiler handles that quite nicely, but then I'd be in danger of sounding like a salesman :-)

The vectorizer sometimes does fairly sophisticated things like you're describing, so it might be worth a shot. If you post a sample loop of what you're trying to do we could try it out. In my case, with this function:

[bash]void foo(int *a, int *b, int *c, int N)
{
    int i;

    for (i=0; i<N; i++) {
        a[i] = b[i]*c[i];
    }
}
[/bash]

It works fine with -xSSE4.1.

Thanks!
Dale