
I have an integer-only application that I want to speed up a bit with SSE2. The tight loop uses add/sub/shift on int32_t, so I could easily convert it with intrinsics.

Before the tight loop the program performs an expensive setup step, which also works on vectors but involves some multiplication and division. Vectorizing the multiplication with SSE could gain some more performance (I am leaving the division as it is, since there is no integer division in SSE).

I looked up the intrinsics and found that the PMULUDQ instruction is used for 2 things:

- multiply a signed int by a signed int (32bitx32bit -> 64bit) _mm_mul_su32()

- multiply 2 unsigned ints by 2 unsigned ints (32bitx32bit -> 64bit again) _mm_mul_epu32()

What I need is a signed 32-bit x 32-bit -> 32-bit multiply (4-int vector), but this is only available in SSE4.1; failing that, at least a signed 32-bit x 32-bit -> 64-bit multiply (2-int vector). At first glance it didn't seem possible with SSE2, but Googling around I found that people do use _mm_mul_epu32() for _signed_ integer multiplications. I've written a small function with intrinsics (based on asm found with Google) and it really works, but I don't know why...

static inline __m128i muly(const __m128i &a, const __m128i &b)
{
    __m128i tmp1 = _mm_mul_epu32(a, b); /* 64-bit products of elements 0 and 2 */
    __m128i tmp2 = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4)); /* 64-bit products of elements 1 and 3 */
    /* move the low dword of each product into bits [63..0], then interleave the two halves */
    return _mm_unpacklo_epi32(_mm_shuffle_epi32(tmp1, _MM_SHUFFLE(0, 0, 2, 0)),
                              _mm_shuffle_epi32(tmp2, _MM_SHUFFLE(0, 0, 2, 0)));
}

Can someone please explain why this works?

In two's complement, negative integers have their most significant bits set to 1, and this holds for the lower 4 bytes of the result (which is why I get a proper 32-bit signed int), but in the upper 4 bytes the bit pattern is 1011 (decimal 11).

:|

How do the sign bits in the upper 4 bytes get magically fixed when I use _mm_mul_su32()? It maps to the same instruction, doesn't it?
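One way to look at this, sketched here in scalar code rather than intrinsics: reinterpreting a negative int32_t as unsigned adds 2^32 to it, so the unsigned and signed 64-bit products can only differ by multiples of 2^32. The low 32 bits are therefore the same either way, and only the high 32 bits need a correction. As far as the Intel intrinsics documentation describes it, _mm_mul_su32() is simply the __m64 form of the same unsigned multiply, so it would not fix the upper half either. The helper names below (`mul_unsigned`, `mul_signed`, `lo32`, `hi32`, `signed_hi_from_unsigned`, `check_pair`) are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Full 64-bit product with both inputs reinterpreted as unsigned
// (the per-lane behaviour of an unsigned 32x32 -> 64 multiply).
static uint64_t mul_unsigned(int32_t a, int32_t b)
{
    return static_cast<uint64_t>(static_cast<uint32_t>(a)) * static_cast<uint32_t>(b);
}

// Full 64-bit product with both inputs treated as signed.
static int64_t mul_signed(int32_t a, int32_t b)
{
    return static_cast<int64_t>(a) * b;
}

// Low and high 32-bit halves of a 64-bit value.
static uint32_t lo32(uint64_t v) { return static_cast<uint32_t>(v); }
static uint32_t hi32(uint64_t v) { return static_cast<uint32_t>(v >> 32); }

// Derive the high half of the signed product from the unsigned one:
// subtract b (mod 2^32) if a is negative, and a (mod 2^32) if b is negative.
static uint32_t signed_hi_from_unsigned(int32_t a, int32_t b)
{
    uint32_t hi = hi32(mul_unsigned(a, b));
    if (a < 0) hi -= static_cast<uint32_t>(b);
    if (b < 0) hi -= static_cast<uint32_t>(a);
    return hi;
}

// Illustrative check for one pair of inputs.
static void check_pair(int32_t a, int32_t b)
{
    const uint64_t u = mul_unsigned(a, b);
    const uint64_t s = static_cast<uint64_t>(mul_signed(a, b));
    assert(lo32(u) == lo32(s));                        // low halves always agree
    assert(signed_hi_from_unsigned(a, b) == hi32(s));  // high half after correction
}
```

For example, with a = -3 and b = 5 the unsigned product is 0x4FFFFFFF1: the low half 0xFFFFFFF1 is -15 as a signed int, while the high half is 4 rather than the 0xFFFFFFFF a sign-extended result would have.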

Regards,

Gyorgy Szekely


4 Replies


I'm moving this from the Intel AVX and CPU instructions forum. The compiler forums are a better place for general assembler questions (or SSE2), as the Intel compilers support inline assembly, and you will be more likely to get an answer here.

Thanks for your question.

==

Aubrey W.

Intel Software Network Support


The Intel Compiler comes with documentation on the SSE intrinsics as well. You could try the 30-day evaluation of the Intel C++ compiler. If that is not long enough, you can get one more evaluation using the same email address.

Thanks,

Jennifer


Correct me if I'm wrong, but I strongly doubt that the Intel compiler is able to emit this kind of SSE multiplication automatically (here, a mullo between unsigned longs).

It does so only when it is really straightforward (like mulsd and mulss).

Plus, I haven't measured it, but you can do this in fewer instructions using SHUFPS:

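static inline __m128i muly(const __m128i &a, const __m128i &b)
{
    __m128 tmp1 = _mm_castsi128_ps(_mm_mul_epu32(a, b)); /* 64-bit products of elements 0 and 2 */
    __m128 tmp2 = _mm_castsi128_ps(_mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4))); /* 64-bit products of elements 1 and 3 */
    /* shufps gathers the low dwords in the order 0,2,1,3 ... */
    __m128i mixed = _mm_castps_si128(_mm_shuffle_ps(tmp1, tmp2, _MM_SHUFFLE(2, 0, 2, 0)));
    /* ... so one pshufd is still needed to restore the order 0,1,2,3 */
    return _mm_shuffle_epi32(mixed, _MM_SHUFFLE(3, 1, 2, 0));
}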
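Shuffle-based variants like the ones in this thread are easy to get wrong in the lane order, so a scalar cross-check may be worth keeping around. The following is only a sketch, assuming an x86 target with SSE2 enabled; `mullo_sse2`, `mullo_scalar`, `lanes_match` and `run_checks` are illustrative names, and `mullo_sse2` follows the unpack-based version from the original question.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <array>
#include <cassert>
#include <cstdint>

// 4 x (32-bit * 32-bit -> low 32 bits) using SSE2 only.
static inline __m128i mullo_sse2(__m128i a, __m128i b)
{
    // 64-bit products of lanes 0 and 2.
    __m128i even = _mm_mul_epu32(a, b);
    // Shift lanes 1 and 3 down to positions 0 and 2, then multiply them.
    __m128i odd = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4));
    // Put the low dword of each product into dwords 0 and 1, then interleave.
    return _mm_unpacklo_epi32(_mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0)),
                              _mm_shuffle_epi32(odd, _MM_SHUFFLE(0, 0, 2, 0)));
}

// Scalar reference: multiply as unsigned to avoid signed overflow,
// then keep the low 32 bits.
static inline int32_t mullo_scalar(int32_t x, int32_t y)
{
    return static_cast<int32_t>(static_cast<uint32_t>(x) * static_cast<uint32_t>(y));
}

// Returns true when every lane of the SSE2 result equals the scalar reference.
static bool lanes_match(const std::array<int32_t, 4> &x, const std::array<int32_t, 4> &y)
{
    const __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(x.data()));
    const __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(y.data()));
    std::array<int32_t, 4> out{};
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out.data()), mullo_sse2(a, b));
    for (int i = 0; i < 4; ++i)
        if (out[i] != mullo_scalar(x[i], y[i]))
            return false;
    return true;
}

// A few hand-picked inputs, with distinct values per lane so that a
// swapped lane order would be detected.
static void run_checks()
{
    const std::array<int32_t, 4> x = {{-3, 7, 123456789, INT32_MIN}};
    const std::array<int32_t, 4> y = {{5, -11, -987654321, -1}};
    assert(lanes_match(x, y));
}
```

Calling `run_checks()` once at startup (or from a unit test) is enough to catch a wrong shuffle constant; swapping in another implementation for `mullo_sse2` lets the same check cover it.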

Regards,

Emmanuel


The vectorizer sometimes does fairly sophisticated things like you're describing, so it might be worth a shot. If you post a sample loop of what you're trying to do we could try it out. In my case, with this function:

void foo(int *a, int *b, int *c, int N)
{
    int i;
    for (i = 0; i < N; i++) {
        a[i] = b[i] * c[i];
    }
}

It works fine with -xSSE4.1.

Thanks!

Dale
