Solved: SSE4

hjazz · 01-25-2010

Hi,

I see on Wikipedia that SSE4.1 has the instructions PMULDQ and PMULLD for packed signed multiplication. So are there any instructions (preferably intrinsic functions) for packed unsigned multiplication?

Also, if I only used the compiler switch -msse4 without any explicit intrinsic functions, what kind of improvement could I expect? Would it come mostly from string functions such as strlen and strcmp that "automatically" get a performance boost?

Thank you.

Regards,

Rayne

Brijender_B_Intel · 01-26-2010

The intrinsic for PMULDQ is _mm_mul_epi32(), and the intrinsic for PMULLD is _mm_mullo_epi32().

The intrinsics are the same in the Intel and Microsoft Visual Studio compilers (note, however, that there is a typo on the MSDN website).

You can easily find the intrinsic names in the Software Developer's Manual:

http://www.intel.com/Assets/PDF/manual/253667.pdf
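As a rough illustration of the two signed intrinsics named above, here is a minimal sketch. The function names mul_even_signed and mul_low32 and the USE_SSE41 macro are made up for this example; the target attribute is a GCC/Clang feature that is assumed here so the code builds without -msse4.1 on the command line.

```c
#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1 intrinsics */

/* Assumption: on GCC/Clang, enable SSE4.1 per function instead of via a
   command-line switch. Other compilers get an empty macro. */
#if defined(__GNUC__)
#define USE_SSE41 __attribute__((target("sse4.1")))
#else
#define USE_SSE41
#endif

/* PMULDQ: multiplies the signed 32-bit values in lanes 0 and 2 and
   returns two signed 64-bit products. Lanes 1 and 3 are ignored. */
USE_SSE41
void mul_even_signed(const int32_t a[4], const int32_t b[4], int64_t out[2])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_mul_epi32(va, vb));
}

/* PMULLD: multiplies all four 32-bit lanes and keeps only the low
   32 bits of each product. */
USE_SSE41
void mul_low32(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_mullo_epi32(va, vb));
}
```

Note that _mm_mul_epi32() produces only two results per call because each product is 64 bits wide, while _mm_mullo_epi32() produces four truncated 32-bit results.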

There is also an unsigned packed doubleword integer multiply instruction, PMULUDQ (available since SSE2). Its intrinsics are:

_mm_mul_su32() (64-bit MMX operands)

_mm_mul_epu32() (128-bit XMM operands)
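A minimal sketch of the unsigned case with _mm_mul_epu32() might look like the following. The function names are hypothetical, and the code assumes an x86-64 target, where SSE2 is always available. The second function shows one possible way to reach the odd lanes, by shifting them into the even positions first.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics, including _mm_mul_epu32 */

/* PMULUDQ: multiplies the unsigned 32-bit values in lanes 0 and 2 and
   returns two unsigned 64-bit products. Lanes 1 and 3 are ignored. */
void mul_even_unsigned(const uint32_t a[4], const uint32_t b[4], uint64_t out[2])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_mul_epu32(va, vb));
}

/* Same instruction applied to lanes 1 and 3: shifting each 64-bit half
   right by 32 bits moves the odd lanes into the even positions. */
void mul_odd_unsigned(const uint32_t a[4], const uint32_t b[4], uint64_t out[2])
{
    __m128i va = _mm_srli_epi64(_mm_loadu_si128((const __m128i *)a), 32);
    __m128i vb = _mm_srli_epi64(_mm_loadu_si128((const __m128i *)b), 32);
    _mm_storeu_si128((__m128i *)out, _mm_mul_epu32(va, vb));
}
```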

Regarding your second question, I believe you are using the Intel compiler (the latest is 11.1), as it supports auto-vectorization, so you don't need to write intrinsics. I have found the compiler to be very aggressive about auto-vectorization: if it finds a loop that can be vectorized (based on its heuristics), it will vectorize it for you. You can also enable the vectorization report, which prints which loops were vectorized and which were not, so you can look at the remaining loops and vectorize them yourself. The switches for SSE4.1 or SSE4.2 are as follows:

-QxSSE4.1, -QxSSE4.2, -arch:SSE4.1

To get the vectorization report, use the following switch:

/Qvec-report2. You can try this switch with different "n" values (/Qvec-reportn). Please check the icl help for more information.
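For reference, the kind of loop an auto-vectorizer can typically handle looks like the sketch below. The function name mul_arrays is made up for this example, and whether a given compiler actually vectorizes it depends on the switches used and on its heuristics; the vectorization report is the way to confirm.

```c
#include <stddef.h>
#include <stdint.h>

/* Element-wise 32-bit multiply over two arrays. The 'restrict' qualifiers
   (C99) promise that the arrays do not overlap, which removes one common
   obstacle to vectorization. With SSE4.1 enabled, a compiler may choose
   PMULLD for the multiply, but that is up to the compiler. */
void mul_arrays(uint32_t *restrict dst, const uint32_t *restrict a,
                const uint32_t *restrict b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] * b[i];
}
```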

hjazz · 01-26-2010

Thank you very much for your reply!