Hi,
I see on Wikipedia that SSE4.1 has the instructions PMULDQ and PMULLD for packed signed multiplication. So are there any instructions (preferably intrinsic functions) for packed unsigned multiplication?
Also, if I were only to use the compiler switch -msse4 without any explicit intrinsic functions, what kind of improvement can I expect, i.e. mostly from string functions such as strlen and strcmp that "automatically" get a performance boost?
Thank you.
Regards,
Rayne
The intrinsic for PMULDQ is _mm_mul_epi32(), and the intrinsic for PMULLD is _mm_mullo_epi32().
The intrinsics are the same on the Intel and Microsoft Visual Studio compilers (however, there is a typo on the MSDN website).
You can easily find the intrinsic names in the software developer's manual:
http://www.intel.com/Assets/PDF/manual/253667.pdf
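There is also an unsigned packed doubleword integer multiply instruction, PMULUDQ, which has been available since SSE2. Its intrinsics are:
_mm_mul_su32() (64-bit MMX-register form)
_mm_mul_epu32() (128-bit XMM-register form)
Note also that PMULLD keeps only the low 32 bits of each product, and those bits are the same for signed and unsigned operands, so _mm_mullo_epi32() can be used for unsigned values as well.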
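As a small illustration of the unsigned multiply, here is a minimal sketch using _mm_mul_epu32(). The helper name mul_even_u32 and its array-based interface are made up for this example. PMULUDQ multiplies only the even-numbered 32-bit elements (0 and 2) of each operand and produces two full 64-bit products.

```c
#include <emmintrin.h>  /* SSE2 intrinsics, including _mm_mul_epu32 */
#include <stdint.h>

/* Hypothetical helper: computes a[0]*b[0] and a[2]*b[2] as unsigned
   32-bit x 32-bit -> 64-bit products and stores them in out[0], out[1].
   Elements 1 and 3 of the inputs are ignored by PMULUDQ. */
static void mul_even_u32(const uint32_t a[4], const uint32_t b[4],
                         uint64_t out[2])
{
    /* Unaligned loads, so the caller's arrays need no special alignment. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* PMULUDQ: unsigned multiply of the low dword of each 64-bit lane. */
    __m128i prod = _mm_mul_epu32(va, vb);

    _mm_storeu_si128((__m128i *)out, prod);
}
```

To multiply all four elements this way, you would typically shift the odd elements down (for example with _mm_srli_epi64(v, 32)) and call _mm_mul_epu32() a second time.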
Regarding the second question, I believe you are using the Intel compiler (the latest is 11.1), as it supports auto-vectorization (you don't need to write intrinsics). I have found that the compiler is very aggressive about auto-vectorization: if it finds a loop that can be vectorized (based on compiler heuristics), it will vectorize it for you. You can also enable the vectorization report, which prints which loops were vectorized and which were not, so you may want to look at the loops that were not and see whether they can be reworked to vectorize. The switches for SSE4.1 or SSE4.2 are as follows:
-QxSSE4.1 -QxSSE4.2 -arch:SSE4.1
To get the vectorization report, use the following switch:
/Qvec-report2. You can try this switch with different "n" values (/Qvec-reportn) to change the level of detail. Please check the icl help for more information.
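For reference, here is a sketch of the kind of loop an auto-vectorizer can usually handle without any intrinsics. The function name mul_arrays_u32 is hypothetical, and whether a given compiler version actually vectorizes it depends on its heuristics and the switches used, so check the vectorization report rather than assuming.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: element-wise 32-bit multiply of two arrays.
   The restrict qualifiers (C99) promise the compiler that the arrays do
   not overlap, which removes a common obstacle to vectorization.
   Unsigned arithmetic wraps modulo 2^32, matching the low-32-bit
   result that PMULLD produces. */
void mul_arrays_u32(uint32_t *restrict dst,
                    const uint32_t *restrict a,
                    const uint32_t *restrict b,
                    size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] * b[i];
}
```

When SSE4.1 code generation is enabled, a compiler that vectorizes this loop can use PMULLD for the multiply. Without SSE4.1 it has to build the 32-bit product from other instructions, which is one place the switch can make a difference.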