- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

- Tags:
- Intel® Advanced Vector Extensions (Intel® AVX)
- Intel® Streaming SIMD Extensions
- Parallel Computing

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

I am not sure how much your result will be different after trying to apply /Qpc64 switch.

Another possibility could be related to implementation at hardware/microcode level.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

apply /Qpc64 switch

pmulhrsw isn't a floating point instruction.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

I believe, what the operation does is equivalent to mathematical rounding (i.e. to the closest nearest integer), not to the positive infinity as you say. Whether or not this behavior is "better" than truncation or another rounding mode is debatable and depends on the tasks you have.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

A most serious mistake. Thanks for correcting me, andysem.

I was mistaken into thinking it was biased because the formula from MSDN, https://msdn.microsoft.com/en-us/library/bb513995.aspx

was (x * y + 16384) >> 15. This looked very similar to the int(x + 0.5) method for rounding, which I know is biased for negative #s and cringe at. But I didn't realize right shift for negative numbers isn't the same as dividing and casting to an int.

Plus. it wasn't matching my non-SIMD reference implementation, which turns out to be biased since I was calculating int(sum / 9.0f), which rounds towards zero.

I should've had more doubt before questioning the behavior of something implemented in hardware, which would've been vigorously vetted, since int(x + 0.5) would be a very expensive mistake.

_mm_mulhrs_epi16() still has some bias of always rounding x.5 towards + infinity. But that's not a big deal for my application.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

delete me

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

You are mistaken again. According to Intel Intrinsics guide

*Synopsis
__m128i _mm_mulhrs_epi16 (__m128i a, __m128i b)
#include "tmmintrin.h"
Instruction: pmulhrsw
CPUID Feature Flag: SSSE3
Description
Multiply packed 16-bit integers in a and b, producing intermediate signed 32-bit integers. Truncate each intermediate integer to the 18 most significant bits, round by adding 1, and store bits [16:1] to dst.
Operation
FOR j := 0 to 7
i := j*16
tmp[31:0] := ((a[i+15:i] * b[i+15:i]) >> 14) + 1
dst[i+15:i] := tmp[16:1]
ENDFOR*

This is equivalent to (x*y + 16384) >> **16**

Or equivalent to int(x + 0.25) method for rounding.

Jim Dempsey

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

No, the rounding is to the nearest integer (x * y + 16384) >> 15. If you're not convinced, consider this example:

_mm_mulhrs_epi16(1, 16383) = 0 // 1 * 0.49999999

_mm_mulhrs_epi16(1, 16384) = 1 // 1 * 0.5

I've verified the behavior for all possible x & y, and the behavior is expected except when it overflows at x = -32768 and y = -32768.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

jimdempseyatthecove wrote:

FOR j := 0 to 7

i := j*16

tmp[31:0] := ((a[i+15:i] * b[i+15:i]) >> 14) + 1

dst[i+15:i] := tmp[16:1]

ENDFOR

when i == 0 we have:

dst = ((a*b >> 14) + 1) >> 1 (1)

so this is indeed a round to nearest as per the Intrinsics Guide

btw this can be simplified as follows:

dst = a*b+16384 >> 14 >> 1 (1.1)

dst = a*b+16384 >> 15 (1.2)

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

>>>pmulhrsw isn't a floating point instruction.>>>

Rounding algorithm may involve operations on floating point values.

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page