IPPS: multiplying two 16-bit numbers needs a 32-bit product

Derek_Woodman · ‎03-16-2010

Hello,

I want to use IPPS to do vector arithmetic. I found the following function:

IppStatus ippsMul_16s_Sfs(const Ipp16s* pSrc1, const Ipp16s* pSrc2, Ipp16s* pDst, int len, int scaleFactor);

However, what if the product of the two numbers I am multiplying requires 32 bits? I want something like:

IppStatus ippsMul_16s32s(const Ipp16s* pSrc1, const Ipp16s* pSrc2, Ipp32s* pDst, int len);

Do functions like this exist? Am I just not looking in the right place?

Thanks!

Derek_Woodman · ‎03-16-2010

Ok, I found the following function:

IppStatus ippsMul_16s32s_Sfs(const Ipp16s* pSrc1, const Ipp16s* pSrc2, Ipp32s* pDst, int len, int scaleFactor);

That should work for multiplying two vectors. But what about the MulC variant? Do I have to copy the 16-bit vector into a 32-bit vector and then just use the 32s variant?

Also, I have a more general question. How long does it take to copy vectors? And what about in-place functions versus not-in-place functions: is there a big performance difference between them?

Derek_Woodman · ‎03-16-2010

Ok sorry for all the replies.

But I meant to comment on the functions I found:

When there is a scale factor on the end, does it affect performance much? I don't really need it, but there isn't a version without the scale factor. If I set it to 1, is that really the same as having no scale factor?

Also, there is a Sub function that looks like this:

IppStatus ippsSub_16s32f(const Ipp16s* pSrc1, const Ipp16s* pSrc2, Ipp32f* pDst, int len);

I don't really want a floating point representation because I am just working with integers. Why isn't there a 32s version?

renegr · ‎03-17-2010

In many cases the format of the input vector and the output vector needs to be the same. That's the case most of the functions are designed for.

If you need to process data at a higher bit depth, it will often be faster to convert once at the beginning of the algorithm, process the 32-bit data, and convert back to 16-bit at the end.
Every function that has different input and output bit depths needs to convert each value (with SSE2, 8 values at a time) to the higher bit depth before doing the operation. This adds considerable latency and therefore reduces performance.
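
To illustrate the "convert once, process wide, convert back" pattern, here is a minimal sketch in plain C. The three helpers are hypothetical hand-written loops, not IPP functions; in real code each loop would be replaced by whatever conversion and 32-bit arithmetic routines your IPP version actually provides. Saturation on the way back to 16-bit is an assumption made for this example.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: widen 16-bit samples to 32-bit once, up front. */
static void widen_16s_to_32s(const int16_t *src, int32_t *dst, size_t len)
{
    for (size_t i = 0; i < len; ++i)
        dst[i] = (int32_t)src[i];
}

/* Hypothetical helper: multiply a 32-bit vector by a constant, in place.
   The caller is responsible for keeping the products within 32 bits. */
static void mulc_32s_inplace(int32_t *buf, int32_t c, size_t len)
{
    for (size_t i = 0; i < len; ++i)
        buf[i] *= c;
}

/* Hypothetical helper: narrow back to 16-bit at the very end,
   clamping values that no longer fit (saturation is assumed here). */
static void narrow_32s_to_16s_sat(const int32_t *src, int16_t *dst, size_t len)
{
    for (size_t i = 0; i < len; ++i) {
        int32_t v = src[i];
        if (v > INT16_MAX) v = INT16_MAX;
        if (v < INT16_MIN) v = INT16_MIN;
        dst[i] = (int16_t)v;
    }
}
```

All intermediate steps then run on the 32-bit buffer, so the widening cost is paid once instead of inside every operation.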

Of course, some functions with different input and output bit depths are missing, but I think even Intel needs time to respond to customer requests and implement them. So maybe they will be there sometime :)
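
On the scale-factor question earlier in the thread: as far as I understand the IPP integer scaling convention, the result is multiplied by 2^(-scaleFactor) and then saturated to the output type. Under that reading, scaleFactor = 0 is the no-scaling case, while scaleFactor = 1 halves every product. The plain C loop below is a hypothetical reference sketch of that convention, not an IPP function. It truncates toward zero when scaling down, which may differ from the rounding IPP actually applies, and it assumes the scale factor stays within -31..31.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical reference loop (not an IPP call): 16-bit inputs, 32-bit
   output, result scaled by 2^(-scale_factor) and saturated to 32 bits.
   Assumes -31 <= scale_factor <= 31. */
static void mul_16s32s_scaled(const int16_t *a, const int16_t *b,
                              int32_t *dst, size_t len, int scale_factor)
{
    for (size_t i = 0; i < len; ++i) {
        /* Form the product in 64 bits so scaling up cannot overflow. */
        int64_t p = (int64_t)a[i] * (int64_t)b[i];

        if (scale_factor > 0)
            p /= ((int64_t)1 << scale_factor);   /* truncates toward zero */
        else if (scale_factor < 0)
            p *= ((int64_t)1 << -scale_factor);

        /* Saturate to the 32-bit output range. */
        if (p > INT32_MAX) p = INT32_MAX;
        if (p < INT32_MIN) p = INT32_MIN;
        dst[i] = (int32_t)p;
    }
}
```

With scale_factor = 0 the loop stores the full product of each pair, which is what the original question was after.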