When there is a scale factor parameter at the end, does it affect performance much? I don't really need it, but there isn't a version without the scale factor. If I set it to 1, does that really do the same thing as not having a scale factor at all?
Also, there is a sub function that looks like this:
In many cases the input vector and the output vector need to have the same format; that is what these functions are made for.
If you need to process data at a higher bit depth, it is often faster to convert once at the beginning of the algorithm, process the 32-bit data, and convert back to 16-bit at the end. Every function with different input/output bit depths has to convert each value (with SSE2, 8 values at a time) to the higher bit depth before performing the operation. That adds a lot of latency and therefore reduces performance.
Of course, some functions with different input/output bit depths are still missing, but I think even Intel needs time to react to customer wishes and implement them. So maybe they will be there someday :)