I need help optimizing an implementation of a function that computes the normal probability density function for a vector of n values. It is called from MATLAB via mex, so my goal is for the IPP-based version to be faster than MATLAB's "normpdf" toolbox function.
I have a working implementation that uses ippsSub (not in place), followed by ippsSqr, ippsMulC, ippsExp, and ippsDiv, all in place. I receive the values from MATLAB without copying them, allocate an output array with a MathWorks-provided call, perform the out-of-place subtraction from the input buffer into the output buffer, and then do the remaining arithmetic operations in place on the output buffer.
The result is an implementation that is faster than MATLAB when n is below about a million for single-precision data; above a million, MATLAB becomes consistently faster. This doesn't make sense to me: if anything, I would expect this IPP-based implementation to be slower for small n and faster for large n. Can anyone explain the behavior I'm seeing, or suggest optimizations I could make?
Also, in the IPP "Reference Manual, Volume 1: Signal Processing" from March 2009, I see that there used to be a function in the "Speech Recognition Functions" section called ippsExpNegSqr, but it no longer seems to exist. What happened to this function?