It looks like you are operating on a vector of size 20000. The IPP MX functions, however, are optimized for operations on small matrices and small vectors, particularly matrices of size 3x3, 4x4, 5x5 and 6x6 and vectors of length 3, 4, 5 and 6.
For the simple C code you tested, the compiler can easily vectorize the loop and get good performance.
In your code, the following inner loops take most of the time. They just subtract a constant, temp_b:
A good replacement for such code is the following IPP function call:
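
As a rough sketch of this kind of replacement, assuming the inner loop subtracts the constant temp_b from every element of a single-precision array in place: the function name sub_const, the float element type and the USE_IPP macro are assumptions made for illustration, not taken from your code. ippsSubC_32f_I is one IPP signal-processing function that fits this pattern, though it may not be the exact call intended above.

```c
#ifdef USE_IPP
#include <ipps.h>   /* IPP signal-processing domain; only needed when USE_IPP is defined */
#endif

/* Hypothetical helper: subtract the constant temp_b from each of the
   len elements of data, in place. Returns 0 on success, -1 on error. */
int sub_const(float *data, int len, float temp_b)
{
    if (data == 0 || len <= 0)
        return -1;

#ifdef USE_IPP
    /* One call over the whole vector instead of a hand-written loop. */
    return ippsSubC_32f_I(temp_b, data, len) == ippStsNoErr ? 0 : -1;
#else
    /* Plain C version: a simple loop like this is easy for the compiler to vectorize. */
    for (int i = 0; i < len; ++i)
        data[i] -= temp_b;
    return 0;
#endif
}
```

Building without USE_IPP gives the plain loop, so the two variants can be timed against each other on the same 20000-element vector.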