I don't think that MKL is the right way to go in that case. You should rely on your own optimizations.
So, several optimization techniques could be used. They are as follows:
- Unroll Loops +
- OpenMP +++
- Unroll Loops + SSE ++++
- Unroll Loops + SSE + OpenMP +++++
Complexity is marked by the number of '+' signs:
- '+' simple
- '+++++' more complex
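To make the list above more concrete, here is a minimal sketch in C. It assumes the operation is an element-wise addition of two arrays of short (16-bit) integers, which the thread does not state explicitly. The function names add_s16_unrolled and add_s16_sse2_omp are hypothetical. The SSE2 path is only compiled when the compiler defines __SSE2__, and the OpenMP pragma only takes effect when OpenMP is enabled at compile time (for example -fopenmp with GCC); otherwise it is ignored and the loop runs serially.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i], loop unrolled by 4.
   Assumption: sums fit in 16 bits (on mainstream compilers an
   out-of-range sum wraps around). */
void add_s16_unrolled(const int16_t *a, const int16_t *b,
                      int16_t *dst, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        dst[i]     = (int16_t)(a[i]     + b[i]);
        dst[i + 1] = (int16_t)(a[i + 1] + b[i + 1]);
        dst[i + 2] = (int16_t)(a[i + 2] + b[i + 2]);
        dst[i + 3] = (int16_t)(a[i + 3] + b[i + 3]);
    }
    for (; i < n; ++i)               /* remaining 0..3 elements */
        dst[i] = (int16_t)(a[i] + b[i]);
}

/* Hypothetical helper: same result, using SSE2 (8 shorts per register),
   unrolled by 2 (16 shorts per iteration) and split across threads
   with OpenMP. Falls back to the scalar version without SSE2. */
void add_s16_sse2_omp(const int16_t *a, const int16_t *b,
                      int16_t *dst, size_t n)
{
#if defined(__SSE2__)
    const ptrdiff_t blocks = (ptrdiff_t)(n / 16);

    #pragma omp parallel for
    for (ptrdiff_t k = 0; k < blocks; ++k) {
        const size_t i = (size_t)k * 16;
        /* Unaligned loads: no alignment is assumed for the inputs. */
        __m128i a0 = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i a1 = _mm_loadu_si128((const __m128i *)(a + i + 8));
        __m128i b0 = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i b1 = _mm_loadu_si128((const __m128i *)(b + i + 8));
        _mm_storeu_si128((__m128i *)(dst + i),     _mm_add_epi16(a0, b0));
        _mm_storeu_si128((__m128i *)(dst + i + 8), _mm_add_epi16(a1, b1));
    }
    for (size_t i = (size_t)blocks * 16; i < n; ++i)   /* scalar tail */
        dst[i] = (int16_t)(a[i] + b[i]);
#else
    add_s16_unrolled(a, b, dst, n);
#endif
}
```

Both functions take the same arguments, so they can be swapped and timed against each other on the target platform. For small arrays the thread start-up cost of OpenMP may outweigh any gain.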
It is not clear to me what your platform is, and the final decision depends on it.
For example, I don't rely on SSE or OpenMP for any embedded platforms.
Also, I wouldn't try to depend on a compiler's optimization. If an algorithm is NOT optimized enough,
for example 1,000 code lines instead of 250, a compiler's optimization won't improve performance significantly.
Let me know if you're interested in more details. I have similar requirements for
linear algebra algorithms on my project.
Just note that the data type is short, whereas MKL routines mainly address floating-point types such as real and complex. So MKL is not the right way.
As Sergey suggested, the options are:
1) optimize manually
2) let the Intel compiler optimize, for example via vectorization
3) or try IPP, like ippiAdd_16u/16s_
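For option 2, a loop written so that the compiler can vectorize it on its own might look like the sketch below. It again assumes an element-wise addition of short arrays, and the name add_s16_autovec is hypothetical. The restrict qualifiers are a promise that the three arrays do not overlap. Whether the loop actually gets vectorized depends on the compiler and the optimization level, so it is worth checking the compiler's vectorization report.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: plain loop left for the compiler to vectorize.
   restrict tells the compiler that a, b and dst never alias, which
   removes the need for runtime overlap checks.
   Assumption: sums fit in 16 bits. */
void add_s16_autovec(const int16_t *restrict a,
                     const int16_t *restrict b,
                     int16_t *restrict dst, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = (int16_t)(a[i] + b[i]);
}
```

Calling it with overlapping arrays breaks the restrict promise and gives undefined behavior, so the caller must pass distinct buffers.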