The Intel MKL doesn't contain such functionality. You can try to develop your own code and build it with the Intel Compiler using some of its high-level optimization options. That would produce very good performance results.
I forgot to add: you can find similar functionality in another performance library, IPP. See the Signal Processing domain: the functions Sub (subtracts the elements of two vectors) and then Sqrt (computes a square root of each element of a vector).
As this appears to involve 3 IPP functions (vector subtract, vector multiply, sum reduction), it may take twice as long as plain C or Fortran with a vectorizing compiler, and it won't be much more readable than what you would write with intrinsics in the MS compiler.
?sum (add without taking absolute value) should work here too. Since you say you want speed, plain C or Fortran code with a vectorizing compiler would be a better bet. If you are using the current Intel C or Fortran compiler and haven't figured out how to use the simd reduction directive, let us know which compiler you want sample code for. Do you know Fortran?

    sqrt(dot_product(x-y, x-y))
If that doesn't vectorize (it requires /fp:fast), it's embarrassing, but you could then fall back on the directive. If your cases are long enough for threading to be useful and auto-parallelization doesn't do the entire job, we could give you examples in OpenMP, or maybe someone on the TBB, Cilk, or ArBB forum could help you, if you have chosen one of those C++ frameworks.
(untested; show us your code)

    sum = 0
    !$omp parallel do reduction(+: sum) private(diff) if(n>9999)
    !dir$ simd reduction(+: sum)
    do i = 1, n
       diff = x(i) - y(i)
       sum = sum + diff**2
    end do
    yourresult = sqrt(sum)
In fact, /Qparallel /Qpar-threshold:0 appears to handle this better than the OpenMP and simd directives.
In my example, when alignment is asserted, the compiler chooses only vectorization, no threading, even at /Qpar-threshold:0. As the SSE2 vector code produces 4 parallel sums, which are added implicitly at the end, handling 8 loop iterations per pass, you can see that a large data set would be needed before threading could prove beneficial.