It looks like vdSqrt does not work correctly in multi-threaded mode. Its performance is significantly slower than a simple loop of sqrt() calls. Moreover, it ignores OMP_NUM_THREADS completely.
I have made the attached test program to reproduce the bug, which had a dramatic impact and was hard to spot in amore complex application. This simple program computes the square root of a vector of 10000 elements for 5000 times. In order to prevent the compiler from removing code, some randomy selected items are added to a variable which is later printed to the standard output.
Three different ways of computing the square root are tested: 1) vdSqrt 2) loop of sqrt() 3) OpenMP parallel loop of sqrt()
Since I'm on a dual core machine (Core i5 430M), I run the code with OMP_NUM_THREADS=1 and OMP_NUM_THREADS=2. This measurements are obtained in a VMware virtual machine but the same behavior was observed on physical machines as well. The issue was reproduced on Debian 5, Ubuntu 8.04 and Ubuntu 10.10.
OMP_NUM_THREADS=1 OMP_NUM_THREADS=2 1) 0.99 s 0.96 s 2) 0.33 s 0.33 s 3) 0.34 s 0.21 s
If I force the code to run on fewer core by setting processor affinity with taskset the results are vastly different for vdSqrt: taskset -c 0 taskset -c 0,2 1) 0.33 s 1.19 s
In a more complex program, replacing vdSqrt with a loop of sqrt yielded a 9x speedup.
Performing a concurrency analysis with VTunes gives a better insight on the problem. While the third method and other MKL routines such as dgemv and dgemm only spawn a single worker thread with OMP_NUM_THREADS=2, vdSqrt spawns about 20 worker threads. The attached pictures highlights the issue.