Thanks for your input, but none of the above made any difference to the code. I played with KMP_BLOCKTIME for an hour or more. I set it to 0, 200, inf, and so on, but it led nowhere. Sometimes it sped up execution for a given input data set, but when the data changed, the gain was lost.
What is the difference between linking with -libomp5 and with -openmp? In my experiments, I found -libomp5 to be much faster than -openmp.
-libomp5 shouldn't work; did you mean -liomp5? The latter is set by ifort -openmp, but you would need to specify the library explicitly if you were using some other command for linking. The KMP environment variables are specific to Intel OpenMP, while the OMP ones follow the OpenMP standard. One purpose of increasing KMP_BLOCKTIME would be to maintain KMP_AFFINITY settings across a gap of more than 0.2 seconds (the default block time of 200 ms) between OpenMP parallel regions. It's entirely possible that KMP_BLOCKTIME has little effect in normal circumstances.
If your application doesn't have enough inherent parallelism to benefit from threading, a GPU is not likely to help either. It's true that BLAS level 2 operations, which normally would be vectorized, would need to operate on extremely large data sets to benefit from threaded parallelism internal to those operations. Thus it is normal to apply parallelism at a higher level, with each thread performing entire, independent level 2 operations.
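To illustrate what parallelism at a higher level can look like, here is a minimal sketch in C. It is not taken from the code discussed in this thread: the names matvec and batched_matvec, the row-major layout, and the assumption that the application has a batch of independent matrix-vector products are all hypothetical. Each thread runs whole products serially, so the inner loops stay available for vectorization.

```c
#include <stddef.h>

/* y = A * x for one row-major n-by-n matrix: a plain level 2
   (matrix-vector) kernel. The inner loop is left for the compiler
   to vectorize; there is no threading inside the operation. */
static void matvec(size_t n, const double *a, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i) {
        double sum = 0.0;
        for (size_t j = 0; j < n; ++j)
            sum += a[i * n + j] * x[j];
        y[i] = sum;
    }
}

/* Hypothetical driver: 'count' independent products stored back to back.
   The OpenMP loop hands out whole products, one per iteration, so each
   thread performs entire level 2 operations on its own data. */
void batched_matvec(size_t count, size_t n,
                    const double *a, const double *x, double *y)
{
    long k; /* signed index, accepted by older OpenMP versions */
    #pragma omp parallel for schedule(static)
    for (k = 0; k < (long)count; ++k)
        matvec(n, a + (size_t)k * n * n,
                  x + (size_t)k * n,
                  y + (size_t)k * n);
}
```

This only pays off if the products really are independent and there are at least as many of them as threads. Built without the compiler's OpenMP option (for example -fopenmp with gcc), the pragma is ignored and the same code runs serially, which makes it easy to compare timings.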