By resetting MKL_DYNAMIC (setting it to FALSE), I believe you are disabling MKL's own effort to place threads efficiently, so you should set KMP_AFFINITY (or the MKL equivalent) directly for each number of threads, according to your platform. For 2 threads, that probably means 1 thread per CPU on a dual-CPU system, or 1 thread per cache on a split-cache CPU; never 2 threads per core while other cores are idle, and so on.
As you are using dgemv, you might expect performance to drop as soon as 1 or more cores each run 2 threads; are you trying to quantify that?
If your cache footprint is very large, it is possible to see your performance peak as soon as you have enough threads to use all of the cache. In such a case, VTune cache events could clarify it.
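To make the affinity suggestion concrete, here is a minimal sketch of a thread-count sweep with explicit placement. It assumes the Intel OpenMP runtime (which is what reads KMP_AFFINITY), a machine where 1, 2 and 4 threads are sensible counts, and a benchmark binary called ./dgemv_bench; that name and the thread counts are placeholders, not anything from the original post.

```shell
#!/bin/sh
# Hypothetical benchmark binary; replace with your own dgemv test program.
BENCH=./dgemv_bench

# Force MKL to use exactly the thread count we request.
export MKL_DYNAMIC=FALSE

# Spread threads across packages/cores before doubling up on any one core.
# "verbose" makes the runtime print which logical CPUs each thread is bound to.
export KMP_AFFINITY=verbose,granularity=fine,scatter

# Assumed thread counts for a 4-core machine; adjust to your platform.
for n in 1 2 4; do
    export MKL_NUM_THREADS=$n
    export OMP_NUM_THREADS=$n
    echo "threads=$n KMP_AFFINITY=$KMP_AFFINITY"
    # Run the benchmark only if it actually exists.
    if [ -x "$BENCH" ]; then
        "$BENCH"
    fi
done
```

Whether scatter or compact gives the better placement depends on your topology, so it is worth checking the verbose binding output rather than trusting the setting blindly.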
Have you tried export MKL_DYNAMIC=TRUE
This lets MKL choose a suitable number of threads for the problem. As Tim noted, for functions such as DGEMV and DDOT, increasing the number of threads may not improve performance. If MKL_DYNAMIC is FALSE, MKL is forced to use the number of threads you set.
The overhead of joining OMP threads can be significant if the data volume in your MKL functions is not big enough.
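As a rough sketch of the dynamic setup: with MKL_DYNAMIC=TRUE, the requested thread count acts as an upper limit, and MKL may use fewer threads when it judges the problem too small. The value 4 below is only an assumed example, not a recommendation for your machine.

```shell
#!/bin/sh
# Let MKL reduce the thread count when it judges fewer threads to be better.
export MKL_DYNAMIC=TRUE

# Assumed upper bound for illustration; MKL may use fewer threads than this.
export MKL_NUM_THREADS=4

echo "MKL_DYNAMIC=$MKL_DYNAMIC MKL_NUM_THREADS=$MKL_NUM_THREADS"
```

Comparing timings from this setup against a run with MKL_DYNAMIC=FALSE and the same MKL_NUM_THREADS should show whether the forced thread count is hurting you.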
Is SMT (hyper-threading) on?
If so, do the 8 CPUs mean 4 cores with hyper-threading? Please clarify.
If each thread is bound to more than one logical CPU (likely two), then SMT is on.
Also please look at related MKL articles/discussions:
BTW, if Hyper-Threading Technology is enabled on the system, it is recommended to set the number of threads equal to the number of real processors or cores, that is, only half the number of logical processors.
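One way to apply that recommendation is sketched below. It assumes a Linux system where lscpu prints an English "Thread(s) per core" line; if lscpu is missing or the line cannot be parsed, the script falls back to assuming 1 thread per core. The helper name physical_cores is made up for this example.

```shell
#!/bin/sh
# physical_cores LOGICAL THREADS_PER_CORE
# Prints the number of physical cores, never less than 1.
physical_cores() {
    n=$(( $1 / $2 ))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

# Number of logical CPUs currently online (falls back to 1 if unavailable).
logical=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Hardware threads per core as reported by lscpu (2 usually means SMT is on).
tpc=$(lscpu 2>/dev/null | awk -F: '/^Thread\(s\) per core/ {gsub(/ /, "", $2); print $2}')

# Fall back to 1 if the value is empty, zero, or not a plain number.
case "$tpc" in
    ''|0|*[!0-9]*) tpc=1 ;;
esac

# Use one MKL thread per physical core.
MKL_NUM_THREADS=$(physical_cores "$logical" "$tpc")
export MKL_NUM_THREADS

echo "logical=$logical threads_per_core=$tpc MKL_NUM_THREADS=$MKL_NUM_THREADS"
```

For the case discussed here, physical_cores 8 2 prints 4, i.e. 8 logical CPUs with hyper-threading on would give MKL_NUM_THREADS=4.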