You don't give enough information to begin answering most of your questions.
With regard to (e), auto-vectorization is typically associated with high bus utilization and a higher CPI than equivalent non-vector code. This points up the fact that those measurements don't correlate with the efficiency of your code. It's almost in the same category as avoiding vectorization in order to inflate your threaded parallel scaling ratings: the most efficient way to lower CPI is to add useless instructions, the opposite of what vectorization does.
As to (c), if you turned on gradual underflow (e.g. by a compilation option such as /fp:source), you might get more information by comparing it with a run where you set abrupt underflow (e.g. by putting /Qftz after /fp:source when compiling your main(), or by executing the corresponding SSE intrinsic).
Shooting in the dark on (a): does it make a difference when you use a good affinity mapping? Presumably you are using some kind of multi-threading, but I can't imagine how you expect us to guess which one.