cblas_drotg vs dlartg vs dlartgp vs my own using sqrt
I have an application that relies heavily on these primitives to generate Givens rotations. In my experiments, in an AVX-enabled environment with Intel MKL 11 beta update 2, I observed the points listed below. I was calling these primitives in the hope that MKL does something really smart and gives a noticeable speedup over a plain sqrt version. Why is that not the case? Is there any documentation on the number of flops or, better, the number of cycles needed for these routines?
cblas_drotg leads to non-convergence of my algorithm (too many rounding errors). I haven't tried setting CBWR to COMPATIBLE yet, though; I still need to try that.
dlartg is slow
dlartgp is faster than dlartg. This puzzled me, since I expected dlartgp to cost more: it gives an extra guarantee, namely positiveness of the diagonal elements.
My own plain sqrt version (see below) outperforms all of the above, has no such errors, and also gives positiveness of the diagonal elements (needed for updating a Cholesky decomposition, where I need positiveness of the trace, i.e. of the eigenvalues, to compute the log of the trace).