It is because there is only one floating-point execution unit for logic operations (XORPS) and register copies (MOVAPS). FP workloads almost never require more throughput for these, but for vector integer workloads you can do up to three MOVDQAs or PXORs per cycle.
It may interest you that in the upcoming Ivy Bridge, register copies are not executed at all; they are resolved during register renaming in the front end.
No, the recommendation is 100% correct: using MOVDQA on registers that are passed from and to FP-domain instructions would incur additional latencies (2+2 on Nehalem, 1+1 on Sandy Bridge) due to the bypass between execution domains. I'll reiterate that the vast majority of FP code does not require higher throughput for register copies (i.e. adding more units would not have improved performance), so there is no real issue here.
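To make the domain point concrete, here is a minimal sketch in C with SSE2 intrinsics. The function names are made up for illustration. It negates floats in two ways: `_mm_xor_ps` expresses the FP-domain XORPS, while `_mm_xor_si128` expresses the integer-domain PXOR applied to the same bits. Both give identical results; the difference discussed above is only the bypass latency when FP data crosses into the integer domain and back. Note that a compiler is free to pick either encoding, so check the generated assembly if it matters.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Negate n floats with the FP-domain XOR (XORPS). n must be a multiple of 4. */
void negate_fp_domain(float *dst, const float *src, size_t n)
{
    assert(n % 4 == 0);
    const __m128 sign = _mm_set1_ps(-0.0f);   /* only the sign bit is set */
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(src + i);
        _mm_storeu_ps(dst + i, _mm_xor_ps(v, sign));
    }
}

/* Same operation expressed with the integer-domain XOR (PXOR).
 * The casts are free; they only reinterpret the register contents. */
void negate_int_domain(float *dst, const float *src, size_t n)
{
    assert(n % 4 == 0);
    const __m128i sign = _mm_set1_epi32(INT_MIN);   /* 0x80000000 per lane */
    for (size_t i = 0; i < n; i += 4) {
        __m128i bits = _mm_castps_si128(_mm_loadu_ps(src + i));
        bits = _mm_xor_si128(bits, sign);
        _mm_storeu_ps(dst + i, _mm_castsi128_ps(bits));
    }
}

/* Returns 1 if both variants produce bit-identical output. */
int results_match(void)
{
    const float in[8] = {1.0f, -2.5f, 0.0f, 3.25f, -4.0f, 5.5f, -0.0f, 7.0f};
    float a[8], b[8];
    negate_fp_domain(a, in, 8);
    negate_int_domain(b, in, 8);
    return memcmp(a, b, sizeof a) == 0;
}
```

If the surrounding code is FP arithmetic (ADDPS, MULPS and so on), the first variant keeps everything in one domain, which is what the recommendation is about.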
And now, beginning with Ivy Bridge, these register copy operations are not executed at all but simply resolved at the register renaming stage in the front end.
Plus, with the non-destructive operand form introduced in AVX, there is hardly any need for register copies going forward.
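A small sketch of why the non-destructive form removes copies (the function name is hypothetical, and the assembly in the comments is only what a compiler would typically emit, not guaranteed output). Both inputs are needed twice, so the two-operand SSE encoding has to preserve one of them with a MOVAPS, while the three-operand AVX encoding does not.

```c
#include <emmintrin.h>

/* Computes (a + b) * (a - b) per lane. Both a and b are live after the add. */
__m128 sum_times_diff(__m128 a, __m128 b)
{
    /* Legacy SSE encoding (destination is also a source), typically:
     *     movaps xmm2, xmm0        ; copy a so the add does not clobber it
     *     addps  xmm2, xmm1        ; xmm2 = a + b
     *     subps  xmm0, xmm1        ; xmm0 = a - b
     *     mulps  xmm0, xmm2
     *
     * AVX encoding (separate destination, e.g. built with -mavx), typically:
     *     vaddps xmm2, xmm0, xmm1  ; no copy needed
     *     vsubps xmm0, xmm0, xmm1
     *     vmulps xmm0, xmm2, xmm0
     */
    __m128 sum  = _mm_add_ps(a, b);
    __m128 diff = _mm_sub_ps(a, b);
    return _mm_mul_ps(sum, diff);
}
```

The source is the same in both cases; only the instruction encoding chosen by the compiler changes.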