1. The compiler is writing two memory locations per loop trip.
2. Probably a result of "event skid". I assume you are referring to the Clockticks event and that the numbers in your snippet are samples. I can only speculate that the second access incurs less of a time penalty once the first instruction has executed; perhaps some processor internals work more efficiently when the accesses are issued sequentially like that.
3. From the processor manual for MOVNTPD:
"Moves the double quadword in the source operand (second operand) to the destination operand (first operand) using a non-temporal hint to minimize cache pollution during the write to memory. The source operand is an XMM register, which is assumed to contain two packed double-precision floating-point values. The destination operand is a 128-bit memory location."
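
For illustration only, here is a minimal C sketch of how such a store is usually written by hand. It uses the SSE2 intrinsic `_mm_stream_pd`, which compilers normally emit as MOVNTPD. This is not your loop: the helper name `fill_stream` and its parameters are made up for the example, and it assumes a 16-byte-aligned destination and an even element count. Each call stores 128 bits, i.e. two adjacent doubles.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: __m128d, _mm_set1_pd, _mm_stream_pd, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the original post): fill dst[0..n) with
   `value` using non-temporal stores.
   Assumptions: dst is 16-byte aligned and n is even. */
static void fill_stream(double *dst, size_t n, double value)
{
    assert(((uintptr_t)dst % 16) == 0);  /* MOVNTPD requires an aligned address */
    assert(n % 2 == 0);                  /* each store covers two doubles */

    __m128d v = _mm_set1_pd(value);      /* both 64-bit lanes hold `value` */
    for (size_t i = 0; i < n; i += 2) {
        _mm_stream_pd(dst + i, v);       /* one 128-bit store: dst[i], dst[i+1] */
    }
    _mm_sfence();  /* order the streaming stores before later stores */
}
```

Calling `fill_stream(buf, 8, 1.5)` on an aligned 8-element buffer would issue four such stores. Whether the non-temporal hint actually helps depends on the buffer size and on whether the data is read back soon afterwards, so it is worth measuring both variants.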