movaps running very slow

nick_1234 · 03-12-2012

I was doing some simple timing tests and noticed that movaps, and most of the SSE floating-point instructions, were running slowly.

My code is below. I ran one test with a bunch of NOPs, another with integer additions on various registers, another with SSE integer instructions, and another with SSE floating-point instructions.
The results are in the comments above each block of code.
My computer has an Intel Core 2, so it can execute multiple instructions at once (it is superscalar).
It appears that for NOPs and most other instructions I can get 3 instructions through per cycle (timing with rdtsc).
But with the SSE floating-point operations I can only get 1 through per cycle.
Can anyone make sense of this?

NOPRetireTest_A PROC

; retires 10 billion instructions: 10,000,000 iterations x 250 repetitions x 4 instructions
mov rax, 10000000
@@:

CNTR = 0
WHILE CNTR LT 250
;; retires 3 per clock
;;nop
;;nop
;;nop
;;nop

;; retires 3 per clock
;;inc r8
;;inc r9
;;inc r10
;;inc r11

;; retires 1 per clock - makes sense because of heavy dependencies
;;inc r8
;;inc r8
;;inc r8
;;inc r8

;; retires 3 per clock
;;pxor xmm0, xmm4
;;pxor xmm1, xmm4
;;pxor xmm2, xmm4
;;pxor xmm3, xmm4

;; retires 3 per clock
;;movdqa xmm0, xmm4
;;movdqa xmm1, xmm4
;;movdqa xmm2, xmm4
;;movdqa xmm3, xmm4

;; only retires 1 per clock. WHY???
;xorps xmm0, xmm4
;xorps xmm1, xmm4
;xorps xmm2, xmm4
;xorps xmm3, xmm4

;; only retires 1 per clock. WHY???
movaps xmm0, xmm4
movaps xmm1, xmm4
movaps xmm2, xmm4
movaps xmm3, xmm4

CNTR = CNTR + 1
ENDM
dec rax
jnz @b
RET
NOPRetireTest_A ENDP
END
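
For readers who want to reproduce the per-clock figures quoted in the comments, here is a minimal sketch in C of the arithmetic involved. It is not the original poster's measurement code; the function and parameter names are made up for illustration. It assumes the two rdtsc readings are taken immediately before and after the call to the test routine, and it ignores the `dec`/`jnz` loop overhead. Note that on many processors rdtsc counts at a fixed reference rate, so the result matches core clocks only if the core is actually running at that frequency.

```c
#include <assert.h>
#include <stdint.h>

/* Total instructions in the timed region: outer loop iterations times
   unrolled repetitions times instructions per repetition.
   Loop overhead (dec/jnz) is deliberately ignored. */
uint64_t total_instructions(uint64_t outer_iters, uint64_t unroll,
                            uint64_t per_group)
{
    return outer_iters * unroll * per_group;
}

/* Average instructions per timestamp-counter tick, given two rdtsc
   readings taken before and after the timed region. */
double instructions_per_cycle(uint64_t instructions,
                              uint64_t tsc_start, uint64_t tsc_end)
{
    assert(tsc_end > tsc_start);
    return (double)instructions / (double)(tsc_end - tsc_start);
}
```

With the constants from the listing, `total_instructions(10000000, 250, 4)` gives the 10 billion instructions mentioned in the comment; dividing by the measured tick count gives the "per clock" figure.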

Max_L · 03-13-2012

It is because there is only one floating-point execution unit for logic operations (XORPS) and register copies (MOVAPS). FP workloads almost never require more throughput for these, but for vector integer workloads you can do up to three MOVDQAs or PXORs per cycle.

It may interest you that in the upcoming Ivy Bridge, register copies are not executed at all; they are resolved during register renaming in the front end.

-Max
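
The single-unit explanation can be turned into a back-of-the-envelope bound. The sketch below is a deliberately simplified model, not a description of the real scheduler: it assumes fully independent instructions, one instruction per unit per cycle, and no front-end or retirement limits. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Lower bound on the cycles needed to execute n independent instructions
   when only `units` execution units can accept them, assuming each unit
   takes one instruction per cycle. */
uint64_t min_cycles(uint64_t n, uint64_t units)
{
    assert(units > 0);
    return (n + units - 1) / units;  /* ceiling division */
}
```

Under this model, 1000 MOVAPS copies on one unit need at least 1000 cycles (1 per clock), while 1000 MOVDQA copies spread over three units need at least 334 cycles (about 3 per clock), which is consistent with the measurements in the first post.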

gol · 05-05-2012

Isn't it ironic that Intel recommend(ed) not mixing floating-point and integer instructions, yet here MOVDQA would be more efficient than MOVAPS when working with floats?

Max_L · 05-10-2012

No, the recommendation is 100% correct: using MOVDQA on registers that are passed from and to FP-domain instructions would incur additional latency (2+2 cycles on Nehalem, 1+1 on Sandy Bridge) due to the bypass between execution domains. I'll reiterate that the vast majority of FP code does not require higher throughput for register copies (i.e. adding more units would not have improved performance), so there is no real issue here.
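
To put numbers on the bypass cost, here is a rough sketch in C. It assumes a simple model in which every trip of a value out of the FP domain and back adds both delays to the dependency chain; the function name and the notion of "round trips" are illustrative assumptions, and only the 2+2 and 1+1 figures come from the post above.

```c
#include <stdint.h>

/* Extra latency, in cycles, added to a dependency chain when a value
   crosses into another execution domain and back `round_trips` times.
   `in_delay` and `out_delay` are the bypass delays for each direction. */
uint64_t bypass_penalty(uint64_t round_trips,
                        uint64_t in_delay, uint64_t out_delay)
{
    return round_trips * (in_delay + out_delay);
}
```

For a single MOVDQA placed between two FP instructions, this gives 4 extra cycles with the Nehalem figures and 2 with the Sandy Bridge figures. The cost is in latency on a dependent chain, which is a different quantity from the copy throughput measured in the first post.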

And now, beginning with Ivy Bridge, these register copy operations are not executed at all but simply resolved at the register renaming stage in the front end.

Plus, with the non-destructive three-operand forms introduced in AVX, there is hardly a need for register copies going forward.

-Max