Single precision ADD and MUL latencies on Ivybridge

perfwise · 08-22-2012

Hi,

I've looked in the opt guide, and it states that the latencies for SP and DP fp ADD and MUL instructions are still 3 and 5 cycles. However, on ADDSS and ADDPS I now measure a 4-cycle latency, whereas on Intel SB it was 3. The DP variants (ADDSD and ADDPD) are still 3 cycles. Likewise, on MULSS and MULPS I now measure a latency of 6 cycles, whereas I only measured 5 before. DP is the same, 5 cycles as before. I run repetitive loops containing many copies of a single instruction to determine throughput, and latency is determined similarly but with chained dependencies. So a chained dependency would be:

addss xmm0,xmm1
addss xmm0,xmm2
addss xmm0,xmm1
addss xmm0,xmm2
...

where xmm1 and xmm2 are the negatives of one another.
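For readers who want to try the chained measurement described above, here is a minimal sketch in C with GCC inline assembly. It is not the poster's test: the names `run_addss_chain`, `ns_per_add` and `CHAIN_LEN` are made up for illustration, and the timing uses the standard `clock()` call, which reports CPU time in nanoseconds after scaling rather than core clocks. To get cycles you would multiply by the core frequency in GHz, which assumes the frequency is fixed (turbo and power management disabled). On non-x86-64 targets the sketch falls back to plain C adds, which only mimic the dependency shape.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define CHAIN_LEN 8 /* dependent adds per loop iteration (illustrative choice) */

/* Run a chain in which every add depends on the previous result.
   With b == -a the value returns to its start after each pair of adds. */
static float run_addss_chain(float x, float a, float b, uint64_t iters) {
    assert(iters > 0);
    for (uint64_t i = 0; i < iters; i++) {
#if defined(__x86_64__)
        /* AT&T syntax: "addss src, dst". %0 = x, %1 = a, %2 = b. */
        __asm__ volatile(
            "addss %1, %0\n\t"
            "addss %2, %0\n\t"
            "addss %1, %0\n\t"
            "addss %2, %0\n\t"
            "addss %1, %0\n\t"
            "addss %2, %0\n\t"
            "addss %1, %0\n\t"
            "addss %2, %0\n\t"
            : "+x"(x)
            : "x"(a), "x"(b));
#else
        /* Portable stand-in (assumption): same dependency shape, but the
           compiler picks the instructions. */
        for (int k = 0; k < CHAIN_LEN / 2; k++) {
            x = x + a;
            x = x + b;
        }
#endif
    }
    return x;
}

/* Average CPU time per add in nanoseconds, as reported by clock(). */
static double ns_per_add(uint64_t iters) {
    /* volatile inputs keep the compiler from folding the chain away */
    volatile float start = 1.0f, pos = 0.5f, neg = -0.5f;
    clock_t t0 = clock();
    volatile float result = run_addss_chain(start, pos, neg, iters);
    clock_t t1 = clock();
    assert(result == 1.0f); /* 1.0 + 0.5 - 0.5 is exact in binary floating point */
    return (double)(t1 - t0) * 1e9 / (double)CLOCKS_PER_SEC
           / ((double)iters * CHAIN_LEN);
}
```

Calling `ns_per_add(100000000)` from `main` and multiplying the result by the core clock in GHz gives an estimate of clocks per dependent add. The loop counter and branch run in parallel with the chain, so with eight dependent adds per iteration the chain latency should dominate.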

Thanks for any advice..

Perfwise

perfwise · 08-25-2012

Hi,

Just inquiring whether anyone at Intel knows the answer to the question above. It's quite easy to build this code, run it, measure the latency of the instruction, and confirm or correct me. Thanks for any helpful response..

Perfwise

TimP · 08-25-2012

According to what we were told yesterday, core update 3 instruction latencies haven't been validated for publication. Yes, I find it even easier to make a mistake on those update numbers than on the internal model nicknames which we have been asked not to use.
The primary emphasis is on the vmulss, vmulps, and the like, which immediately zero the upper register contents rather than preserving dependencies on previous instructions. The compilers don't necessarily account for those differences yet. You might check the generated code, particularly if you aren't using intrinsics. The compiler would insert instructions such as xorps to break those hidden dependencies for SSE.
I don't know if the legacy SSE instruction latencies would be quoted if they should come out different from the AVX ones.

perfwise · 08-25-2012

Tim,

Yeah, but I'm coding in assembly and have about 2500 different tests which measure latency and throughput, and they're accurate on x86 architectures. I've noted that both the SSE and AVX forms of addss and addps come up as 1 clk longer in latency on my Ivybridge processor than on SandyBridge. Can you confirm on your end that you see the same?

The opt guide in the Intel developer PDF says there's no difference in latency, but something's amiss. I could code up an assembly test for you if it's too much work, but it's quite simple to build a test to do this. I just want to confirm what I'm seeing. FP load latency is the same as it was on SB, as is the integer load latency (int is 4 clks, fp looks like 5-6 clks [lea latencies interfere with this, though, because of the form I'm using and your implementation]).

Let me know.. and maybe tomorrow I'll post up an assembly test you could run.. but thought I'd bring this to your attention.

Perfwise

perfwise · 08-26-2012

Tim/Intel,

This latency issue was bothering me. So I dug in for a half hour this morning and built a simple asm test you can use to see what I'm talking about. Interestingly.. the behavior appears to be different on SB and on IvyB.

I've added 2 files.. just compile with gcc and run this test. You'll note you get a latency of 4 clocks.

Now.. if you comment out lines 21 and 22.. which removes the replicated floats in bits 32-127, you'll see you get a latency of 3 clocks. Replicating the dest had no effect on latency.. but the sources.. having non-zero bits in 32-127, appears to extend the latency of addss by 1 clock, and likely mulss as well. I doubt this is the case very many times.. but still.. can you inform me why this is the case? What changed from SB to IvyB?

ADDSS only needs bits 0-31; it doesn't rely upon the value of source 2 in bits 32-127, so why does this matter?
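The comparison described above (sources with zero upper bits versus sources with the float replicated into bits 32-127) can be sketched as follows. This is not the attached test, which is not reproduced in the thread. The names `addss_chain`, `time_chain_ns` and `ADDS_PER_ITER` are hypothetical, the 128-bit type comes from the GCC/Clang vector extension, and timing again uses `clock()`, so the result is in nanoseconds rather than clocks. The non-x86-64 branch is only a stand-in that keeps the sketch compilable elsewhere.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* GCC/Clang vector extension: four packed floats, 128 bits. */
typedef float v4sf __attribute__((vector_size(16)));

#define ADDS_PER_ITER 4 /* dependent adds per loop iteration */

/* Dependent addss chain on the low lane of x; a and b are the two sources. */
static v4sf addss_chain(v4sf x, v4sf a, v4sf b, uint64_t iters) {
    assert(iters > 0);
    for (uint64_t i = 0; i < iters; i++) {
#if defined(__x86_64__)
        /* AT&T syntax: "addss src, dst"; only bits 0-31 of dst change. */
        __asm__ volatile(
            "addss %1, %0\n\t"
            "addss %2, %0\n\t"
            "addss %1, %0\n\t"
            "addss %2, %0\n\t"
            : "+x"(x)
            : "x"(a), "x"(b));
#else
        /* Portable stand-in (assumption): low-lane adds in plain C. */
        x[0] = x[0] + a[0];
        x[0] = x[0] + b[0];
        x[0] = x[0] + a[0];
        x[0] = x[0] + b[0];
#endif
    }
    return x;
}

/* Average CPU time per add in nanoseconds.
   replicate == 0: sources have all-zero bits in lanes 1-3.
   replicate != 0: the low-lane value is copied into lanes 1-3. */
static double time_chain_ns(int replicate, uint64_t iters) {
    float ua = replicate ? 0.5f : 0.0f;
    float ub = replicate ? -0.5f : 0.0f;
    v4sf a = {0.5f, ua, ua, ua};
    v4sf b = {-0.5f, ub, ub, ub};
    volatile float start = 1.0f; /* volatile: keeps the chain from being folded */
    clock_t t0 = clock();
    v4sf x = {start, 0.0f, 0.0f, 0.0f};
    x = addss_chain(x, a, b, iters);
    volatile float low = x[0];
    clock_t t1 = clock();
    assert(low == 1.0f); /* 1.0 + 0.5 - 0.5 is exact */
    return (double)(t1 - t0) * 1e9 / (double)CLOCKS_PER_SEC
           / ((double)iters * ADDS_PER_ITER);
}
```

Comparing `time_chain_ns(0, n)` with `time_chain_ns(1, n)` for a large `n` on a fixed-frequency core shows whether the upper source bits change the time per add. If the observation in this thread reproduces, the second number would be larger by roughly the ratio 4:3 on IvyB and the two would match on SB.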

Perfwise

TimP · 08-26-2012

I'm not an expert on this question; I have remote access to an IVB single CPU lab machine when required, but my main job has been MIC/Phi.
You're correct AFAIK that IVB/core i7-3 is advertised as not changing latencies, aside from improvements in divide/sqrt/unaligned 256-bit. We haven't reached the stage of seeing in practice whether more cores will be available.
As I pointed out, performance of AVX-128/256 is probably a higher priority than SSE2, and the treatment of both AVX and SSE by production compilers is probably of greatest importance to most people; there are potentially significant differences in those compiler treatments. From a personal point of view, I would not be surprised by a small change in performance of "legacy" (SSE/SSE2) instructions. I looked superficially at compiler treatments of SNB vs. IVB and didn't see likely distinctions made such as might take advantage of (usually minor) changes in IVB.
You may have intended to attach something, which didn't happen.

perfwise · 08-27-2012

Tim,

The files are attached now. I also believe the observations are the same for VADDSS and VMULSS, so this isn't an SSE-only statement. Though, as I said, it's unlikely to have much performance impact.. I just wanted to know why I'm observing this. Thanks..

Perfwise