We are getting about 23% better single-thread performance from Haswell over Ivy Bridge at the same clock speed on our server workload. Running VTune General Exploration I found that Haswell produced 1/4 of the Icache misses produced on Ivy Bridge. The number of branch mispredictions was about the same (and fairly low for a server app with few small loops). Since both processors have the same size top-level Icache, what is the explanation? In Intel advertising literature I see that Haswell "Initiates TLB and cache misses speculatively" and "Handles cache misses in parallel to hide latency", but no further specifics on Icache changes.
Since I always look gift horses in the mouth, does anyone have an explanation? Are there any other VTune counters I can use to shed some light on Haswell's remarkable performance improvement?
Answering these questions is probably complicated.
First some background questions... can you tell me the brand string of each of the processors?
And which events does VTune use for iCache misses? Is it the same event on both processors?
And some general 'server performance analysis' questions... forgive me if you are familiar with all this stuff but the questions might help others (even if they are old hat to you) and the server experts would ask me if I had asked you anyway...
You are looking at single threaded performance... is the application 1) really running just 1 thread on 1 cpu (so 1 thread total) or 2) are you really using all cpus but just using 1 thread per cpu (so total threads = # cpus)?
If you are using just 1 thread total, are you pinning the thread to a cpu or can the thread migrate around from one cpu to another cpu?
If the thread can migrate then you might get more icache misses for the case where the thread is migrating.
Is turbo mode enabled? If turbo is enabled on haswell and not on ivybridge, then haswell would do better (all other things being equal).
Lastly, you say that performance is 23% better on haswell. Is the better performance as reported by the application (like on ivb you get 100 transactions/sec and on hsw you get 123 transactions/sec)?
And do you also execute 23% more instructions_retired/sec on haswell? (This assumes that you don't have some polling in the software and that the number of instructions_retired/sec is proportional to work done.)
These questions will help me see if you are REALLY seeing better performance or just looking at vtune counters (which have to be taken with a grain of salt).
As Pat said, counter data has to be interpreted with care, because the counters are collected globally and are not restricted to a specific IP. Small loops can be detected by the LSD, and if your application contains such loops with no more than 28 decoded micro-ops, then there is no need to fetch them from the Icache and decode them again. However, I think the same functionality is also available on Ivy Bridge. Can you compare ITLB misses on both CPUs?
I am also considering the possibility that the instruction cache already holds some of the same micro-ops as your executing program, so that those micro-ops, when fetched, could be discarded by some logic because they are already present in the cache. Think of the mov assembly instruction decoded into micro-op(s): it will be part of every program.
Thank you, Patrick and Iliya, for your prompt replies.
The Ivy Bridge is Xeon E5-2690 V2 @ 3.00GHz and the Haswell is Core i7 4770S @ 3.10GHz. Turbo-mode and hyperthreading are disabled on both processors. The threads are affinitized (pinned) to particular cores, and do not move around. There are multiple threads, but most execution is concentrated in a single thread for this test (98.7% on Ivy Bridge, 94% on Haswell); the other threads are mostly idle.
Windows Server 2008 R2 was used on the Ivy Bridge and Windows Server 2012 R2 was used on the Haswell, but this particular test spends little time in the OS. In particular, the threads are affinitized to particular processors in advance, so Microsoft's algorithms for allocating threads to processors aren't used.
I have attached to this post spreadsheets extracted from VTune's summary page for both processor runs, allowing comparison of counter values where possible.
I executed VTune by manually attaching it to the running process, manually starting the benchmark, and then manually stopping VTune after the benchmark finished. So we got 609e9 INST_RETIRED.ANY for Haswell and 593e9 for Ivy Bridge, which are close but not exactly equal. The CPU_CLK_UNHALTED.THREAD is 295e9 for Haswell and 347e9 for Ivy Bridge, giving CPIs of 0.484 and 0.585 respectively. So Haswell needs about 17% fewer cycles per instruction. The benchmark itself reports a proprietary measure of work done per processor second. This is roughly 23% better for Haswell, so VTune and the proprietary measure are both in the same ballpark.
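For anyone following along, the CPI arithmetic above can be checked with a few lines of Python. This is only a sketch: the counter values are the rounded figures quoted in this post rather than the raw VTune output, and the names `counters` and `cpi` are made up for illustration.

```python
# Rounded counter values quoted in the post (not the raw VTune output).
counters = {
    "Haswell":    {"inst_retired": 609e9, "clk_unhalted": 295e9},
    "Ivy Bridge": {"inst_retired": 593e9, "clk_unhalted": 347e9},
}

def cpi(c):
    """Cycles per instruction: unhalted core cycles / instructions retired."""
    return c["clk_unhalted"] / c["inst_retired"]

hsw_cpi = cpi(counters["Haswell"])
ivb_cpi = cpi(counters["Ivy Bridge"])

# The same ratio expressed two ways: fewer cycles per instruction,
# or more instructions per cycle.
cpi_reduction = 1 - hsw_cpi / ivb_cpi
ipc_gain = ivb_cpi / hsw_cpi - 1

print(f"Haswell CPI    = {hsw_cpi:.3f}")
print(f"Ivy Bridge CPI = {ivb_cpi:.3f}")
print(f"Cycles per instruction: {cpi_reduction:.0%} lower on Haswell")
print(f"Instructions per cycle: {ipc_gain:.0%} higher on Haswell")
```

With these inputs the CPIs come out at 0.484 and 0.585. The 17% figure is the drop in cycles per instruction; the same ratio read as throughput is about 21% more instructions per cycle, which is the number to set against the benchmark's 23%.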
The ICACHE.MISSES counter is 2.28e9 on Haswell and 8.48e9 on Ivy Bridge, the most glaring discrepancy. The BR_MISP_RETIRED.ALL_BRANCHES_PS counter is 520e6 on Haswell and 516e6 on Ivy Bridge, not much different. Our application is profile-guided whole-program optimized and inlined by the Intel compiler, so we are used to seeing these good branch prediction results. We are also used to seeing lots of Icache misses.
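Since the two runs retired slightly different instruction counts, it can help to normalize the event counts per 1000 retired instructions before comparing them. The Python sketch below does that with the rounded values from this thread; the dictionary and function names are hypothetical, chosen only for this example.

```python
# Rounded counter values quoted in the thread.
inst_retired   = {"Haswell": 609e9,  "Ivy Bridge": 593e9}
icache_misses  = {"Haswell": 2.28e9, "Ivy Bridge": 8.48e9}
br_mispredicts = {"Haswell": 520e6,  "Ivy Bridge": 516e6}

def per_kilo_instruction(events, instructions):
    """Event count normalized to 1000 retired instructions."""
    return 1000.0 * events / instructions

icache_mpki = {cpu: per_kilo_instruction(icache_misses[cpu], inst_retired[cpu])
               for cpu in inst_retired}
branch_mpki = {cpu: per_kilo_instruction(br_mispredicts[cpu], inst_retired[cpu])
               for cpu in inst_retired}

for cpu in inst_retired:
    print(f"{cpu:10s}  Icache misses/1000 inst = {icache_mpki[cpu]:5.2f}"
          f"  branch mispredicts/1000 inst = {branch_mpki[cpu]:4.2f}")

# How many times more Icache misses per instruction Ivy Bridge takes.
icache_ratio = icache_mpki["Ivy Bridge"] / icache_mpki["Haswell"]
print(f"Ivy Bridge / Haswell Icache miss ratio = {icache_ratio:.1f}x")
```

Normalized this way, the Icache miss rate is roughly 3.7 per 1000 instructions on Haswell versus roughly 14.3 on Ivy Bridge, while the branch misprediction rates stay within a few percent of each other.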
The ITLB_MISSES_WALK_DURATION counter is 2.49e9 on Haswell and 2.71e9 on Ivy Bridge, slightly better on Haswell.
There are almost no small loops (only very large loops) in the application, so the loop stream detector would be ineffective for this test. Because hyperthreading is disabled, the Haswell instruction decode queue would have length 56, as opposed to 28 on Sandy Bridge. However, I think that Ivy Bridge's IDQ also has length 56 when hyperthreading is disabled. In any event, I don't think we have many loops that could fit into 56 micro-ops.
I have read that Haswell's micro-op cache (DSB) has the same size and structure as Ivy Bridge's. Furthermore, I've read that the DSB is "included" in the Icache, so a hit in the DSB would also hit the Icache. So this can't be the source of the difference in Icache misses between Haswell and Ivy Bridge.
Thanks again for helping me!
Sorry about the previous message. Not sure what happened there.
In any case, the spreadsheet (re-attached) has a side by side comparison of your 2 systems: ivb and hsw. I'm not sure if this upload worked...
I computed some quantities such as pathlength (instructions_retired per 'transaction'). It shows that IVB uses about 1.2x more instructions per transaction. This seems odd... perhaps the different OS's are having an impact or the extra instructions are getting executed on IVB's 6 more cores/socket.
Also, if the IVB system has 2 sockets and the HSW system has just 1 socket, if the IVB 2nd socket is mostly idle, it can increase the latency of cache misses on the non-idle socket. If you run a low priority spinner loop app on the 2nd socket, then you can eliminate this variable as a possible explanation for the performance difference. The spinner app keeps the socket from going to sleep... going to sleep increases the snoop latency time... servers aren't built so much for their single threaded mode performance.
Can you measure the frequency of context switches with Xperf? It is possible that a more privileged thread, or even the VTune clock timer interrupt handler and its DPC routine, is executed on the same core as your app on the IB system, thus polluting the Icache. Regarding the loops, can you post their disassembly? Pat's explanation is also quite plausible.
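A spinner of the kind described above could look something like the following Python sketch. It is an assumption, not a tool that ships with VTune: the function names `pin_low_priority` and `spin` are made up, the CPU index is a placeholder you would replace with a core on the otherwise idle socket, and the Windows branch calls the Win32 functions SetThreadPriority and SetThreadAffinityMask through ctypes.

```python
import os
import sys
import time

def pin_low_priority(cpu):
    """Best effort: pin the calling thread to `cpu` and lower its priority.

    Returns True if the affinity request was accepted, False otherwise.
    """
    try:
        if sys.platform == "win32":
            import ctypes
            from ctypes import wintypes
            k32 = ctypes.WinDLL("kernel32", use_last_error=True)
            k32.GetCurrentThread.restype = wintypes.HANDLE
            k32.SetThreadPriority.argtypes = [wintypes.HANDLE, ctypes.c_int]
            k32.SetThreadAffinityMask.argtypes = [wintypes.HANDLE, ctypes.c_size_t]
            k32.SetThreadAffinityMask.restype = ctypes.c_size_t
            thread = k32.GetCurrentThread()
            THREAD_PRIORITY_IDLE = -15  # value from the Win32 headers
            k32.SetThreadPriority(thread, THREAD_PRIORITY_IDLE)
            # Returns the previous mask, or 0 on failure.
            return k32.SetThreadAffinityMask(thread, 1 << cpu) != 0
        if hasattr(os, "sched_setaffinity"):  # Linux fallback
            os.sched_setaffinity(0, {cpu})
            os.nice(19)
            return True
    except OSError:
        pass
    return False

def spin(seconds):
    """Busy-wait for `seconds`; returns the number of loop iterations."""
    deadline = time.perf_counter() + seconds
    iterations = 0
    while time.perf_counter() < deadline:
        iterations += 1
    return iterations
```

To use it, call `pin_low_priority(n)` with a core index on the second socket, then `spin(duration)` for at least as long as the benchmark runs. The Windows branch assumes a single processor group (no more than 64 logical processors).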
Thanks for looking at the spreadsheets! The benchmark does a fixed amount of work and runs for a variable measured amount of processor time. So it always does the same number of "transactions". Haswell had 609e9 instructions retired and Ivy Bridge had 593e9. These are within 3% of each other, so given that I started and stopped VTune manually, both processors executed about the same number of instructions on the benchmark. I probably drank more coffee before I ran the Ivy Bridge benchmark!
The benchmark reports about a 23% better "transactions/processor second" measure on Haswell than on Ivy Bridge. If we take the reciprocal of CPI we get instructions/clock (IPC), which is a "work/unit time" metric like the benchmark's report. This gives 2.07 IPC for Haswell and 1.71 IPC for Ivy Bridge, i.e. Haswell is about 21% better than Ivy Bridge. So both VTune and the benchmark report roughly the same performance increase. Since I got the 23% better benchmark report when VTune wasn't running, it isn't a VTune artifact.
This particular benchmark doesn't stress the memory subsystem, only the on-chip caches. For example, MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM is only 2e6 from the spreadsheet, which averages roughly once every 250,000 instructions. The Ivy Bridge has two sockets and the Haswell has only one. Also, the Ivy Bridge has 1333MHz DRAM while the Haswell has 2133MHz DRAM. If I slow down the Haswell's memory to match the Ivy Bridge at 1333MHz then the Haswell performance on this benchmark drops by only about 2%.
The code working set size is typically much greater than the 32KB ICache, definitely smaller than the 3rd level cache, and probably smaller than the 2nd level cache. The only loops of any consequence are at a very high level, involving calls through many layers of software (DBMS, etc.), and are too long to analyze in assembly language. The only small loops are for things like decimal-binary conversion, formatting, block moves, and traversing very short linked lists. They make up only a few percent of execution time.
Using Xperf, I don't detect any appreciable interference from other threads. We do have timer threads which interrupt the main thread every millisecond, however these execute very little code and are identically present on both Ivy Bridge and Haswell.
I think this leaves Haswell's improved ICache prefetch algorithms as the probable cause of the performance increase. For example, IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE is roughly halved when going from Ivy Bridge to Haswell, probably a result of the four-fold decrease in Icache misses.
I agree with you that the improvement can be due to a better cache architecture on Haswell. One question: what are the main hotspots of your program? If they are loops, do they have predictable indexing?
Which core did your application mostly use? Did you set affinity to a specific logical CPU? It is strange that only your app's timer threads were scheduled to run and thus interrupted the main thread of execution. If one of your threads was scheduled to run on cpu0, which is mainly used to execute ISRs and DPCs, the counter readings could be skewed.