Analyzers
Talk to fellow users of Intel Analyzer tools (Intel VTune™ Profiler, Intel Advisor)

Comparing performance on different Intel architectures

Dny
Beginner
Hello,

I want to compare the performance of some applications on Clovertown and Nehalem. I observe that not all applications show a good performance improvement on Nehalem compared to Clovertown. The improvement ranges from 0 to 10% for most applications, and only a very few applications show more than 50% improvement.

So I want to find out why there is such a large difference in performance improvement between applications.

I tried sampling the applications with VTune on Nehalem as well as on Clovertown, but I observe that there is a big difference in the VTune events supported by these architectures, so how can I compare the sampling numbers gathered from VTune?

E.g., Clovertown has 4 MB of L2 cache, while Nehalem has 256 KB of L2 and 8 MB of L3, so we can't directly compare the L2 sampling events of Clovertown and Nehalem.

My question now is: how should I compare the performance-improvement differences between applications with respect to Clovertown and Nehalem?

Thanking you,

Regards,
Dny
TimP
Honored Contributor III
Quoting - Dny

Evidently, you figured out how to use clock ticks to compare time spent in hot spots. You're correct that Clovertown L2 cache events can't always be compared directly with corresponding events, such as L3 events, on Nehalem. In fact, where Nehalem doesn't speed up, there may be events, countable by VTune or not, which are more significant on the newer CPU. So you will want to examine hot spots where there may be potential improvement; you may be wasting your time looking at Clovertown events.
Some common issues and remedies:

DTLB misses on Nehalem invalidate L1 and L2 cache, sometimes in situations where only L1 would be invalidated on Clovertown. Remedies are to get better data locality or to thread the code.
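
For instance, a minimal C sketch of the data-locality point (the matrix and its size are hypothetical): a column-major walk through a large C matrix touches a different page on nearly every access, while a row-major walk uses each page and cache line fully before moving on.

```c
#include <stddef.h>

#define ROWS 4096
#define COLS 4096
static double m[ROWS][COLS];       /* each row is 32 KB, i.e. 8 pages */

/* Column-major walk: successive accesses are 32 KB apart, so nearly
   every load hits a different page and stresses the DTLB. */
double sum_cols(void)
{
    double s = 0.0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            s += m[i][j];
    return s;
}

/* Row-major walk: unit stride, so each 4 KB page (and each cache
   line) is consumed completely before the next one is touched. */
double sum_rows(void)
{
    double s = 0.0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            s += m[i][j];
    return s;
}
```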

Nehalem may be more dependent on vectorization, and may even benefit from splitting off vectorizable bits of code where Clovertown did not. Neither CPU has direct counters for the relevant events, to my knowledge; I would like to know if I am wrong.

Fill buffer thrashing, which the compiler corrects automatically where it performs vectorization, may actually cause Nehalem to run slower than Clovertown. It occurs when you alternate stores among too many partial cache lines, as when you store to 10 or more arrays per loop. There is a resource-busy stall event which goes high in such cases, but it's an ambiguous indicator.

Store and reload, even though the data objects match so that there is no adverse stall event, is a frequent cause of Nehalem not speeding up. It can be corrected by dictating scalar replacement in the source code: write into a scalar temporary. The GNU and Sun compilers perform these optimizations automatically in many situations where the Intel and Microsoft compilers don't, provided that appropriate restrict qualifiers are used in C99 (or g++) code. Of course, I assume you have fixed any bugs in your code which prevent you from using icc -ansi-alias? There are methods for measuring latencies in VTune, but you should be able to see these store-and-reload latencies in the asm clock-tick view.
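
A minimal sketch of the scalar-replacement idea (the functions and names are hypothetical, not from any real code): in the first version the compiler must assume acc might alias a, so the running sum is stored to and reloaded from memory on every iteration; the second version keeps it in a register.

```c
/* Store-and-reload pattern: because acc could alias a, the compiler
   reloads and re-stores *acc on every trip through the loop, and the
   latency shows up on those load/store instructions in the profile. */
void sum_naive(const float *a, int n, float *acc)
{
    *acc = 0.0f;
    for (int i = 0; i < n; i++)
        *acc += a[i];               /* reload, add, store, each iteration */
}

/* Scalar replacement written into the source: accumulate in a
   register-resident temporary and store once.  The C99 restrict
   qualifiers also tell the compiler the pointers don't alias. */
void sum_scalar(const float *restrict a, int n, float *restrict acc)
{
    float s = 0.0f;                 /* scalar temporary */
    for (int i = 0; i < n; i++)
        s += a[i];
    *acc = s;                       /* single store at the end */
}
```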

"optimizations" specific to the SSSE3 option are favorable for Clovertown, but may be unfavorable for Nehalem. If you wish to get better performance on Nehalem and AMD CPUs instead, you would use SSE3. This pertains to parallel SSE code, where SSSE3 makes special efforts to construct unaligned 16-byte data objects from 2 aligned pieces. Nehalem handles this with a new hidden hardware register, and does it better with SSE2 or SSE3 instructions.
Dny
Beginner
Quoting - tim18

Hello Sir,
Thanks for your reply.
Could you please explain in more detail what "fill buffer thrashing" and "store and reload" are?
How do they affect performance on Nehalem, and how can their effects be minimized?
What about using SSE4.2 on Nehalem? Is it better than SSSE3 on Clovertown?

Thanking you,

Regards,
Dny

TimP
Honored Contributor III
I notice that if you put "fill buffer" in the search window, you turn up mostly references to previous forum discussions. I don't really understand why it hasn't been documented better.
All IA CPUs, beginning with the P4, have used some write-combining buffer scheme; since Merom, it has been called the "fill buffer." The scheme was not invented by Intel; the later SPARC chips appeared to have a scheme with a similar effect.
When data are written, they are accumulated in cache-line-sized fill buffers, in order to collect full cache lines before updating cache and memory. If your loop alternates writes among more cache lines than the fill buffers can accommodate, fill buffers must be flushed prematurely so that a buffer can be reused for another cache line. This can produce a thrashing situation where, instead of being an advantage, the fill buffer scheme hurts performance.
I brought this up in the context of your question, since fill buffer thrashing is one of the observed situations where code doesn't speed up as it should on the "Nehalem" CPUs.
The remedy is to split up loops so they don't write into too many partial cache lines. Where you are filling multiple arrays, it should be advantageous to organize loops so that 6 to 9 arrays are stored by each loop. If HyperThreading is in use, the resources are shared competitively by both logical processors on a core. When Intel compilers perform auto-vectorization, they make automatic decisions about loop splitting, called "distribution." There are also DISTRIBUTE POINT pragmas to direct the compiler where to split, or not to split, loops.
If you find the original Intel recommendations for programming the P4, they said you should not store into more than 2 array segments per loop. That gave the best chance of seeing an advantage from HyperThreading on that CPU, but usually the loss from splitting the loops was more than could be regained by HyperThreading. On more recent CPUs, even when optimizing for HyperThreading on Nehalem, you would not split loops down to fewer than 4 arrays per loop.
In the biggest case of Nehalem fill buffer thrashing I ran into, there was a potentially vectorizable segment in a very large loop which swapped data among 9 arrays. Changing the source code so that a new loop started with this vectorizable segment encouraged ifort to split the 9-array vector swapping into its own loop. The time spent in that subroutine was reduced by 20% on Harpertown and by 60% on Nehalem.
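
To make the loop splitting concrete, here is a minimal C sketch (the array names and counts are hypothetical, not from the case above): the first version holds ten partially written cache lines open at once; the split version keeps each loop within the 6-to-9-array range suggested above, at the cost of reading the source array twice.

```c
/* Thrash-prone: ten output streams per iteration means ten partial
   cache lines competing for the fill buffers at the same time. */
void fill_all(float *a, float *b, float *c, float *d, float *e,
              float *f, float *g, float *h, float *p, float *q,
              const float *s, int n)
{
    for (int i = 0; i < n; i++) {
        a[i] = s[i]; b[i] = s[i]; c[i] = s[i]; d[i] = s[i]; e[i] = s[i];
        f[i] = s[i]; g[i] = s[i]; h[i] = s[i]; p[i] = s[i]; q[i] = s[i];
    }
}

/* Distributed: each loop writes only five streams, so the set of
   open cache lines fits in the fill buffers.  An Intel compiler can
   be directed to split at a chosen point with a DISTRIBUTE POINT
   pragma instead of rewriting the loop by hand. */
void fill_split(float *a, float *b, float *c, float *d, float *e,
                float *f, float *g, float *h, float *p, float *q,
                const float *s, int n)
{
    for (int i = 0; i < n; i++) {
        a[i] = s[i]; b[i] = s[i]; c[i] = s[i]; d[i] = s[i]; e[i] = s[i];
    }
    for (int i = 0; i < n; i++) {
        f[i] = s[i]; g[i] = s[i]; h[i] = s[i]; p[i] = s[i]; q[i] = s[i];
    }
}
```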

The SSE4.2 compiler option was introduced to support Nehalem, but it rarely does anything special for Nehalem. It doesn't even remove the special SSSE3 code which is useful for Clovertown but not for Nehalem. You would get the same SSE4.1 optimizations with either the SSE4.1 or SSE4.2 option, and those tend to be more beneficial on Nehalem than on the first SSE4.1 CPUs.
gcc also has -msse4 optimizations which don't work on Clovertown but are useful for Nehalem.
Sun compilers accept sse4_1 and sse4_2 flags, but I have never seen them produce code different from sse3. With all compilers, if you want to optimize the same binary for both Clovertown and Nehalem, you would use sse3, except in the exceptional situation where you have code which could benefit from providing both an ssse3 path and an sse4 path, chosen according to the CPU.
There is no ssse3 optimization in gcc or Sun compilers.
I have perhaps already written too much about the store-reload penalty for missed scalar replacement on the AVX forum.
srimks
New Contributor II
Quoting - Dny

The events generated by the Intel Xeon 5560 (Nehalem) and the Xeon 5345 (Clovertown) are totally different. L3 is a shared cache in the 5560, but the 5345 has only L1 and L2. So a comparison of the cache structures and EBS events of the 5560 and 5345 won't give you any landmark; focusing on the type of application being executed on both of these processors and digging into it would be a better use of your time.

Secondly, analyzing vectorization on both processors would show a second set of differences, if you can interpret them.

Thirdly, try analyzing QPI saturation on the 5560, if any.

Fourthly, if you want to see how a program compares on two architectures, the place to start is a differential hot spot analysis of the same program running on comparable instances of the architectures, to see how individual functions scale. You might see uniform scaling, or you might see some hot spots get hotter or cooler. Focus on those and drill down to the source-code level to find the regions that are taking more or less time within the function. These changing hot spots are the most important to understand, since they'll have the biggest effect on your program.

The next step is figuring out what's delaying the instructions; this has been called various things, such as cycle accounting or stall analysis. Though the architectures are different, they are also similar, and they have similar debug events that may be more or less effective in determining the state of the corresponding stages: all have a front end (instruction decoding) and a back end (resource scheduling, dispatch, retirement), but the number of events of interest, particularly with the 5560 processor, is too large to enumerate here. A couple of Intel tools provide the means to compare runs directly: both Intel Parallel Amplifier and PTU offer this. Intel VTune Performance Analyzer 9.1 Update 2 for Linux contains predefined ratios for 5560 processors.

Fifth, Intel says that the 5560 exhibits "greater parallelism" by increasing the number of instructions that can be run out of order, which enables more simultaneous processing and more overlapping of latency. Intel has increased the size of the out-of-order window and scheduler, giving a wider window. Intel has also increased the size of other buffers (probably Intel folks can be more descriptive on this) in the core to ensure they wouldn't become a limiting factor. These factors could also be analyzed on the 5345 processor.

Sixth, the 5560 claims improved branch prediction, reducing the effective penalty of branch mispredictions relative to prior processors. Intel has added a second-level Branch Target Buffer (BTB) for this. Check this.

Seventh, the 5560 claims improved hardware prefetch and better load-store scheduling, which reduce memory access latency. Check this too.

Eighth, the 5560 removes much of the performance impact of using unaligned instructions; please try checking this. A lot has been said about fast unaligned loads and cache improvements in the 5560; the question is whether they can be verified with the types of applications Intel is talking about.
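
As an illustration of that unaligned-load point, a small SSE intrinsics sketch (the loop is hypothetical): when the arrays are not 16-byte aligned, the unaligned load form must be used, and this is exactly the case the 5560 is said to handle much more cheaply than its predecessors.

```c
#include <xmmintrin.h>   /* SSE intrinsics */

/* dst[i] = src1[i] + src2[i], four floats at a time.  _mm_loadu_ps
   tolerates any alignment; on pre-Nehalem cores it was markedly
   slower than the aligned _mm_load_ps, while Nehalem is claimed to
   largely remove that penalty. */
void add_unaligned(float *dst, const float *src1, const float *src2, int n)
{
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m128 x = _mm_loadu_ps(src1 + i);   /* unaligned 16-byte load */
        __m128 y = _mm_loadu_ps(src2 + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(x, y));
    }
    for (; i < n; i++)                       /* scalar remainder */
        dst[i] = src1[i] + src2[i];
}
```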

Note: The idea of having an inclusive shared L3 cache is to increase performance by reducing traffic to the processor cores and by reducing unnecessary core snoops.

Hope you have enough material to compare now.


~BR
Mukkaysh Srivastav

jellyfin
Beginner
When does a Core 2 or i7 use one of its six or so (?) fill buffers (write-combining store buffers) versus the 20 or 32 store buffers it has? What is the difference?

TimP
Honored Contributor III
Any store instruction puts data in a fill buffer, of which there are 10 per core on current CPUs. When necessary, the entire fill buffer cache line is flushed to L1 cache (the DCU). When HT is active, the threads share the fill buffers competitively, so an application which needs more than 6 fill buffers per thread may have difficulty running 2 threads per core.
The smaller numbers you refer to apply several architectures back. Going back to Prescott, there were 8 write-combine buffers, and it was advisable to use only 6, due to the automatic flushing (to L2) of the 2 least recently used buffers. On that CPU, each logical processor had access to only half of the write-combine buffers, even when only 1 thread per core was active.