v9 vs. v8.1 speed comparison - is v9 slower?

jbugden · ‎08-23-2005

I've tried v9 on a large set of scientific applications and see a fairly consistent 5% slowdown across the entire set.

With PGO + SSP in use v9 is almost, but not quite, as fast as v8.1 with PGO alone.

Anyone else encounter code that runs slower on v9?

Are there particular practices to avoid?

maverick6664 · ‎08-29-2005

In my experience with SETI@home, ICC 9.0 is consistently about 10% faster than ICC 8.1, and that is without PGO or SSP.

jbugden · ‎08-31-2005

I've submitted a test case to Premier Support which has a couple of cases where v8.1 is 3X faster than v9. On many more cases, v9.0 is faster than v8.1, but if you have the right mix of cases (and it appears that we do) then you will find v9 slower than v8.1.

More importantly, I find that the performance varies radically depending on how the memory has been allocated, a factor you would expect to be irrelevant (modulo alignment).

TimP · ‎08-31-2005

If you're seeing cache aliasing or DTLB issues, neither compiler deals with those. Particularly when running on a processor which is subject to 64K aliasing, you could easily encounter such issues by accident, when changing compiler, or modifying your source code.
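To check whether two buffers might collide in this way, a minimal sketch follows. It assumes that the conflict arises when two addresses fall on the same cache line offset within a 64 KB window, and that lines are 64 bytes; both constants are assumptions and depend on the processor. The name may_alias_64k is made up for illustration.

```c
#include <stdint.h>

/* Assumptions: 64 KB aliasing window, 64-byte cache lines.
   Adjust both for the processor actually in use. */
#define ALIAS_WINDOW ((uintptr_t)64 * 1024)
#define LINE_BYTES   ((uintptr_t)64)

/* Index of the cache line that holds p, counted within its 64 KB window. */
static uintptr_t line_in_window(const void *p)
{
    return ((uintptr_t)p % ALIAS_WINDOW) / LINE_BYTES;
}

/* Returns 1 if a and b sit on the same line offset of a 64 KB window,
   which is the condition assumed here to trigger 64K aliasing. */
static int may_alias_64k(const void *a, const void *b)
{
    return line_in_window(a) == line_in_window(b);
}
```

Calling may_alias_64k on the base addresses of the arrays a loop reads and writes together gives a quick hint of whether a change in allocation has moved them into conflict.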
As you mentioned data alignment, the 32-bit icc/icpc don't pay any attention to vector alignment, except where you specify it (by declspec, aligned malloc, and the like). Vectorization of misaligned operands could reduce performance. Even with correct alignment, the -O1 space saving vectorization would perform better on short vectors. Windows malloc() doesn't even align 64-bit data types properly. I note that the GNU and Microsoft compilers continue to make more allowances for misaligned 64-bit data types than icc/icpc do.
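As a rough illustration of checking and forcing alignment by hand, here is a portable sketch built on plain malloc(). The helper names (pointer_alignment, is_aligned, alloc_aligned, free_aligned) are invented for this example and are not part of any compiler runtime. It assumes the boundary is a power of two no smaller than sizeof(void *), and it does not guard against size overflow.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Largest power of two that divides the address (0 for a null pointer). */
static size_t pointer_alignment(const void *p)
{
    uintptr_t u = (uintptr_t)p;
    return (size_t)(u & (~u + 1));      /* isolates the lowest set bit */
}

/* Non-zero if p lies on the given power-of-two boundary. */
static int is_aligned(const void *p, size_t boundary)
{
    return ((uintptr_t)p & (boundary - 1)) == 0;
}

/* Over-allocate, round up to the boundary, and stash the raw pointer
   just below the returned block so free_aligned() can recover it. */
static void *alloc_aligned(size_t size, size_t boundary)
{
    void *raw = malloc(size + boundary - 1 + sizeof(void *));
    if (raw == NULL)
        return NULL;
    uintptr_t start = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (start + boundary - 1) & ~(uintptr_t)(boundary - 1);
    ((void **)aligned)[-1] = raw;
    return (void *)aligned;
}

static void free_aligned(void *p)
{
    if (p != NULL)
        free(((void **)p)[-1]);
}
```

Printing pointer_alignment() for each buffer before timing a loop shows what alignment the allocator actually delivered, rather than what was requested.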

jbugden · ‎08-31-2005

I've submitted a test case to Premier Support (Issue #322213) that showcased these discrepancies.

Identical code produces significantly different execution times for virtually identical routines depending on how the memory used by these routines is allocated (e.g. static vs. MS malloc vs. MS _aligned_malloc vs. Doug Lea's (glibc) malloc). The range of outcomes is not explained by alignment alone, though patterns exist. For example, comparing execution times against Doug Lea's malloc aligned at a power of 2, the timings suggest that _aligned_malloc aligns to one power of 2 greater than requested (i.e. 2^(n+1)).
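The kind of comparison described above can be sketched with a small timing helper. This is not the test case submitted to Premier Support; kernel is a hypothetical stand-in for the routines in question, and clock() is only a coarse CPU-time measure.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for the routine under test: a[i] += s * b[i]. */
static void kernel(double *a, const double *b, size_t n, double s)
{
    for (size_t i = 0; i < n; ++i)
        a[i] += s * b[i];
}

/* Runs the kernel reps times on the given buffers and returns the
   elapsed CPU time in seconds. The buffers are supplied by the caller,
   so the same code can be timed on static, malloc'd, or aligned memory. */
static double time_kernel(double *a, const double *b, size_t n, int reps)
{
    clock_t t0 = clock();
    for (int r = 0; r < reps; ++r)
        kernel(a, b, n, 0.5);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Calling time_kernel once with static arrays and once with buffers from each allocator, with n chosen well beyond the L2 cache size, and logging each buffer's address next to its time, makes it easier to see whether the differences follow alignment or something else.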

I'd love to understand what 'breaks' the optimizer on apparently identical pieces of code.

In particular, the implementation using static allocation is 3X faster on v8.1 than it is on v9. It is in fact faster than every other method on either compiler, though each of the other methods is faster on v9 than on v8.1.

Cache and DTLB penalties should be similar for all versions. Note that all datasets are intentionally much larger than the L2 cache.