Anomalous performance differences between MS7 and Linux

dehvidc1 · 01-19-2011

I've been working for a few months on a project that started with a model study using ICC on MS7. After getting good performance improvements there, I'm now introducing the changes to the Linux production system.

The MS7 code was essentially the production system code with a few changes to input and output to make it easier to port. Functionally it was pretty much identical to the production system.

The MS7 system (dual CPU, quad cores, no hyper-threading) used a slightly earlier ICC version. I'm using ICC 12.0 on Linux. I'm building on a Dell cluster where the build node has twin CPUs with quad physical cores and hyperthreading. The cluster nodes I'm timing on have twin CPUs with six physical cores and hyperthreading.

Because the initial Linux timings jumped around (all the logical cores were running processes), I'm now timing with one process per node.

I'm not seeing the same compiler timing improvements I saw on MS7.

On MS7 I had the following averages in seconds (baseline was extrapolated from gcc production system):

Baseline 1152
O2 895
O3 828
O3, fp:fast=2 780
O2, No aliasing, fp:fast=2 629

Total drop of 45% just using compiler switches. All good.

On Linux I have the following averages (absolute numbers shouldn't be compared to MS7, baseline is gcc O2):

Baseline 1591
O2 1191
O3 1264
O2, no alias 1291
O2, fp fast=2 1190
O2, no alias, fp fast=2 1268

So a nice roughly 25% drop using ICC O2 but O3 is slower, fp fast=2 has no impact and no alias also makes the executable run slower!
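
For reference, the percentages quoted above can be reproduced from the two tables. This is only an illustrative sketch: the timings are copied from the post, while the function names `percent_drop` and `report` and the dictionary layout are assumptions made for the example.

```python
# Timings in seconds, copied from the two tables in the post.
ms7 = {
    "Baseline": 1152,
    "O2": 895,
    "O3": 828,
    "O3, fp:fast=2": 780,
    "O2, No aliasing, fp:fast=2": 629,
}
linux = {
    "Baseline": 1591,
    "O2": 1191,
    "O3": 1264,
    "O2 no alias": 1291,
    "O2 fp fast=2": 1190,
    "O2 no alias, fp fast=2": 1268,
}


def percent_drop(baseline, timing):
    """Reduction in run time relative to the baseline, in percent."""
    return 100.0 * (baseline - timing) / baseline


def report(name, timings):
    """Print each configuration with its drop relative to the baseline."""
    base = timings["Baseline"]
    for label, secs in timings.items():
        if label != "Baseline":
            drop = percent_drop(base, secs)
            print(f"{name:5s} {label:28s} {secs:5d} s  {drop:5.1f}% drop")


report("MS7", ms7)
report("Linux", linux)
```

A negative value from `percent_drop` means the configuration ran slower than the one it is compared against, which is the case for O3 relative to O2 on Linux.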

The nodes have more than enough memory for the datasets I've timed and the CPUs are 100% utilised, i.e. the application is compute bound.

The optimisation effects don't look right particularly given my experience on MS7.

Any suggestions?

Regards

David

TimP · 01-20-2011

Setting KMP_AFFINITY so as to use at most 1 thread per core, and keep each thread to 1 core, could be useful in timing experiments, even when you limit it to 1 thread. I'm still not convinced that stable performance results can always be achieved on 6 core with HT.
Have you looked at opt-report to check that these option changes actually make a difference at compile time?
My own expectations of some of these options:
fast=2 would invoke complex limited range. Without complex, I don't know why it would be worth while.
O3 invokes loop optimizations such as swaps and fusion so as to promote vectorization. You should see the difference, if any, in the vec-report.
If you have expressions which depend on observance of parentheses, all bets are off with fp:fast. Remember that these options are spelled differently in the linux compiler.
With the 12.0 compiler, many of the optimizations which would be killed by -fp-model source can be restored for an individual for loop by the #pragma simd, so you may be able to get optimizations which are broken by -fp-model fast together with some which are broken by -fp-model source.

dehvidc1 · 01-23-2011

Tim: Setting KMP_AFFINITY so as to use at most 1 thread per core, and keep each thread to 1 core, could be useful in timing experiments, even when you limit it to 1 thread. I'm still not convinced that stable performance results can always be achieved on 6 core with HT.

I think I'm going to have to do this. I did a timing run on Friday on a single node, with each single-threaded process having the node to itself, i.e. one process finishes and then the next starts. (Fine way to use all this expensive gear :)

Results were (s):

664
665
557
556
663

Interesting that there were two values about which the timing results clustered. What the scheduler is doing to produce this spread is intriguing. I would prefer to avoid using numactl etc. as this means a change to current production system practices. There is a much better chance of getting changes adopted, and of seeing them used successfully, if practice changes can be avoided.

On a wider note I wonder what this is going to do to performance requirements in software delivery contracts. It's common for large government and corporate software contracts to have performance requirements. I guess they will have to have a decent plus or minus tolerance.

Tim: Have you looked at opt-report to check that these option changes actually make a difference at compile time?

As the runtimes changed I assumed the switches were having an impact albeit not the one I was hoping for :)

Tim: My own expectations of some of these options:
fast=2 would invoke complex limited range. Without complex, I don't know why it would be worth while.

Gave a good result on MS7. I'd like to do some instruction level analysis to characterise the differences caused by these switches. Both in terms of the code generated and the difference in numbers of instructions executed. Just a matter of available time.

Tim: O3 invokes loop optimizations such as swaps and fusion so as to promote vectorization. You should see the difference, if any, in the vec-report.

Is that the extent of O3's effects? I thought from the documentation there might be a bit more than that. But it is a bit unclear.

Tim: With the 12.0 compiler, many of the optimizations which would be killed by -fp-model source can be restored for an individual for loop by the #pragma simd, so you may be able to get optimizations which are broken by -fp-model fast together with some which are broken by -fp-model source.

Which optimisations would be killed by fp-model fast?

Thanks

David

TimP · 01-23-2011

That type of occasional slower timing on a single threaded benchmark often means that the job was moving among caches. In principle, the scheduler should always resume a task on the same core from which it was suspended, whenever possible, but it will move if something has grabbed the preferred core. When you have linked with libiomp5 (openmp or parallel), KMP_AFFINITY works about the same as taskset or numactl. I have observed cases where this avoids those longer timings on short single thread runs.
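
The kind of pinning described here can also be requested from inside a process. As a rough sketch only (not what either poster ran): on Linux, Python exposes the kernel's affinity call as `os.sched_setaffinity`, the same system call that taskset relies on. The helper name `pin_to_cpu` and its default of picking the highest-numbered allowed CPU are assumptions made for illustration.

```python
import os


def pin_to_cpu(cpu=None):
    """Restrict the calling process to a single logical CPU (Linux only).

    If cpu is None, the highest-numbered CPU currently allowed is used;
    that default is an arbitrary choice for this sketch.
    """
    allowed = os.sched_getaffinity(0)  # 0 means "the calling process"
    if cpu is None:
        cpu = max(allowed)
    if cpu not in allowed:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, {cpu})
    return cpu


if __name__ == "__main__":
    chosen = pin_to_cpu()
    print("pinned to logical CPU", chosen)
    print("affinity mask is now", sorted(os.sched_getaffinity(0)))
```

Child processes started afterwards inherit the affinity mask, so a small launcher along these lines could wrap a timing run without changing how the benchmark itself is built.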

-fp-model fast enables vectorization of sum reductions, and some other less common reductions, which are disabled by standard-compliance options such as -fp-model source. The #pragma simd reduction, introduced in the current icc, optimizes sum and dot product reductions, and sometimes max/min reductions, overriding the fp-model setting, and even optimizes a few which are missed by -fp-model fast. Unfortunately, #pragma simd doesn't apply to inner_product() or accumulate(), so if those don't optimize, you may need to drop back to C code when you need the pragma.

If you have carefully selected an optimum order of evaluation for an expression (e.g. by parentheses), fast will ignore it. fast slows down certain public benchmarks by 30% or so, while speeding up others by a larger amount. Fortunately, a result from fast as bad as the one in this other active thread is rare.
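
The reason a strict floating-point mode blocks vectorized reductions is that splitting a sum across SIMD lanes changes the order of the additions, and floating-point addition is not associative. The following is a minimal sketch of that effect in Python double precision. The function names and the two-lane split are illustrative assumptions; this is not the code icc generates.

```python
def sequential_sum(values):
    """Add the values strictly left to right, as source order dictates."""
    total = 0.0
    for v in values:
        total += v
    return total


def lane_sum(values, lanes=2):
    """Mimic a vectorized reduction: one partial sum per lane, combined at the end."""
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    total = 0.0
    for p in partial:
        total += p
    return total


# The small terms are lost when added to 1e16 in source order,
# but survive when the lanes keep large and small values apart.
values = [1e16, 1.0, -1e16, 1.0]

print(sequential_sum(values))  # 1.0
print(lane_sum(values, 2))     # 2.0
```

Both answers are legitimate floating-point results for the same expression. A mode such as -fp-model source has to produce the first, while a fast mode or an explicit reduction pragma gives the compiler permission to produce the second.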

dehvidc1 · 01-23-2011

Using numactl to tie the process to a specific logical core gave much better reproducibility (times in s):

573
572
573
572
573

Now to see how it behaves when all physical cores are occupied (1 process per physical core) and when all logical cores are occupied (2 processes per physical core). I hope I don't have to tie memory as well.
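
To put a number on the difference in reproducibility, the two sets of timings reported in this thread can be summarised side by side. This is a small sketch; the list names and the `summarize` helper are assumptions made for the example, and the figures are copied from the posts.

```python
# Timings in seconds, copied from the posts in this thread.
unpinned = [664, 665, 557, 556, 663]  # one process per node, scheduler free to move it
pinned = [573, 572, 573, 572, 573]    # process tied to one logical core with numactl


def summarize(times):
    """Return mean, minimum, maximum and spread (max - min) of a list of timings."""
    return {
        "mean": sum(times) / len(times),
        "min": min(times),
        "max": max(times),
        "spread": max(times) - min(times),
    }


for label, times in (("unpinned", unpinned), ("pinned", pinned)):
    s = summarize(times)
    print(f"{label:9s} mean {s['mean']:6.1f} s  "
          f"min {s['min']} s  max {s['max']} s  spread {s['spread']} s")
```

The pinned runs are far tighter, although they land between the two unpinned clusters, not at the faster one.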

aazue · 01-24-2011

Hi,
About your remark:
"Using numactl to tie the process to a specific logical core gave much better reproducibility (times in s):"

On Linux I only use low-level programmatic thread control, not KMP_AFFINITY.
Maybe it is simply that when you run 1 thread per core you increase the probability of colliding with system threads (ksoftirqd), so repeated timings could be more random, I think.
Does an equivalent process exist on the Microsoft side (svchost)?
PowerShell's ps seems to show only CPU(s) for svchost, with the percentage blank?
I suspect that svchost, if it exists there, systematically uses affinity, but I am not really sure.
Regards

dehvidc1 · 01-24-2011

I've only done the numactl timings on Linux. I tied the process to a logical core on the 2nd CPU as I think - this is just from observation; no idea if this is general behaviour - the scheduler tends to put the OS process(es) on the first CPU.

Regards

David

aazue · 01-24-2011

Hi,
At the process (pid) level, I think you never have complete control of the processor under an operating system; you only get what the kernel has decided at that instant (semaphore). When you request a specific affinity, you incur a waiting latency if that core is busy, and several programs can be asking for the same specific affinity at the same time.
You probably already know that there are files in /proc/sys/kernel for changing the behaviour of the system dynamically (from the source), but that task is very complex.
I have written an analyzer program that is supposed to tune these parameters automatically for a specific machine, but for tasks whose run time is not constant I have not really found how to get good results. Probably it is only that I am too old now...
Regards