icc optimization flags

bordaw · ‎07-09-2010

I just installed icc 11.1.072 on a dual 6-core Intel Xeon X5680 Linux system. My initial runs were disappointing as the code generated by the icc compiler ran slower than the one generated by gcc 4.3.4 on a slower dual quad-core Nehalem machine. My code is a single-precision FLOP-intensive code, parallelized with pthreads, and uses SSE vector intrinsics. I dont have gcc numbers on this machine yet (I am installing gcc 4.5.0 as I type).

I am using the following flags for compiling using gcc

-O3 -std=c99 -msse4 -mtune=native -march=native -funroll-loops --param max-unroll-times=4 -ffast-math

On icc, I am using the following flags

-fast -xSSE4.2 -ipo -no-prec-div -static -opt-prefetch -unroll-aggressive -m64

What flags I need to use to improve the icc performance?

Thanks a lot for your help!

-regards,
Rajesh

aazue · ‎07-09-2010

Hi

Replace the flag ipo with: -fast (single ipo). inform the distribution is used.
Flush all caches (sys and wm)
use top to evaluate consumption when your program is running.
compare GNU /GCC ,resulting consumption. for show if some parameters /proc/sys/kernel
unadapted.

Regards

TimP · ‎07-09-2010

As you didn't set a software prefetch option for gcc, you would not want one for icc.
I'm not certain about the unroll options. When I find a loop which I want unrolled by 4, I precede it by
#pragma unroll(4).
icc often unrolls well enough by default to match the performance you get with those gcc options, but occasionally will be improved by the pragma or by the -unroll4 option.
According to the docs, -unroll-aggressive applies only to the case where a loop has a fixed count and may be unrolled completely.
The dual 6-core machine is more dependent on affinity than the dual 4-core, more so if HyperThreading is enabled.
According to a recent post, you can engage the KMP_AFFINITY library which comes with icc by making an appropriate omp function call in a preliminary short parallel region and have the thread placement respond to your KMP_AFFINITY setting. You would want to try both 1 and 2 threads per core, keeping threads which share memory as much as possible on the same CPU package. The default OpenMP setting of 24 threads with no affinity persistence is likely to disappoint.
It is possible that performance on the 6 core machine when cutting back to the same number of threads you preferred on the 4 core machine might be enhanced by setting affinity to cores 0,2,4,5 on each CPU package, in order to use the full DCU (L1 cache) bandwidth.
A very few cases have been observed where the 6 core machine lost as much as 10% performance in comparison with 4 core, but that is an unusual situation. More usual is the problem with increased sensitivity to optimization of affinity.

bordaw · ‎07-12-2010

Thanks!

Can I use the KMP_AFFINITY library with pthreads?

aazue · ‎07-12-2010

Hi
about:
Can I use the KMP_AFFINITY library with pthreads?

KMP_AFFINITY is not library
is variable environment for drive comportment
processors in tool OpenMp .

Regards

aazue · ‎07-13-2010

Hi (With hope that this add can help you better to your initial request)

I don't know if is exactly your hardware
I have request to friends information about series System x3550 M3
no problem (Linux) (very good machine) with ICC and GNU compiler this
hardware
Also I have already using same model for Cloud server,very good performance.
(But I never use the system that have machine origin ,always I remount all new ..)

This type machine is not conventional for default parameters system
require you read this link and other showed, for adjusted correctly and
find where can be better.

http://knol.google.com/k/linux-performance-tuning-and-measurement#

An good information, for driving easy shell side.
(Really Thank for this user that have wrote this link !..)

If you have time and the practice the better way is you drive directly in source
Use help learning with this file pdf. (pthreads)

http://www.google.fr/url?url=http://publib.boulder.ibm.com/iseries/v5r1/ic2924/info/apis/rzah4mst.pdf&rct=j&sa=X&ei=l7c7TJ3jFdD5OfLRzIoK&ved=0CBwQzgQoADAA&q=pthread+ibm+boulder&usg=AFQjCNFf1HfE2agMDTpsaKOo4KT8D4Lf4w

Probably you understand how many your machine and your Intel processors model are well...
This type machine is an dream for programming..problem is just to find customer
that have the money for buy.

If parameter system are unadapted , all flag compiler possible probably
result blank or very less, with ICC or GNU compiler or other...

I think small probability that you obtain an difference significant improved with
one of two compiler (precisely subject : programing pthread side bottom level)
GNU compiler 4.5.0 is also an jewel for to performance ...
Good luck...

Regards