I just installed icc 11.1.072 on a dual 6-core Intel Xeon X5680 Linux system. My initial runs were disappointing: the code generated by the icc compiler ran slower than the code generated by gcc 4.3.4 on a slower dual quad-core Nehalem machine. My code is single-precision and FLOP-intensive, parallelized with pthreads, and uses SSE vector intrinsics. I don't have gcc numbers on this machine yet (I am installing gcc 4.5.0 as I type).
I am using the following flags when compiling with gcc:
Try replacing the -ipo flag with -fast (which implies ipo, among other optimizations). Also mention which Linux distribution you are using. Flush all caches (filesystem and VM) before each run, and use top to evaluate resource consumption while your program is running; compare the icc and GNU/GCC builds to see whether some parameters under /proc/sys/kernel are poorly adapted.
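A rough sketch of the cache-flushing and monitoring steps suggested above (writing to drop_caches requires root; the exact sysctl names worth checking will vary by kernel version):

```shell
# Flush the page cache plus dentries and inodes before each timed run
# (root required; this only reduces timing noise, it does not change results).
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches

# Watch per-core CPU consumption while the benchmark runs
# (press "1" inside top to toggle the per-CPU view).
top -d 1

# List kernel scheduler parameters that may differ between the two machines.
sysctl -a 2>/dev/null | grep '^kernel\.sched'
```

Running the same sequence before both the icc and gcc binaries keeps the comparison fair.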
As you didn't set a software prefetch option for gcc, you would not want one for icc. I'm not certain about the unroll options. When I find a loop which I want unrolled by 4, I precede it by #pragma unroll(4). icc often unrolls well enough by default to match the performance you get with those gcc options, but occasionally is improved by the pragma or by the -unroll4 option. According to the docs, -unroll-aggressive applies only to the case where a loop has a fixed count and may be unrolled completely.

The dual 6-core machine is more dependent on affinity than the dual 4-core, more so if HyperThreading is enabled. According to a recent post, you can engage the KMP_AFFINITY library which comes with icc by making an appropriate omp function call in a preliminary short parallel region, and have the thread placement respond to your KMP_AFFINITY setting. You would want to try both 1 and 2 threads per core, keeping threads which share memory as much as possible on the same CPU package. The default OpenMP setting of 24 threads with no affinity persistence is likely to disappoint.

It is possible that performance on the 6-core machine, when cutting back to the same number of threads you preferred on the 4-core machine, might be enhanced by setting affinity to cores 0,2,4,5 on each CPU package, in order to use the full DCU (L1 cache) bandwidth. A very few cases have been observed where the 6-core machine lost as much as 10% performance in comparison with the 4-core, but that is an unusual situation. More usual is the problem of increased sensitivity to optimization of affinity.
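Since the code uses pthreads rather than OpenMP, the affinity placement described above can also be done directly with pthread_setaffinity_np (a glibc extension). This is a minimal sketch; the core list {0, 2, 4, 5} mirrors the hypothetical placement suggested above, and the actual core ids you want depend on your machine's /proc/cpuinfo numbering. Core ids are wrapped to the online CPU count so the sketch also runs on smaller machines.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

/* Pin the calling thread to one core; returns 0 on success.
   The core id is wrapped to the number of online CPUs so the
   sketch works on machines with fewer cores. */
static int pin_to_core(int core) {
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core % (int)ncpus, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Hypothetical placement from the advice above: one thread per
   physical core 0, 2, 4, 5 on each package. Verify these ids
   against /proc/cpuinfo on your own system. */
static void *worker(void *arg) {
    static const int cores[] = {0, 2, 4, 5};
    pin_to_core(cores[(long)arg % 4]);
    /* ... the FLOP-intensive SSE kernel would run here ... */
    return arg;
}
```

A thread started with `pthread_create(&tid, NULL, worker, (void *)i)` then stays on its assigned core, which is what keeps threads sharing data on the same package and L1/L2 domain.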
Hi (I hope this addition helps with your initial request.)
I don't know if this is exactly your hardware, but I asked friends for information about the System x3550 M3 series: no problems under Linux with either the ICC or GNU compilers on that hardware (a very good machine). I have also used the same model for a cloud server, with very good performance. (But I never use the system as it ships from the factory; I always reinstall everything from scratch.)
This type of machine is not well served by default system parameters; you should read this link and the others shown in order to tune it correctly and find where it can do better.
You probably already understand your machine and your Intel processor model well... This type of machine is a dream for programming; the problem is just finding a customer with the money to buy one.
If the system parameters are poorly adapted, then all possible compiler flags will probably yield nothing or very little, whether with ICC, the GNU compiler, or any other.
I think there is only a small probability that you will obtain a significant improvement with one of the two compilers (especially on this subject: low-level pthread programming). The GNU compiler 4.5.0 is also a jewel for performance... Good luck...