We are currently migrating an HPC application from HP-UX (Intel IA-64) to the Intel C/C++/Fortran compilers on Linux RHEL4 (x86_64).
The whole migration worked quite well in terms of compiling and running the code.
Now the problem is the performance difference between HP-UX and Linux x86_64: the same code runs 2x faster under HP-UX than under Linux, and switching from -O2 to -O3, -fast, or -ax/-x didn't give us any benefit.
So the question is: are we missing anything general in terms of compiling the code, e.g. switches or libraries that should generally be used?
More info below.
Any ideas where and how we can proceed?
Did we miss something?
Regards and thanks in advance.
Information:
Application:
# ldd bin/x86_64/hpcprogram_linux-01
libifcoremt.so.5 => /opt/intel/fce/10.1.015/lib/libifcoremt.so.5 (0x0000002a95557000)
libifport.so.5 => /opt/intel/fce/10.1.015/lib/libifport.so.5 (0x0000002a9578b000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x000000325a700000)
libiomp5.so => /opt/intel/fce/10.1.015/lib/libiomp5.so (0x0000002a958d1000)
libm.so.6 => /lib64/tls/libm.so.6 (0x00000039bae00000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000000321af00000)
libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000003219000000)
libc.so.6 => /lib64/tls/libc.so.6 (0x0000003218500000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003d23e00000)
libimf.so => /opt/intel/fce/10.1.015/lib/libimf.so (0x0000002a95a4b000)
libintlc.so.5 => /opt/intel/fce/10.1.015/lib/libintlc.so.5 (0x0000002a95dae000)
/lib64/ld-linux-x86-64.so.2 (0x0000003218300000)
Compiler options:
FOPT2 = -O3 -ftz -w95 -c -p -parallel -prefetch -vec-guard-write -unroll-aggressive -xW -axW
CC = icc
CXX = icpc
CPP = icc
F90 = ifort
LINKER = icc
CFLAGS = -DAPM_HOST=CLYDE -Dmach_$(HOSTTYPE)
CXXFLAGS = -Dmach_$(HOSTTYPE)
CPPFLAGS = -E -Dmach_$(HOSTTYPE)
F90FLAGS = -cpp -module $(OBJ) -Dmach_$(HOSTTYPE) -assume nounderscore -threads -reentrancy threaded -fpic
LDFLAGS = /opt/intel/fce/10.1.015/lib/for_main.o -p -pthread -L/opt/intel/fce/10.1.015/lib -lifcoremt -lifport -m64 -lstdc++ -liomp5
System:
# rpm -qa | grep glibc
glibc-devel-2.3.4-2.36
glibc-2.3.4-2.36
compat-glibc-2.3.2-95.30
glibc-2.3.4-2.36
glibc-kernheaders-2.4-9.1.100.EL
glibc-devel-2.3.4-2.36
compat-glibc-headers-2.3.2-95.30
compat-glibc-2.3.2-95.30
glibc-common-2.3.4-2.36
glibc-headers-2.3.4-2.36
# uname -a
Linux node 2.6.9-55.0.2.ELsmp #1 SMP Tue Jun 12 17:58:20 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel Xeon CPU 2.66GHz
stepping : 8
cpu MHz : 2669.000
cache size : 1024 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm pni monitor ds_cpl est cid cx16 xtpr
bogomips : 5345.97
clflush size : 64
cache_alignment : 128
address sizes : 40 bits physical, 48 bits virtual
power management:
.. 7 more CPUs to come ..
If you are depending on -parallel, you should use the -par-report options to see whether your intended loops are parallelized. If not, you may be lucky in adjusting -par-threshold to gain an advantage. VTune profiling would also show whether you are getting any advantage from threading or vectorization.
Unless you need to run on a non-SSE2 platform, you should use -xW only, rather than -axW. If you have vectorizable complex, you should use -xP or newer option.
You should use -fpic only if required (for making .so). If you do require it, and you used export tables for HPUX, you should attempt the same here.
If you are using -prefetch in the hope of doing better than the hardware prefetch, you might try shutting off the latter. -prefetch is unlikely to improve on hardware prefetch, unless it is successful for indirection like x(i(:)).
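Putting that together, a revised compile line along these lines (the threshold value is only an illustration) would drop the redundant -axW and report which loops get auto-parallelized:
# ifort -O3 -ftz -xW -parallel -par-report3 -par-threshold80 -c src/smatvec.f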
Without more information, it's impossible to guess what the relative performance of your IA-64 and Xeon machines should be. I am guessing you have a quad-core Xeon 1333FSB, in which case many applications will not get much benefit from using 8 cores rather than 4. If your application uses double precision, pipelines well on IA-64 but doesn't vectorize on Xeon, parallelizes well on HP-UX, and is memory reference intensive, it is likely that the Xeon will be slower than a recent IA-64. For memory reference intensive applications, the Xeon quad-core 1600FSB may run 20% faster than the 1333, as indicated by the chipset rating.
- First, we use dual-core CPUs, not quad cores.
- Second, with -O0 we see Linux (Xeon, x86_64) being 1.5x faster than HP-UX (IA-64). That is also why we think it should be possible to make it faster. By contrast, with -O3 we see HP-UX (IA-64) being 2x faster than Linux (Xeon, x86_64).
Does that help?
Thanks in advance for your help.
That question about vectorization is, from my point of view, not very easy to answer. We have one Fortran file accounting for 73% of the program's runtime (according to gprof):
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
72.89 77.79 77.79 5433 0.01 0.01 smatvec
5.59 83.76 5.97 2037280 0.00 0.00 eigen
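For reference, the profile above comes from building with -p (already in our flags), running the program once, and then invoking gprof on the resulting gmon.out, roughly like this:
# gprof bin/x86_64/hpcprogram_linux-01 gmon.out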
The smatvec file vectorizes in four places, as follows:
ifort -cpp -Dmach_x86_64 -assume nounderscore -threads -reentrancy threaded -DNODEBUG -O3 -ftz -w95 -c -parallel -prefetch -vec-guard-write -unroll-aggressive -xW -axW -p -Isrc -c src/smatvec.f -o obj/opt_x86_64/smatvec.o
src/smatvec.f(24): (col. 2) remark: LOOP WAS VECTORIZED.
src/smatvec.f(44): (col. 2) remark: LOOP WAS VECTORIZED.
src/smatvec.f(49): (col. 5) remark: LOOP WAS VECTORIZED.
src/smatvec.f(55): (col. 5) remark: LOOP WAS VECTORIZED.
However, we also have more do loops like the following (the inner loop is the one reported at line 49):
48 do j=myid+1,nthreads-1
49 do i=max(iequal(myid+1), irange(1,j+1)),
50 * min(iequal(myid+2)-1,irange(2,j+1))
51 w(i)= w(i) + wrk3(j*isys + i)
52 end do
53 end do
Does that give you a better idea?
Thanks again very much for your help.
If most of the time is spent in loops like this, you should be getting a significant advantage from vectorization. If the loops are long enough, depending on the characteristics of your platform, you should get additional gain from threading the outer loop. Since you say you have a dual-core platform with 8 cores (implying 4 sockets), you should experiment with the number of threads to find the optimum, once you have assured that the important loops are threaded. Past 4-socket Xeon platforms may not have enough memory bandwidth to use 8 vectorized threads effectively, even if you have enough RAM. It's even possible for a dual-socket quad-core 1600MT/s FSB to out-perform a 4-socket dual core, depending on the application.
If it is threaded, it may be important to try the KMP_AFFINITY environment variable settings, particularly if you have a very old version of EL4. For example, if you set OMP_NUM_THREADS=4, you can place 1 thread per socket with KMP_AFFINITY=scatter, giving each thread an entire cache and memory bus. When trying 8 threads, you might want to investigate whether you have paired threads properly on sockets, and whether it matters. EL4 doesn't give you the facilities of current distros to investigate this.
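A minimal example, assuming a bash-like shell and the binary name from your ldd output:
# export OMP_NUM_THREADS=4
# export KMP_AFFINITY=scatter
# ./bin/x86_64/hpcprogram_linux-01
With 4 threads on 4 sockets, scatter places one thread per socket, so no two threads compete for the same cache or FSB.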
If your loops haven't been threaded automatically, and you don't find -par-threshold satisfactory, you may find OpenMP parallelization worthwhile.
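As a minimal sketch of what explicit OpenMP threading looks like (a generic dense matrix-vector product, not your actual smatvec code; all names and sizes here are made up), compiled with ifort -openmp -O3:

! Sketch only: OpenMP-threaded, vectorizable product y = transpose(A)*x.
program matvec_sketch
  implicit none
  integer, parameter :: n = 2000
  double precision, allocatable :: a(:,:), x(:), y(:)
  double precision :: tmp
  integer :: i, j
  allocate(a(n,n), x(n), y(n))
  a = 1.0d0
  x = 2.0d0
!$omp parallel do private(j, tmp)
  do i = 1, n                    ! outer loop: each thread takes a block of i
    tmp = 0.0d0
    do j = 1, n                  ! inner loop: unit stride in a (column-major),
      tmp = tmp + a(j,i)*x(j)    ! so the compiler can vectorize it
    end do
    y(i) = tmp
  end do
!$omp end parallel do
  print *, 'y(1) =', y(1)        ! expect 2*n = 4000.0
end program matvec_sketch

The pattern would be the same for your loops: thread the outer loop, and keep the inner loop unit-stride and free of dependences so it stays vectorized.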
Right now our evaluation license period is running out, so we cannot provide new data. I know it's off topic, but is there any way to get another 30 days of evaluation?