We are currently migrating an HPC application from HP-UX (Intel IA-64) to the Intel C/C++/Fortran compilers on Linux RHEL4 (x86_64).
The whole migration worked quite well in terms of compiling and running the code.
Now the problem is the performance difference between HP-UX and Linux x86_64: the same code runs 2x faster under HP-UX than under Linux, and switching from -O2 to -O3, -fast, or -ax/-x didn't give us any benefit.
So the question is: are we missing anything general in terms of compiling the code, e.g. switches or libraries that should generally be used?
More info below.
Any ideas where and how we can proceed?
Did we miss something?
Regards and thanks in advance.
Information:
Application:
# ldd bin/x86_64/hpcprogram_linux-01
libifcoremt.so.5 => /opt/intel/fce/10.1.015/lib/libifcoremt.so.5 (0x0000002a95557000)
libifport.so.5 => /opt/intel/fce/10.1.015/lib/libifport.so.5 (0x0000002a9578b000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x000000325a700000)
libiomp5.so => /opt/intel/fce/10.1.015/lib/libiomp5.so (0x0000002a958d1000)
libm.so.6 => /lib64/tls/libm.so.6 (0x00000039bae00000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x000000321af00000)
libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x0000003219000000)
libc.so.6 => /lib64/tls/libc.so.6 (0x0000003218500000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003d23e00000)
libimf.so => /opt/intel/fce/10.1.015/lib/libimf.so (0x0000002a95a4b000)
libintlc.so.5 => /opt/intel/fce/10.1.015/lib/libintlc.so.5 (0x0000002a95dae000)
/lib64/ld-linux-x86-64.so.2 (0x0000003218300000)
Compiler options:
FOPT2 = -O3 -ftz -w95 -c -p -parallel -prefetch -vec-guard-write -unroll-aggressive -xW -axW
CC = icc
CXX = icpc
CPP = icc
F90 = ifort
LINKER = icc
CFLAGS = -DAPM_HOST=CLYDE -Dmach_$(HOSTTYPE)
CXXFLAGS = -Dmach_$(HOSTTYPE)
CPPFLAGS = -E -Dmach_$(HOSTTYPE)
F90FLAGS = -cpp -module $(OBJ) -Dmach_$(HOSTTYPE) -assume nounderscore -threads -reentrancy threaded -fpic
LDFLAGS = /opt/intel/fce/10.1.015/lib/for_main.o -p -pthread -L/opt/intel/fce/10.1.015/lib -lifcoremt -lifport -m64 -lstdc++ -liomp5
System:
# rpm -qa | grep glibc
glibc-devel-2.3.4-2.36
glibc-2.3.4-2.36
compat-glibc-2.3.2-95.30
glibc-2.3.4-2.36
glibc-kernheaders-2.4-9.1.100.EL
glibc-devel-2.3.4-2.36
compat-glibc-headers-2.3.2-95.30
compat-glibc-2.3.2-95.30
glibc-common-2.3.4-2.36
glibc-headers-2.3.4-2.36
# uname -a
Linux node 2.6.9-55.0.2.ELsmp #1 SMP Tue Jun 12 17:58:20 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel Xeon CPU 2.66GHz
stepping : 8
cpu MHz : 2669.000
cache size : 1024 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm pni monitor ds_cpl est cid cx16 xtpr
bogomips : 5345.97
clflush size : 64
cache_alignment : 128
address sizes : 40 bits physical, 48 bits virtual
power management:
.. 7 more CPUs to come ..
If you are depending on -parallel, you should use the -par-report options to see whether your intended loops are parallelized. If not, you may be lucky in adjusting -par-threshold to gain an advantage. VTune profiling would also show whether you are getting any advantage from threading or vectorization.
Unless you need to run on a non-SSE2 platform, you should use -xW only, rather than -axW. If you have vectorizable complex, you should use -xP or newer option.
You should use -fpic only if required (for making .so). If you do require it, and you used export tables for HPUX, you should attempt the same here.
If you are using -prefetch in the hope of doing better than the hardware prefetch, you might try shutting off the latter. -prefetch is unlikely to improve on hardware prefetch, unless it is successful for indirection like x(i(:)).
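Putting that together, a revised compile line along these lines (the threshold value is only an illustration) would drop the redundant -axW and report which loops get auto-parallelized:
# ifort -O3 -ftz -xW -parallel -par-report3 -par-threshold80 -c src/smatvec.f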
Without more information, it's impossible to guess what the relative performance of your IA-64 and Xeon machines should be. I am guessing you have a quad-core Xeon 1333FSB, in which case many applications will not get much benefit from using 8 cores rather than 4. If your application uses double precision, pipelines well on IA-64 but doesn't vectorize on Xeon, parallelizes well on HP-UX, and is memory reference intensive, it is likely that the Xeon will be slower than a recent IA-64. For memory reference intensive applications, the Xeon quad-core 1600FSB may run 20% faster than the 1333, as indicated by the chipset rating.
- First, we use dual-core CPUs, not quad cores.
- Second, with -O0 we see Linux (Xeon, x86_64) being 1.5x faster than HP-UX (IA-64). That is also why we think it should be possible to make it faster. By contrast, with -O3 we see HP-UX (IA-64) being 2x faster than Linux (Xeon, x86_64).
Does that help?
Thanks in advance for your help.
That question about vectorization is, from my point of view, not very easy to answer. We have one Fortran file accounting for 73% of the program's runtime (according to gprof):
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
72.89 77.79 77.79 5433 0.01 0.01 smatvec
5.59 83.76 5.97 2037280 0.00 0.00 eigen
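For reference, the profile above comes from building with -p (already in our flags), running the program once, and then invoking gprof on the resulting gmon.out, roughly like this:
# gprof bin/x86_64/hpcprogram_linux-01 gmon.out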
The smatvec file vectorizes in four places, as follows:
ifort -cpp -Dmach_x86_64 -assume nounderscore -threads -reentrancy threaded -DNODEBUG -O3 -ftz -w95 -c -parallel -prefetch -vec-guard-write -unroll-aggressive -xW -axW -p -Isrc -c src/smatvec.f -o obj/opt_x86_64/smatvec.o
src/smatvec.f(24): (col. 2) remark: LOOP WAS VECTORIZED.
src/smatvec.f(44): (col. 2) remark: LOOP WAS VECTORIZED.
src/smatvec.f(49): (col. 5) remark: LOOP WAS VECTORIZED.
src/smatvec.f(55): (col. 5) remark: LOOP WAS VECTORIZED.
However, we also have more do loops like the following (the inner loop is the one reported at line 49):
48 do j=myid+1,nthreads-1
49 do i=max(iequal(myid+1), irange(1,j+1)),
50 * min(iequal(myid+2)-1,irange(2,j+1))
51 w(i)= w(i) + wrk3(j*isys + i)
52 end do
53 end do
Does that give you a better idea?
Thanks again very much for your help.
If most of the time is spent in loops like this, you should be getting a significant advantage from vectorization. If the loops are long enough, depending on the characteristics of your platform, you should get additional gain from threading the outer loop. Since you say you have a dual-core platform with 8 cores (implying 4 sockets), you should experiment with the number of threads to find the optimum, once you have assured that the important loops are threaded. Past 4-socket Xeon platforms may not have enough memory bandwidth to use 8 vectorized threads effectively, even if you have enough RAM. It's even possible for a dual-socket quad-core 1600MT/s FSB to out-perform a 4-socket dual core, depending on the application.
If it is threaded, it may be important to try the KMP_AFFINITY environment variable settings, particularly if you have a very old version of EL4. For example, if you set OMP_NUM_THREADS=4, you can place 1 thread per socket with KMP_AFFINITY=scatter, giving each thread an entire cache and memory bus. When trying 8 threads, you might want to investigate whether you have paired threads properly on sockets, and whether it matters. EL4 doesn't give you the facilities of current distros to investigate this.
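A minimal example, assuming a bash-like shell and the binary name from your ldd output:
# export OMP_NUM_THREADS=4
# export KMP_AFFINITY=scatter
# ./bin/x86_64/hpcprogram_linux-01
With 4 threads on 4 sockets, scatter places one thread per socket, so no two threads compete for the same cache or FSB.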
If your loops haven't been threaded automatically, and you don't find -par-threshold satisfactory, you may find OpenMP parallelization worthwhile.
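As a minimal sketch of what explicit OpenMP threading looks like (a generic dense matrix-vector product, not your actual smatvec code; all names and sizes here are made up), compiled with ifort -openmp -O3:

! Sketch only: OpenMP-threaded, vectorizable product y = transpose(A)*x.
program matvec_sketch
  implicit none
  integer, parameter :: n = 2000
  double precision, allocatable :: a(:,:), x(:), y(:)
  double precision :: tmp
  integer :: i, j
  allocate(a(n,n), x(n), y(n))
  a = 1.0d0
  x = 2.0d0
!$omp parallel do private(j, tmp)
  do i = 1, n                    ! outer loop: each thread takes a block of i
    tmp = 0.0d0
    do j = 1, n                  ! inner loop: unit stride in a (column-major),
      tmp = tmp + a(j,i)*x(j)    ! so the compiler can vectorize it
    end do
    y(i) = tmp
  end do
!$omp end parallel do
  print *, 'y(1) =', y(1)        ! expect 2*n = 4000.0
end program matvec_sketch

The pattern would be the same for your loops: thread the outer loop, and keep the inner loop unit-stride and free of dependences so it stays vectorized.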
Right now our evaluation license period is running out, so we cannot provide new data. I know it's off topic, but is there any way to get another 30 days of evaluation?