Intel® Fortran Compiler

Code performance different on Windows and Linux

joey_hylton
Beginner
690 Views
I have two similar serial codes A and B. The main difference is that I use some vector operations (such as vec1*vec2, sum(vec3)) in B to replace some large do loops in A.
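For illustration, the change is roughly of this form (the array names, sizes and loop body below are made up just to show the pattern, not my actual code):

program loop_vs_array
  implicit none
  integer, parameter :: n = 100000
  real(8) :: vec1(n), vec2(n), vec3(n), total
  integer :: i
  call random_number(vec1)
  call random_number(vec2)
  ! style used in A: explicit do loop
  total = 0.0d0
  do i = 1, n
     vec3(i) = vec1(i) * vec2(i)
     total   = total + vec3(i)
  end do
  ! style used in B: equivalent array operations
  vec3  = vec1 * vec2
  total = sum(vec3)
  write(*,*) total
end program loop_vs_array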
When I compile and run the codes on Windows, B is a little bit faster than A.
However, when I compile and run them on a Linux machine, B is about 10--20 times slower than A.
Could someone please hazard a wild guess at the possible reason? Thanks.
Here is some information:
Windows Vista 64-bit with Intel Xeon CPU E5530 @ 2.27 GHz (2 processors)
Intel Fortran 11.1.048, and the compile flags are
/nologo /heap-arrays:1000 /extend_source:132 /fpe:0 /Qfp-speculation=safe /module:"x64\Release\\" /object:"x64\Release\\" /libs:static /threads /c
Linux system with Intel Xeon CPU E5410 @ 2.33 GHz
Intel Fortran 11.0.083:
make FFLAGS="-warn general -132 -heap-arrays 1000 -O2" \
CFLAGS=" -O2" \
CXXFLAGS=" -O2" \
LFLAGS="-static " \
FULLLIBS="-L/opt/intel/Compiler/11.0/083/mkl/lib/em64t /opt/intel/Compiler/11.0/083/mkl/lib/em64t/libmkl_solver_lp64_sequential.a -Wl,--start-group /opt/intel/Compiler/11.0/083/mkl/lib/em64t/libmkl_intel_lp64.a /opt/intel/Compiler/11.0/083/mkl/lib/em64t/libmkl_sequential.a /opt/intel/Compiler/11.0/083/mkl/lib/em64t/libmkl_core.a -Wl,--end-group /opt/intel/Compiler/11.0/083/mkl/lib/em64t/libmkl_intel_thread.a -limf -lm " \
ARFLAGS=" -ruv" \
OS="lnxem64" DF=lnxem64 \
CMD="SERIAL" pixelinv
make[1]: Entering directory `SERIAL'
mpif90 -warn general -132 -heap-arrays 1000 -O2 -static -c M.F
mpif90 -warn general -132 -heap-arrays 1000 -O2 -static -c S.F90
S.F90(922): (col. 10) remark: LOOP WAS VECTORIZED.
S.F90(926): (col. 10) remark: LOOP WAS VECTORIZED.
0 Kudos
3 Replies
Martyn_C_Intel
Employee
690 Views

If your two codes use a lot of memory and have different memory access patterns (e.g., if the second involves a lot of temporary array copies), you might be sensitive to differences between the two systems: the Linux one might have less memory and be paging more, and it has a much lower memory bandwidth than the Windows system.
You are using different compilers on the two systems. I would compare the vectorization reports carefully between the two, especially for B, and see whether any important loops are vectorized in one case but not the other (-vec-report2 or /Qvec-report2). You might also turn on the report for higher-level loop optimizations with -opt-report-phase hlo (/Qopt-report-phase hlo) and compare. Perhaps the intrinsic or Fortran 90 array notation is not being optimized as effectively by the older compiler.
This is a big factor, though...
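For example, something along these lines for the file from your build log (the exact spelling of the report options may differ slightly between 11.0 and 11.1):

ifort -c -O2 -vec-report2 -opt-report -opt-report-phase hlo S.F90      (Linux)
ifort /c /O2 /Qvec-report2 /Qopt-report /Qopt-report-phase:hlo S.F90   (Windows)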

0 Kudos
jimdempseyatthecove
Honored Contributor III
690 Views
Maybe you are looking in the wrong place.

Does the program have an initialization phase where it is reading in data and/or performing a large number of allocations?

Does the program have a termination phase where it is writing out data and/or performing a large number of deallocations?

If so, or if you don't know, try inserting timer code where you obtain the time _after_ initialization, and then again _before_ the termination phase. IOW, time just the computational part of the application.
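Something like this (compute_part is just a placeholder for your computational kernel; system_clock gives wall-clock time, which will also capture any paging):

program time_compute
  implicit none
  integer :: c0, c1, crate
  ! ... read input / allocate / initialize here ...
  call system_clock(c0, crate)   ! start timing after initialization is complete
  call compute_part()            ! placeholder for the computational part
  call system_clock(c1)          ! stop timing before output / deallocation begin
  write(*,*) 'compute wall time (s):', real(c1 - c0) / real(crate)
contains
  subroutine compute_part()
    ! the actual computation goes here
  end subroutine compute_part
end program time_compute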

10-20 times slower for B relative to A on the same machine (but a different platform) cannot be accounted for by the platform UNLESS one system experiences better cache hits for B relative to A than the other system does.

E5530: 4 cores w/HT, L3 1x8MB, L2 4x256KB
E5410: 4 cores wo/HT, L2 2x6MB

The caches are significantly different. If you can exclude the time to initialize and shut down the program and still see the A/B relative performance difference, then it might be good to run VTune or another profiler capable of looking at cache hit/miss data.

Jim Dempsey
0 Kudos
joey_hylton
Beginner
690 Views
Martyn and Jim, thanks for your replies.
As you mentioned, memory may be the reason. The times I compared in my original post cover only the massive computation part. However, codes A and B use different schemes. The basic math is the matrix-vector product S(n) = J(m,n)^T J(m,n) V(n). For code A, I just use two double loops:
! first pass: jv = J * v
do i = 1, m
   do j = 1, n
      jv(i) = jv(i) + J(i,j) * v(j)
   end do
end do
! second pass: s = J^T * jv
do j = 1, n
   do i = 1, m
      s(j) = s(j) + J(i,j) * jv(i)
   end do
end do
Since m and n are two large numbers, the memory usage is very large. So I changed the calculation by using the relationship J(m,n) = F(n,p) x H(n,q) with m = p x q, so that S(n) = H x F * F x H * V, and the memory saving is large. In the multiplications, I always use something like Hv(1:n) = H(1:n,i) * V(1:n) inside a loop over i.
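The pattern is roughly as follows (simplified and with made-up names; not the exact code):

subroutine apply_columns(n, q, H, V, s)
  implicit none
  integer, intent(in)  :: n, q
  real(8), intent(in)  :: H(n,q), V(n)
  real(8), intent(out) :: s(q)
  real(8) :: Hv(n)
  integer :: i
  do i = 1, q
     Hv(1:n) = H(1:n,i) * V(1:n)   ! elementwise product with column i
     s(i)    = sum(Hv)             ! reduce; same result as dot_product(H(:,i), V)
  end do
end subroutine apply_columns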
In my testing, I have two cases:
Case 1: A uses 3.7 GB and B uses 700 MB. The Windows machine has 16 GB of memory and the Linux machine has 64 GB.
Case 2: A uses 40.5 GB and B uses 2.5 GB. Both the Windows and Linux machines have 64 GB of memory.
Both cases show the same performance differences that I mentioned in my original post.
Could you help me analyse this further? Thanks a lot.
0 Kudos