Intel® Fortran Compiler

Code performance different on Windows and Linux

joey_hylton
Beginner
690 Views
I have two similar serial codes A and B. The main difference is that I use some vector operations (such as vec1*vec2, sum(vec3)) in B to replace some large do loops in A.
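For illustration, the change is roughly of this form (the array names, sizes and loop body below are made up just to show the pattern, not my actual code):

program loop_vs_array
  implicit none
  integer, parameter :: n = 100000
  real(8) :: vec1(n), vec2(n), vec3(n), total
  integer :: i
  call random_number(vec1)
  call random_number(vec2)
  ! style used in A: explicit do loop
  total = 0.0d0
  do i = 1, n
     vec3(i) = vec1(i) * vec2(i)
     total   = total + vec3(i)
  end do
  ! style used in B: equivalent array operations
  vec3  = vec1 * vec2
  total = sum(vec3)
  write(*,*) total
end program loop_vs_array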
When I compile and run the codes on Windows, B is a little bit faster than A.
However, when I compile and run them on a Linux machine, B is about 10--20 times slower than A.
Could someone please hazard a wild guess at the possible reason? Thanks.
Here is some information:
Windows Vista 64-bit with Intel Xeon CPU E5530 @ 2.27 GHz (2 processors)
Intel Fortran 11.1.048, and the compile flags are
/nologo /heap-arrays:1000 /extend_source:132 /fpe:0 /Qfp-speculation=safe /module:"x64\Release\\" /object:"x64\Release\\" /libs:static /threads /c
Linux system with Intel Xeon CPU E5410 @ 2.33 GHz
Intel Fortran 11.0.083:
make FFLAGS="-warn general -132 -heap-arrays 1000 -O2" \
CFLAGS=" -O2" \
CXXFLAGS=" -O2" \
LFLAGS="-static " \
FULLLIBS="-L/opt/intel/Compiler/11.0/083/mkl/lib/em64t /opt/intel/Compiler/11.0/083/mkl/lib/em64t/libmkl_solver_lp64_sequential.a -Wl,--start-group /opt/intel/Compiler/11.0/083/mkl/lib/em64t/libmkl_intel_lp64.a /opt/intel/Compiler/11.0/083/mkl/lib/em64t/libmkl_sequential.a /opt/intel/Compiler/11.0/083/mkl/lib/em64t/libmkl_core.a -Wl,--end-group /opt/intel/Compiler/11.0/083/mkl/lib/em64t/libmkl_intel_thread.a -limf -lm " \
ARFLAGS=" -ruv" \
OS="lnxem64" DF=lnxem64 \
CMD="SERIAL" pixelinv
make[1]: Entering directory `SERIAL'
mpif90 -warn general -132 -heap-arrays 1000 -O2 -static -c M.F
mpif90 -warn general -132 -heap-arrays 1000 -O2 -static -c S.F90
S.F90(922): (col. 10) remark: LOOP WAS VECTORIZED.
S.F90(926): (col. 10) remark: LOOP WAS VECTORIZED.
0 Kudos
3 Replies
Martyn_C_Intel
Employee
690 Views

If your two codes use a lot of memory and have different memory access patterns (e.g., if the second involves a lot of temporary array copies), you might be sensitive to differences between the two systems: the Linux one might have less memory and be paging more, and it has a much lower memory bandwidth than the Windows system.
You are using different compilers on the two systems. I would compare the vectorization reports carefully between the two, especially for B, and see whether any important loops are vectorized in one case but not the other (-vec-report2 or /Qvec-report2). You might also turn on the report for higher-level loop optimizations with -opt-report-phase hlo (/Qopt-report-phase hlo) and compare. Perhaps the intrinsic or Fortran 90 array notation is not being optimized as effectively by the older compiler.
This is a big factor, though...
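For example, something along these lines for the file from your build log (the exact spelling of the report options may differ slightly between 11.0 and 11.1):

ifort -c -O2 -vec-report2 -opt-report -opt-report-phase hlo S.F90      (Linux)
ifort /c /O2 /Qvec-report2 /Qopt-report /Qopt-report-phase:hlo S.F90   (Windows)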

0 Kudos
jimdempseyatthecove
Honored Contributor III
690 Views
Maybe you are looking in the wrong place.

Does the program have an initialization phase where it is reading in data and/or performing a large number of allocations?

Does the program have a termination phase where it is writing out data and/or performing a large number of deallocations?

If so, or if you don't know, try inserting timer code where you obtain the time _after_ initialization, and then again _before_ the termination phase. IOW, time just the computational part of the application.
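Something like this (compute_part is just a placeholder for your computational kernel; system_clock gives wall-clock time, which will also capture any paging):

program time_compute
  implicit none
  integer :: c0, c1, crate
  ! ... read input / allocate / initialize here ...
  call system_clock(c0, crate)   ! start timing after initialization is complete
  call compute_part()            ! placeholder for the computational part
  call system_clock(c1)          ! stop timing before output / deallocation begin
  write(*,*) 'compute wall time (s):', real(c1 - c0) / real(crate)
contains
  subroutine compute_part()
    ! the actual computation goes here
  end subroutine compute_part
end program time_compute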

10-20 times slower for B relative to A on the same machine (but a different platform) cannot be accounted for by the platform UNLESS one system experiences better cache hits for B relative to A than the other system does.

E5530: 4 cores w/HT, L3 1x8MB, L2 4x256KB
E5410: 4 cores wo/HT, L2 2x6MB

The caches are significantly different. If you can exclude the time to initialize and shut down the program and still see the A/B relative performance difference, then it might be good to run VTune or another profiler capable of looking at cache hit/miss data.

Jim Dempsey
0 Kudos
joey_hylton
Beginner
690 Views
Martyn and Jim, thanks for your replies.
As you mentioned, memory may be the reason. The times I compared in my original post cover only the massive computation part. However, codes A and B use different schemes. The basic math is the matrix-vector product S(n) = J(m,n)^T J(m,n) V(n). For code A, I just use two double loops:
! first pass: jv = J * v
do i = 1, m
   do j = 1, n
      jv(i) = jv(i) + J(i,j) * v(j)
   end do
end do
! second pass: s = J^T * jv
do j = 1, n
   do i = 1, m
      s(j) = s(j) + J(i,j) * jv(i)
   end do
end do
Since m and n are two large numbers, the memory usage is very large. So I changed the calculation by using the relationship J(m,n) = F(n,p) x H(n,q) with m = p x q, so that S(n) = H x F * F x H * V, and the memory saving is large. In the multiplications, I always use something like Hv(1:n) = H(1:n,i) * V(1:n) inside a loop over i.
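The pattern is roughly as follows (simplified and with made-up names; not the exact code):

subroutine apply_columns(n, q, H, V, s)
  implicit none
  integer, intent(in)  :: n, q
  real(8), intent(in)  :: H(n,q), V(n)
  real(8), intent(out) :: s(q)
  real(8) :: Hv(n)
  integer :: i
  do i = 1, q
     Hv(1:n) = H(1:n,i) * V(1:n)   ! elementwise product with column i
     s(i)    = sum(Hv)             ! reduce; same result as dot_product(H(:,i), V)
  end do
end subroutine apply_columns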
In my testing, I have two cases:
Case 1: A uses 3.7 GB and B uses 700 MB. The Windows machine has 16 GB of memory and the Linux machine has 64 GB.
Case 2: A uses 40.5 GB and B uses 2.5 GB. Both the Windows and Linux machines have 64 GB of memory.
Both cases show the same performance differences that I mentioned in my original post.
Could you help me analyse this further? Thanks a lot.
0 Kudos