running time

princess_sophie · ‎11-26-2009

Hello,
I am working on two different Mac machines, and the same code takes two to three times longer to run on the newer machine, which also has a newer version of the compiler!
In both I use the same instructions. The complete (and very long, sorry) options are:
macartney:kappa debora$ ifort -Wl,-stack_size,0x10000000 -O1 -shared-intel -o nameoftheprogram.f -L/Library/Frameworks/Intel_MKL.framework/Libraries/em64t -I/Library/Frameworks/Intel_MKL.framework/Headers /Library/Frameworks/Intel_MKL.framework/Libraries/em64t/libmkl_intel_lp64.a /Library/Frameworks/Intel_MKL.framework/Libraries/em64t/libmkl_intel_thread.a /Library/Frameworks/Intel_MKL.framework/Libraries/em64t/libmkl_core.a /Library/Frameworks/Intel_MKL.framework/Libraries/em64t/libmkl_intel_lp64.a /Library/Frameworks/Intel_MKL.framework/Libraries/em64t/libmkl_intel_thread.a /Library/Frameworks/Intel_MKL.framework/Libraries/em64t/libmkl_core.a -lguide -lpthread

When I use them on an "old" Mac Pro (Mac OS X version 10.5.6, processor 2x3 GHz Quad-Core Intel Xeon, memory 16 GB) with Intel Fortran version 10.5.8, my program runs in a reasonable time.
But if I use them on our new-generation Mac Pro (Mac OS X version 10.5.8, processor 2x2.93 GHz, memory 32 GB) with compiler version 11.1.058, the same program runs about three times slower.
Could you please help me to optimize the run time in the second machine?
Thanx!

TimP · ‎11-26-2009

It looks like you are comparing a uniform-memory dual-socket machine with a recent non-uniform memory (NUMA) machine. The default BIOS setting on the latter usually interleaves memory accesses between local and remote memory. To optimize, you would select the NUMA BIOS setting and pay more attention to avoiding false sharing and to thread affinity. You may also need to pay attention to "first-touch" allocation: the first time your program initializes its data arrays, it should do so with the same OpenMP schedule and affinity used for the bulk of the later accesses.

princess_sophie · ‎11-26-2009

Dear tim18,
Thank you for your fast response. However, I am a new user of these machines and operating systems, and I do not have a computer science background, so could you please explain your answer or give more details? Sorry for bothering you; I am just trying to get the results of my code much faster.

jimdempseyatthecove · ‎11-26-2009

Princess,

Can you provide the processor numbers for the two machines? Also include memory information (number of sticks, speed, etc.).

NUMA configuration considerations would not alter performance by 2x to 3x.
Cache size and type, combined with your application's memory usage, could produce a difference of that size.

Jim Dempsey
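
One possible way to collect the details requested above from a Terminal window is sketched below. The sysctl key machdep.cpu.brand_string and the system_profiler data types SPHardwareDataType and SPMemoryDataType are assumptions about what this Mac OS X release provides; the function name report_hardware is made up for illustration.

```shell
# Sketch: print the details requested above. The sysctl key and the
# system_profiler data types are assumptions about this Mac OS X release.
report_hardware() {
    # CPU brand string, which normally includes the processor number
    sysctl -n machdep.cpu.brand_string 2>/dev/null \
        || echo "CPU brand string not available on this system"

    # Hardware overview and per-DIMM memory details (Mac OS X only)
    if command -v system_profiler >/dev/null 2>&1; then
        system_profiler SPHardwareDataType SPMemoryDataType
    else
        echo "system_profiler not found (not Mac OS X?)"
    fi
}

report_hardware
```

Running this on both machines and pasting the two outputs side by side should cover the processor and memory questions.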

TimP · ‎11-26-2009

If the bulk of your time is spent in MKL, it may be sufficient to set the BIOS NUMA mode, and set the KMP_AFFINITY environment variable. As MKL doesn't normally use HyperThreads, the process may be simplified by disabling HT in the BIOS. It's made more complicated by the lack of a uniform scheme for BIOS numbering of cores and hyperthreads, so, unfortunately, it does involve investigation on your part. If you would set KMP_AFFINITY=compact,0,verbose and show us the resulting screen echo, we could tell you if that appears to be working.
Also, if it is primarily a concern with MKL performance, the MKL forum would be a good resource.
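
As a rough sketch of the KMP_AFFINITY step, assuming a bash-like shell in Terminal, the variable is set in the same shell session that launches the program. The executable name and the log file name below are hypothetical, and whether the OpenMP runtime on Mac OS X acts on the setting is not confirmed in this thread; the verbose report should show that either way.

```shell
# Ask the Intel OpenMP runtime to place threads on adjacent cores and to
# report the binding it chose; the report appears when the program starts.
export KMP_AFFINITY=compact,0,verbose
echo "KMP_AFFINITY=$KMP_AFFINITY"

# Run the program from this same shell so that it inherits the setting.
# Hypothetical executable and log names; uncomment and adjust:
# ./nameoftheprogram 2>&1 | tee affinity_report.txt
```

The first lines of the program's output (or of the saved log) are the "screen echo" to post.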

princess_sophie · ‎12-01-2009

Hello Jim,
Thank you a lot for your help. I do not have any idea what NUMA is, so I'd better first give you some details about the machines; hopefully this information will be enough to work out how to optimize the program's execution.
For the "faster" computer I have the following information:
Model Name: Mac Pro
Model Identifier: MacPro2,1
Processor Name: Quad-Core Intel Xeon
Processor Speed: 3 GHz
Number Of Processors: 2
Total Number Of Cores: 8
L2 Cache (per processor): 8 MB
Memory: 16 GB
Bus Speed: 1.33 GHz
Boot ROM Version: MP21.007F.B06
Memory:
Four of these:
DIMM Riser A/DIMM 1:

Size: 2 GB
Type: DDR2 FB-DIMM
Speed: 667 MHz
Status: OK

And four of these:
DIMM Riser B/DIMM 1:

Size: 2 GB
Type: DDR2 FB-DIMM
Speed: 667 MHz
Status: OK

Whereas for the "slow" machine I found:

Model Name: Mac Pro
Model Identifier: MacPro4,1
Processor Name: Quad-Core Intel Xeon
Processor Speed: 2.93 GHz
Number Of Processors: 2
Total Number Of Cores: 8
L2 Cache (per core): 256 KB
L3 Cache (per processor): 8 MB
Memory: 32 GB
Processor Interconnect Speed: 6.4 GT/s
Boot ROM Version: MP41.0081.B04
SMC Version (system): 1.39f5
SMC Version (processor tray): 1.39f5

With 8 memory slots like this:
Memory Slots:

ECC: Enabled

DIMM 1:

Size: 4 GB
Type: DDR3 ECC
Speed: 1066 MHz
Status: OK

Is this enough information to determine why one is 3 times slower? And would it be possible to help me choose the best compilation parameters for each computer?
Thank you all!

Ron_Green · ‎12-01-2009

You have two variables, each with two values: 2 machines and 2 compilers. Compile the code with -i-static on each machine. Copy the executable from each machine to the other machine. Run both executables on each machine. This will determine whether the compiler or the computer is responsible.

ron
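
The four runs could be scripted along these lines. This is only a sketch: the file names prog_ifort10 and prog_ifort11 are hypothetical stand-ins for the two statically linked executables, the function name time_builds is made up, and the timing uses whole seconds from date, which should be adequate for a long-running program.

```shell
# Sketch: time each build on the current machine. Run it once per machine.
time_builds() {
    for exe in "$@"; do
        if [ -x "$exe" ]; then
            echo "== $exe on $(uname -n) =="
            start=$(date +%s)
            "$exe" || echo "$exe exited with a nonzero status"
            end=$(date +%s)
            echo "elapsed: $((end - start)) s"
        else
            echo "skipping $exe (not found or not executable)"
        fi
    done
}

# Hypothetical file names for the two statically linked executables
time_builds ./prog_ifort10 ./prog_ifort11
```

If both executables are slow on the new machine, the hardware or its configuration is the likely cause; if only the executable built with the newer compiler is slow on both machines, the compiler or its options are the place to look.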