Hello,
I am working on two different Mac machines, and the same code takes two to three times longer to run on the newer machine with the newer version of the compiler!
On both machines I use the same command. The complete (and very long, sorry) set of options is:
macartney:kappa debora$ ifort -Wl,-stack_size,0x10000000 -O1 -shared-intel -o nameoftheprogram.f \
    -L/Library/Frameworks/Intel_MKL.framework/Libraries/em64t \
    -I/Library/Frameworks/Intel_MKL.framework/Headers \
    /Library/Frameworks/Intel_MKL.framework/Libraries/em64t/libmkl_intel_lp64.a \
    /Library/Frameworks/Intel_MKL.framework/Libraries/em64t/libmkl_intel_thread.a \
    /Library/Frameworks/Intel_MKL.framework/Libraries/em64t/libmkl_core.a \
    /Library/Frameworks/Intel_MKL.framework/Libraries/em64t/libmkl_intel_lp64.a \
    /Library/Frameworks/Intel_MKL.framework/Libraries/em64t/libmkl_intel_thread.a \
    /Library/Frameworks/Intel_MKL.framework/Libraries/em64t/libmkl_core.a \
    -lguide -lpthread
When I use it on the "old" Mac Pro (Mac OS X version 10.5.6, 2 x 3 GHz Quad-Core Intel Xeon, 16 GB memory, Intel Fortran version 10.5.8), my program runs in a reasonable time.
But when I use it on our new-generation Mac Pro (Mac OS X version 10.5.8, 2 x 2.93 GHz processors, 32 GB memory, compiler version 11.1.058), the same program runs about three times slower.
Could you please help me optimize the run time on the second machine?
Thanks!
6 Replies
It looks like you are comparing a uniform-memory (UMA) dual-socket machine with a more recent non-uniform-memory (NUMA) machine. The default BIOS setting on the latter usually interleaves memory, so that accesses alternate between local and remote memory. To optimize, you would select the NUMA BIOS setting and pay closer attention to avoiding false sharing and to thread affinity. You may also need to pay attention to "first-touch" allocation: the first time your program initializes its data arrays, it should do so with the same OpenMP schedule and affinity that are used in the bulk of the later accesses.
Dear tim18,
Thank you for your fast response. However, I am a new user of these machines and operating systems, and I do not have a computer science background, so could you please explain your answer or give more details? Sorry to bother you; I am just trying to get the results of my code much faster.
Princess,
Can you provide the processor numbers for the two machines? Please also include memory information (number of sticks, speed, etc.).
NUMA configuration considerations alone would not alter performance by a factor of 2 to 3.
Cache size and type, combined with your application's memory usage, may produce a difference of this size.
Jim Dempsey
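The details requested above can be gathered from a terminal. The following is a minimal sketch, assuming Mac OS X with its standard `sysctl` and `system_profiler` tools; the function name `report_hw` is made up for illustration and is not part of any tool.

```shell
#!/bin/sh
# Print processor and memory details on Mac OS X; elsewhere, say so and stop.
# report_hw is a hypothetical helper name used only in this sketch.
report_hw() {
    if command -v system_profiler > /dev/null 2>&1; then
        # Processor model string, which includes the processor number.
        sysctl -n machdep.cpu.brand_string
        # Hardware overview plus per-slot memory size, type, and speed.
        system_profiler SPHardwareDataType SPMemoryDataType
    else
        echo "system_profiler not found: this sketch is for Mac OS X only"
    fi
}

report_hw
```

Running it on each machine and pasting both outputs into the thread should cover the processor numbers and the memory layout in one go.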
Quoting - princess_sophie
Dear tim18,
Thank you for your fast response. However, I am a new user of these machines and operating systems, and I do not have a computer science background, so could you please explain your answer or give more details? Sorry to bother you; I am just trying to get the results of my code much faster.
Also, if it is primarily a concern with MKL performance, the MKL forum would be a good resource.
Hello Jim,
Thank you very much for your help. I have no idea what NUMA is, so I had better first give you some details about the machines; hopefully this information will be enough to work out how to optimize the program's execution.
For the "faster" computer I have the following information:
Model Name: Mac Pro
Model Identifier: MacPro2,1
Processor Name: Quad-Core Intel Xeon
Processor Speed: 3 GHz
Number Of Processors: 2
Total Number Of Cores: 8
L2 Cache (per processor): 8 MB
Memory: 16 GB
Bus Speed: 1.33 GHz
Boot ROM Version: MP21.007F.B06
Memory:
Four of these:
DIMM Riser A/DIMM 1:
Size: 2 GB
Type: DDR2 FB-DIMM
Speed: 667 MHz
Status: OK
And four of these:
DIMM Riser B/DIMM 1:
Size: 2 GB
Type: DDR2 FB-DIMM
Speed: 667 MHz
Status: OK
Whereas for the "slow" machine I found:
Model Name: Mac Pro
Model Identifier: MacPro4,1
Processor Name: Quad-Core Intel Xeon
Processor Speed: 2.93 GHz
Number Of Processors: 2
Total Number Of Cores: 8
L2 Cache (per core): 256 KB
L3 Cache (per processor): 8 MB
Memory: 32 GB
Processor Interconnect Speed: 6.4 GT/s
Boot ROM Version: MP41.0081.B04
SMC Version (system): 1.39f5
SMC Version (processor tray): 1.39f5
With 8 memory slots like this:
Memory Slots:
ECC: Enabled
DIMM 1:
Size: 4 GB
Type: DDR3 ECC
Speed: 1066 MHz
Status: OK
Is this enough information to determine why one is 3 times slower? And would it be possible to help me choose the best compilation parameters for each computer?
Thank you all!
You have 4 variables: 2 machines and 2 compilers. Compile the code with -i-static on each machine, copy each executable to the other machine, and run both executables on each machine. This will show whether the compiler or the computer is responsible.
ron
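The cross-test described above could be scripted roughly as follows. This is only a sketch: the executable names `./prog_built_on_old` and `./prog_built_on_new` and the helper `time_run` are hypothetical placeholders, and the timing has whole-second resolution, which should be enough for runs that differ by a factor of 2 to 3.

```shell
#!/bin/sh
# Run a command once and print "<label> exit=<status> seconds=<elapsed>".
# The command's own terminal output is discarded; files it writes are untouched.
time_run() {
    label=$1
    shift
    start=$(date +%s)
    "$@" > /dev/null 2>&1
    status=$?
    end=$(date +%s)
    echo "$label exit=$status seconds=$((end - start))"
}

# Hypothetical names: one executable built with the older compiler on the old
# machine, one built with the newer compiler on the new machine, both copied
# into the current directory.
for exe in ./prog_built_on_old ./prog_built_on_new; do
    if [ -x "$exe" ]; then
        time_run "$exe" "$exe"
    else
        echo "$exe not found, skipping"
    fi
done
```

Running the same script on both machines gives four timings. If the slowdown follows the executable, the compiler or its options are the likely cause; if it follows the machine, the hardware or its configuration is.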