Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Intel Compiler Speed Disappointing - What am I doing wrong?

pickup02westnet_com_
I have just installed the trial Intel Fortran compiler on a Mac: 2.8 GHz dual quad-core with 18 GB RAM & 3 Gb/s drives.

Compiled my first program with compiler switches -assume byterecl -xT -ipo -m64

When I run the same program, compiled with an old Lahey F95 compiler, on a PC with a 2.2 GHz Core Duo, 4 GB RAM & similar hard drives, it is about 25% faster.

Surely I'm doing something wrong?
TimP
Honored Contributor III
My old Lahey compiler generates x87 code. If you mix single and double precision, that may run faster than SSSE3 code. In such a case, it's possible -auto-double would run faster.
Who will be the first to reconstruct the source code from this description?
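
To illustrate the kind of mixing Tim describes, here is a minimal hypothetical fragment; the variable names and values are invented for illustration. On x87 the whole expression is evaluated in 80-bit registers, so mixing precisions costs little, while SSE code must insert explicit single/double conversion instructions.

[cpp]! Hypothetical mixed-precision fragment (names and values illustrative).
! x87 evaluates everything in 80-bit registers; SSE/SSSE3 code must
! insert explicit single<->double conversions at each mixed use.
program mixed
  implicit none
  real :: a
  double precision :: b, c
  a = 1.5
  b = 2.5d0
  c = a * b + 1.0   ! 'a' and the single-precision literal are promoted to double
  print *, c
end program mixed
[/cpp]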
pickup02westnet_com_
Quoting - tim18
My old Lahey compiler generates x87 code. If you mix single and double precision, that may run faster than SSSE3 code. In such a case, it's possible -auto-double would run faster.
Who will be the first to reconstruct the source code from this description?

No, it's all single precision & integer. There is a lot of sorting & IF ... THEN operations but not a lot of calculation.

Some large arrays get read in but that's about the same speed.
Ron_Green
Moderator

No, it's all single precision & integer. There is a lot of sorting & IF ... THEN operations but not a lot of calculation.

Some large arrays get read in but that's about the same speed.

This still is not enough information to help. Can you share the code, input files, and expected wall-clock times?

But a couple of thoughts come to mind: your PC/Lahey build was probably 32-bit. Why are you throwing the -m64 switch? Do you expect 64-bit code to be faster than 32-bit (-m32) code? Or do you need the large arrays afforded by -m64?

And you came out of the gate shooting for maximum optimization. What did -O0 show? Did you try -O1? If your code is branchy sorting code, there may not be any benefit from vectorization and other -O2-level optimizations.

But again, it's hard to say without a good example code. It's all speculative at this point.

ron


pickup02westnet_com_

Quoting - Ron_Green

This still is not enough information to help. Can you share the code, input files, and expected wall-clock times?

But a couple of thoughts come to mind: your PC/Lahey build was probably 32-bit. Why are you throwing the -m64 switch? Do you expect 64-bit code to be faster than 32-bit (-m32) code? Or do you need the large arrays afforded by -m64?

And you came out of the gate shooting for maximum optimization. What did -O0 show? Did you try -O1? If your code is branchy sorting code, there may not be any benefit from vectorization and other -O2-level optimizations.

But again, it's hard to say without a good example code. It's all speculative at this point.

ron



Thanks Ron.

I'm happy to share the code, but it's 1,300 lines. What's the best way of making it available? The input files are too large to share.

-m64 was turned on because I'll need large arrays down the track. Turning it off gave a small speed increase, but not much.

I've turned off the other compiler switches - minimal effect on execution time.

Currently trying -O0 & -O1.
TimP
Honored Contributor III
I assume you're using "-m64" as shorthand for the use of the Intel64/x86_64 compiler. Are you saying that your application runs faster when compiled with the 32-bit compiler? As far as I know, literally using the -m64 switch has no effect, unlike gfortran, where -m32 could be used to select the 32-bit compiler when using the 64-bit compiler driver.
If you can show a subroutine which performs most of the work, that might be enough.
pickup02westnet_com_
Quoting - tim18
I assume you're using "-m64" as shorthand for the use of the Intel64/x86_64 compiler. Are you saying that your application runs faster when compiled with the 32-bit compiler? As far as I know, literally using the -m64 switch has no effect, unlike gfortran, where -m32 could be used to select the 32-bit compiler when using the 64-bit compiler driver.
If you can show a subroutine which performs most of the work, that might be enough.

I've turned off -m64 ... it made no difference.

Some comparison data:

Lahey Fortran 95 on PC: 13 mins 32 secs

PGI Fortran 64-bit on Mac Pro (default compiler settings): 18 mins 46 secs

Intel 32-bit on Mac Pro, switches -O3 -xT -ipo: 17 mins 10 secs
Intel 64-bit on Mac Pro, switches -O3 -xT -ipo: 17 mins 18 secs
Intel 64-bit on Mac Pro, switches -O2: 16 mins 34 secs
Intel 64-bit on Mac Pro, switches -O1: 16 mins 25 secs

GCC Fortran 95 on Mac Pro: slower than the Intel & PGI compilers

This is starting to look like these Macs are not as fast at Fortran number crunching as I'd expected.

I plan to try Absoft's compiler next & to try out the new Lahey compiler once I've got Linux up on the Mac. After that I'll look at pasting some code.

Thanks again for the suggestions
jimdempseyatthecove
Honored Contributor III

Can you determine whether the problem is related to computation speed or I/O speed? If I/O, then maybe the defaults for I/O buffering are different; they can be changed on your OPEN statement(s) (BLOCKSIZE, BUFFERED, BUFFERCOUNT, ...).
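
As a minimal sketch of what that might look like with Intel Fortran's extensions; the unit number, file name, and buffer sizes below are illustrative, not recommendations:

[cpp]! Minimal sketch: enlarging I/O buffering on an OPEN statement.
! BLOCKSIZE, BUFFERED and BUFFERCOUNT are Intel Fortran extensions;
! the unit number, file name and sizes here are illustrative only.
program open_buffered
  implicit none
  integer :: ios
  open(unit=20, file='bigdata.dat', form='unformatted', &
       access='sequential', action='read', status='old', iostat=ios, &
       blocksize=1048576, buffered='YES', buffercount=4)
  if (ios /= 0) stop 'could not open bigdata.dat'
  ! ... reads of the large arrays would go here ...
  close(20)
end program open_buffered
[/cpp]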

If you have a profiler, you can find the bottleneck.

(I assume you have turned off runtime checks such as array bounds checking)

Jim Dempsey


TimP
Honored Contributor III
If you're talking about a Mac with a Nehalem-style CPU, evidently that CPU has different strengths and weaknesses from other recent CPUs. Your finding a code sequence which runs faster with an old compiler might be interesting, if you would be more informative.
The Mac Nehalem product made the somewhat strange decision to use FB-DIMM, like the Woodcrest, possibly so as to avoid developing a new motherboard for DDR3, so its performance can't approach that of the Xeon 5500 product which was launched more recently. It doesn't seem like a choice which corresponds with the idea most people have of "number crunching."
You still leave too many questions unanswered, and this thread has become unproductive.
Ron_Green
Moderator
And to follow up on Tim's notes on hardware differences:

You are trying to compare a dual-socket server with a single-socket system. The memory latency is quite a bit different in these two scenarios. I remember a similar case a few years back where an application ran faster on an iMac than a Mac Pro. That application was pointer-chasing very small data structures. The memory accesses were essentially random, with reads and writes shotgunned all over the address space. In this scenario, caching just gets in the way: memory bandwidth is irrelevant and all that matters is the latency to memory.

A single-socket system doesn't have to worry about other processors potentially having data cached on another socket. If the data isn't in its caches, it'll just go to memory and get it. End of story.

For dual- and quad-socket systems it's more complex. If you can't find the data in your local caches, there is a chance it's in the cache of another socket/processor. So there is additional latency to check: "hey, do you have the data at address X?", "Nope, ain't got it", "OK, I'll go out to memory then and see if I can find it". That is additional overhead, or latency, that you don't have with a single-socket system.

Dual- and quad-socket servers are designed for THROUGHPUT, not raw single-thread performance. Try this: run 8 instances of this application simultaneously on the Mac Pro and the PC. I think I know which system will win; the single-processor system will thrash around like a fish out of water. The Mac Pro is designed to run multiple tasks simultaneously, not to run a single program fast. This is so you can be encoding multiple video files simultaneously and still have plenty of horsepower to run Photoshop, Excel, Firefox, etc., with each application appearing unperturbed by the others. Or, if your application is multithreaded, you can see dramatically faster execution of parallel applications.

If you just run one code and it's single-threaded (not parallel), it will probably run faster on a single-socket system.

Just my 2 cents.

ron
pickup02westnet_com_
1. Yes, array bounds checking is off

2. It's a Harpertown Xeon, not a Nehalem

3. Tim18 ... thanks for your comments.

4. Ron ... I guess you've nailed my problem.

Thanks to all for responding.
TimP
Honored Contributor III
I don't know the detailed characteristics of Mac OS, but:
supposing you run a single-threaded application on a dual-socket machine with 4 distinct caches, like Harpertown, you may see a significant improvement when you use a utility such as taskset to pick a preferred core for it to run on.
In the worst case, if your application continually jumps cores, it may have to keep all 4 caches up to date, rather than just one.
Ron_Green
Moderator
Mac OS X is generally quite good at thread affinity, unlike that one OS that somehow became quite popular. However, it does not support thread binding. There is a good tech note from Apple on this:

http://developer.apple.com/ReleaseNotes/Performance/RN-AffinityAPI/index.html

To quote: "Mac OS X does not export interfaces that identify processors or control thread placement; explicit thread to processor binding is not supported. Instead, the kernel manages all thread placement. Applications expect that the scheduler will, under most circumstances, run its threads using a good processor placement with respect to cache affinity."

Our OpenMP runtime developers have also stated that Mac OS X is quite good with thread affinity and in MOST cases threads will stay bound to one core for their lifetime.

The API noted in the above URL allows one to give the OS hints about which threads should be considered part of an affinity group. However, it does not guarantee binding.

The short answer: a single-threaded application will stay on one core for its lifetime unless the system is REALLY REALLY bogged down.

ron
jimdempseyatthecove
Honored Contributor III

If Mac OS X has sticky (tacky) thread affinity, then on the now Intel-based systems an application can use CPUID (or an appropriate system function call) to determine which cache the thread currently resides on. All threads in the application can do this periodically (instead of once, as you can on systems with thread affinity binding). This information can then be used as a filter in your application in the instances where cache locality makes a performance difference.
[cpp]! Note the following is NOT a parallel do:
! every thread executes the entire loop and filters by cache locality.
!$omp parallel private(iObject)
do iObject = 1, nObjects
  ! Process only objects whose cache mask matches this thread's cache
  if (IAND(Object(iObject)%Cache, myCache) .ne. 0) call DoWork(iObject)
end do
!$omp end parallel
[/cpp]

Or variations on the theme

Jim Dempsey
pickup02westnet_com_


Did a bit more work on this by inserting a few timing calls into the code. The speed reduction is not related to I/O.
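
For reference, a minimal sketch of the kind of timing call that can bracket a code section, using the standard SYSTEM_CLOCK intrinsic; the busy_work routine is an invented stand-in for the section under test:

[cpp]! Minimal sketch: bracketing a code section with timing calls using
! the standard SYSTEM_CLOCK intrinsic. busy_work is a stand-in for
! the section being measured.
program timing_demo
  implicit none
  integer :: t0, t1, rate
  call system_clock(count_rate=rate)
  call system_clock(t0)
  call busy_work()
  call system_clock(t1)
  print '(a, f8.3, a)', 'section took ', real(t1 - t0) / real(rate), ' s'
contains
  subroutine busy_work()
    integer :: i
    real :: s
    s = 0.0
    do i = 1, 10000000
      s = s + sin(real(i))
    end do
    print *, 's =', s   ! prevents the loop from being optimized away
  end subroutine busy_work
end program timing_demo
[/cpp]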

I tried creating a Win XP virtual machine on the Mac & running the Lahey build in it, working on the assumption that if the slowdown is related to the Mac's processor hardware rather than OS issues, it would show up there too & would be slower than on the PC.

Result: Win XP virtual machine on the Mac: 11 mins 08 secs. This is faster than the PC (13:32) & the fastest Intel compiler run (16:25), even though I assigned only two processors to the VM.

Beats me, but it looks like something to do with Mac OS.

Tried the Absoft 64-bit compiler as well. With "Advanced" optimisation, the run took 35 mins with array checking off, etc., etc.
TimP
Honored Contributor III
With timing calls, or a profiler such as VTune, you ought to be able to isolate the code where the difference in performance occurs, and characterize it, or maybe quote it here.
Steve_Cousins
Beginner

Beats me, but it looks like something to do with Mac OS.



What does the "load" look like (using the "top" command) when you run this? I ran some comparisons a number of years ago (four or five) on a G5 Xserve and ran into the same thing using the IBM compiler. I ran the 2.0 GHz Xserve (two CPUs) running OS X against a dual-Opteron system running Linux (with the Portland Group Fortran compiler):

CPU(s): Time (mm:ss, less is better)
Opteron 1.6 GHz (1 CPU): 7:15
Xserve G5 2.0 GHz (2 CPUs): 7:11
Opteron 1.6 GHz (2 CPUs): 5:37

The load on the Opterons was pegged at 1. I could never get the load on the Xserves above 0.7, and there was no I/O, so there were no files to wait for. We have a cluster of 256 of these Xserves, and we went to a system that allows each job to specify whether it should use OS X or Linux. With about a 20% speed advantage when running Linux, we hardly ever see anyone run OS X jobs anymore.

Bottom line: based on your findings, it looks like OS X is still not a very efficient OS for running CPU-intensive code. Maybe there are things in the OS that can be turned off to gain some efficiency, although the test above was on a *really* stripped-down instance of OS X.

Steve