Intel® C++ Compiler
Support and discussions for creating C++ code that runs on platforms based on Intel® processors.

Itanium vs. Xeon

tangouniform
Beginner
236 Views

I have been appointed the performance person on a large simulation that was originally built on an IRIX platform. We have "upgraded" to Xeon boxes for development and Itaniums for production. The simulation is a discrete event simulation capable of running on many nodes/processors. Our product is 1.5 million SLOC and is a bear to tune. To date we have always used internal performance data to tune the software, but that is no longer sufficient. The data we see coming out of our framework says we should have no issues running real-time... but we do.

This is a new position for me, so I have spent months researching and about every week I read something that I think will solve our problems, but it does not. So any info you can provide is helpful.

The Itanium is a pig... I know that is not really true and that the application needs to be tuned, but what is the best way to go about it? I tried VTune, but it dies; I have used PGO with minor success. The compile options I have used and their results follow:

using icpc on itanium

-O2 => CPU = 1050 sec; wall = 474 sec

-O3 => CPU = 953 sec; wall = 440 sec

-O3 -mtune=itanium2-p9000 => CPU = 943 sec; wall = 437 sec

-O3 -mtune=itanium2-p9000 -prof-use => CPU = 801 sec; wall = 378 sec
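For anyone unfamiliar with PGO, the cycle behind -prof-use looks roughly like this (a sketch only; "sim", the source list, and the training scenario are placeholder names, and flag spellings may vary by compiler version):

```shell
# Sketch of the icpc profile-guided optimization cycle.
icpc -O3 -prof-gen -prof-dir ./profdata -o sim *.cpp     # instrumented build
./sim representative_scenario.cfg                        # training run(s)
icpc -O3 -mtune=itanium2-p9000 -prof-use -prof-dir ./profdata -o sim *.cpp
```

The quality of the -prof-use build depends heavily on how representative the training runs are of production workloads.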

Now at this point I was excited; that is a definite improvement. But then I compared it to an identical run on a Xeon platform:

-O2 => CPU = 517 sec; wall = 255 sec

So all this time making runs for the profiler to generate profiling information was for naught... the Xeon is still faster.

Where is the best place to start tuning this application? Do tuning changes directed toward the Itaniums usually impact Xeon performance negatively? Apparently we are to support both now.

Sorry for the lengthy post, but I am at my wits' end.

Thanks in advance

Tango

3 Replies
TimP
Black Belt
There really is no way to comment on this without more information from you. Even for applications which I have worked on for several years, I could easily guess wrong about which data sets would run faster on the best IA-64 systems, compared with the best Xeon systems, and several experts have woefully misguessed which Xeon systems are appropriate.
From this point, if it's worth the effort to you, you should be profiling your application with gprof, VTune, or some other appropriate tool, finding the hot spots, and determining whether improvements can be made there. You may have done most of what can be done without looking at details.
You didn't say whether you have checked function by function whether -O2, with or without software prefetch, or -O3 may be best. This is likely to vary function by function within your application. I don't understand from your statement whether you think you have covered this already.

tangouniform
Beginner

Tim,

Thanks for the reply. The software is a parallel discrete event simulation. The way it works is:

The framework spawns off executables that run on the number of nodes the user has specified. The models/executables then run as fast as they can to a certain time.

When a model reaches a state of ready for synchronization, the framework "spins", doing meaningless work and system calls while looking for cross-node messages. During this time, if a message is received that is "in the past", the model rolls back to that time and re-executes its code.

Once everything is synched up, all nodes are released and each model races off.
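The rollback behavior described above is in the spirit of optimistic (Time Warp-style) synchronization. A minimal sketch of the state-restore part, assuming a per-event checkpoint of an integer-valued state (the struct, member names, and int state are all illustrative, not our framework's actual API):

```cpp
#include <cassert>
#include <iterator>
#include <map>

// Hypothetical logical process with checkpoint-based rollback.
struct LogicalProcess {
    int state = 0;
    int vt = 0;                               // local virtual time
    std::map<int, int> checkpoints{{0, 0}};   // virtual time -> saved state

    // Process one event at virtual time t, then checkpoint the new state.
    void execute(int t, int delta) {
        vt = t;
        state += delta;
        checkpoints[t] = state;
    }

    // A straggler message arrived with timestamp t < vt: restore the last
    // state saved strictly before t and discard now-invalid checkpoints.
    void rollback(int t) {
        auto it = checkpoints.lower_bound(t); // first checkpoint at/after t
        assert(it != checkpoints.begin());    // initial checkpoint is kept
        --it;                                 // last checkpoint before t
        state = it->second;
        vt = it->first;
        checkpoints.erase(std::next(it), checkpoints.end());
    }
};
```

In the real simulator the rolled-back events would also be re-enqueued and anti-messages sent; the sketch only shows the state restore.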

Since the issues are more prevalent when running on more than one node, gprof misses some of them. I have aggressively used gprof to tune the code for one node, but I need to understand what is going on when more than one node is involved. I think VTune can give me this information, but as I have yet to create a successful activity with it, I am not sure of its capability. I can only hope it reveals something that gprof has been unable to uncover. I am investigating how to use perfmon to get some insight on the hot spots/bottlenecks, but VTune looks so pretty. :)

My current optimization settings, which have shown some improvement, are -O3 -mtune=itanium2-p9000 and, in some situations, -opt-mem-bandwidth1.

As far as checking function by function, with or without prefetch, I am not sure what the best way to do that is. Would VTune give me insight on that?

At the deployed sites we are starting to use cpusets and dplace to enforce processor affinity.
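For reference, a typical invocation looks something like the following ("sim" and "scenario.cfg" are placeholder names; see dplace(1) and cpuset(1) on the target system for the exact syntax):

```shell
# Illustrative only: pin the simulation's processes to CPUs 0-3.
dplace -c 0-3 ./sim scenario.cfg
```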

Do these seem like acceptable paths to tune the software for the Itaniums? It seems that Itaniums are one of the few options we have. Our customers consistently need 40-50 nodes, as the scenarios we run have hundreds of models involved.

Thanks

TimP
Black Belt

I was assuming that you could compare performance by using a single thread, or by comparing the dominant thread seen by gprof. I admit that I have been lucky in this respect.

VTune does collect events by thread, allowing you to compare performance function by function, for each thread. You would build the application with several likely groups of optimization switches, profile each build, and search to see which optimization switches are best for each function. Needless to say, there could be some cross effects; aggressive optimizations which increase code size could speed up an important function but slow down several others slightly, for example by increasing instruction cache and TLB misses.

With VTune, you have the ability to collect specific events, and to see a view showing the exact point in the application which is responsible. Much of this is beyond the basic functionality of sorting out whether you should select optimizations function by function.

IA-64 option -O3 does aggressive loop splitting and unrolling, to facilitate software prefetch. If your data are already resident in cache, this may slow down parts of your application. -O2 still spends some time on software prefetch; for example, bringing data into cache may waste time if it is done in preparation for code branches which are seldom executed, so there are options to remove those software prefetches.
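Per-loop control is possible without changing the global switches; a minimal sketch using icpc's prefetch pragmas (the pragmas are Intel-compiler-specific and other compilers will ignore them, possibly with a warning; the function names and data are illustrative only):

```cpp
#include <cassert>

// Sum a cache-resident array: ask the compiler not to emit prefetches.
double sum_resident(const double *a, int n) {
    double s = 0.0;
#pragma noprefetch
    for (int i = 0; i < n; ++i)
        s += a[i];
    return s;
}

// Sum a streaming array: request software prefetch of a's elements.
double sum_streaming(const double *a, int n) {
    double s = 0.0;
#pragma prefetch a
    for (int i = 0; i < n; ++i)
        s += a[i];
    return s;
}
```

Both functions compute the same result; only the generated prefetch instructions differ, so the two variants can be timed against each other per call site.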

Current Xeon models do well in MPI cluster applications. The range of applications where Itanium is superior has been shrinking with each round of new model introductions. On future roadmaps, Itanium is marketed primarily as a high-reliability enterprise solution, not as the "efficient performance" solution in those cases where Xeon is satisfactory. Several vendors are expected to introduce large multi-processor systems tailored for specific types of applications, based on the future QuickPath Interconnect Xeon products. This prospect has already limited the level of effort people are willing to invest in optimizing current products.
