Using VTune to troubleshoot application run time

CStac · ‎02-25-2015

Greetings,

TL;DR: I have two generations of processors where identical systems differ drastically in performance, and I am trying to use VTune to figure out why.

Full story: A user recently sent me a complaint that the new cluster was slower than the old cluster. This didn't surprise me too much, as we got a really good deal on the procs for the new cluster and went with quantity for the parallel applications over a small number of "faster" processors. The old cluster was a hodge-podge collection of nodes built from whatever the fastest proc was that we could afford at the time. That caused a lot of problems for our parallel users, so we wanted to stay close to a uniform cluster this time. I did a lot of application testing with a good sample of our apps, and the difference between the fastest nodes on the old cluster and the nodes on the new one was trivial.

The "fast" nodes on the old cluster are the Xeon Westmere X5687:
http://ark.intel.com/products/52578/Intel-Xeon-Processor-X5687-12M-Cache-3_60-GHz-6_40-GTs-Intel-QPI

The nodes on the new cluster are the Xeon Sandy Bridge E5-2670:
http://ark.intel.com/products/64595/Intel-Xeon-Processor-E5-2670-20M-Cache-2_60-GHz-8_00-GTs-Intel-QPI

When he complained that it was four times slower, that bothered me. I ended up running a series of tests across multiple systems using the same binary and verified that:

* The user just had some crazy bad luck when he initially did his benchmarking. However, he was right in finding that there is a problem.
* One X5687 processor will run the code in 3 min 30 sec on average after many different runs, including after a fresh reboot. It is really consistent in running between 3:20 and 3:45.
* A second X5687 processor that *should* be identical consistently runs just over 9 minutes.
* One E5-2670 processor consistently averages 7 minutes.
* A second E5-2670 processor consistently averages 15 minutes (hence the 4x slower response from the user).
* I have a wide variety of ranges from other E5-2670 processors, with the average sitting closer to 9 minutes.
* I have a Xeon X5675 that is my "If I break it no one cares" test system, which I can beat up with VTune and testing. It consistently runs just under 4 minutes.

My theories for the discrepancies between the processors are:

* Possibly some sort of cache/memory alignment problem?
* There is only one random number generated at the very beginning. The code should be pretty uniform after that point, and the tests were all with the same binary. Maybe I need to compile with ifort to target specific architectures? Compiling a binary for each processor family did not seem to make much of a difference, but maybe there are other flags I should try.

However, those points would explain the difference between processor types, not the differences between processors of the same type. Maybe it is something as simple as a CPU feature (like the virtualization flag) enabled on one host and not the other. However, I can't seem to find that difference.

So I have turned to VTune in an effort to figure this out. VTune has pointed out several issues with the code (which we are working on) and there are improvements to be made, but so far I don't see anything that would tell me why it runs slow on one system and faster on an "identical" one. If it were just that one processor type was faster than the other, this wouldn't be an issue. But I have been running tests, poring over VTune output, and hitting up forums for the past few days, and I feel like I am not getting anywhere in explaining this mystery.

I would greatly appreciate advice or suggestions on how I might better figure out why there is such a large difference between "identical" systems. What should I be looking for in VTune? Is there a specific test I should run?

Thanks!

Peter_W_Intel · ‎02-26-2015

The X5687's base frequency is 3.6 GHz, but the E5-2670's is only 2.6 GHz. However, the E5-2670 has more cores, a bigger Smart Cache, and the AVX instruction set extension, versus SSE4.2 on the X5687. This means that if you compiled the code for the X5687 and run that binary on the E5-2670 (with no heavy memory access, only SSE4.2 instructions, and only 8 threads sharing the work), the X5687 performs better because of its higher base frequency. You can measure CPU clocks in the critical functions, and the CPI metric can be used as well.

> Maybe I need to compile with ifort to target specific architectures? Compiling a binary for each processor family did not seem to make much of a difference, but maybe there are other flags I should try.

1. That is right! You have to build the code for each target system. For example, try running more parallel tasks on the E5-2670; if you use OpenMP*, omp_set_num_threads(16) can be used.

2. Use AVX instead of SSE4.2. I recall that you can use advanced compiler switches such as "-xHost". Other compiler switches such as "-O2" or "-O3" can be used as well.

3. Measure the time spent in critical functions using the VTune report; you can filter the results by function or module. The start and end times of a critical function can also be observed in the timeline panel of the bottom-up report.

4. You can use general-exploration analysis to compare VTune results from two sessions (with different binaries for the different processors), looking at L1/L2/LLC cache misses.

5. Use Concurrency analysis to see the parallelism level.

Bernard · ‎02-26-2015

You also need to take into account how much the same code benefits from the newer processor generation (Sandy Bridge Xeon in your case). For example, if the profiled code is not easily parallelized and scales well with CPU frequency, then I think the Westmere CPU can outperform the Sandy Bridge CPU.
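As a rough sanity check on the frequency argument, the sketch below projects the 3 min 30 sec X5687 timing onto the E5-2670 base clock, using the 3.6 GHz and 2.6 GHz figures quoted in this thread. It assumes run time is inversely proportional to clock speed, which is an idealization, and the function name is made up for illustration.

```python
def scaled_runtime_seconds(ref_seconds, ref_ghz, target_ghz):
    """Estimate run time on another CPU, assuming time scales as 1 / clock.

    This ignores cache size, memory bandwidth, IPC differences, and turbo,
    so treat the result as a ballpark figure only.
    """
    if ref_seconds < 0 or ref_ghz <= 0 or target_ghz <= 0:
        raise ValueError("time must be non-negative and frequencies positive")
    return ref_seconds * ref_ghz / target_ghz


if __name__ == "__main__":
    # 3 min 30 sec on the 3.6 GHz X5687, projected onto a 2.6 GHz clock.
    estimate = scaled_runtime_seconds(210.0, 3.6, 2.6)
    print(f"Frequency-only estimate: {estimate / 60:.1f} min")
```

Under that assumption the estimate comes out just under 5 minutes, so the difference in base clock alone would not account for runs of 7 to 15 minutes.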

CStac · ‎02-26-2015

Thanks for the compiler switches. I will test them out and see what difference they make.

Peter Wang (Intel) wrote:

4. You can use general-exploration analysis to compare vtune results from two sessions (which is different binary for different processor), L1/L2/LLC cache miss

Since the real problem I have is that two "identical" systems are giving me drastically different results, how can I use this "general-exploration analysis" to compare the two "identical" systems? I know there is a difference between different processor families, but with two identical processors I should be getting very similar results. I guess I am not sure what I should be looking for when comparing two runs on two different hosts with the same processor.

McCalpinJohn · ‎02-26-2015

The processors may be identical, but the runtimes are not, so there must be one or more reasons for the difference.

It would not make sense to do a *static* analysis of the binary (since the binary is identical), but the "general exploration analysis" of VTune is a dynamic runtime analysis that is likely to show ways in which the execution differs across the two systems.

CStac · ‎02-26-2015

Thanks John. I ran a general exploration on one of the hosts and I am still trying to sort out what it is telling me. I will run another test on the other host in a bit to see how it compares.

McCalpinJohn · ‎02-26-2015

Good luck! Sometimes seeing "how" the runs are different points fairly directly to "why", but sometimes it does not really help at all :-(

We have a Xeon v3 node in our lab that gives poor and variable performance compared to other nodes that seem identical. It took a while to realize that there is a problem with the cooling system on this particular box, so it is overheating very quickly and throttling the clocks. We will ship it back to the vendor, but in the meantime I was encouraged to build a bunch of test code so that we can monitor temperature and frequency on all of the boxes.

Using these tools we discovered that there is a fair amount of variability in the average clock speed of identical Xeon v3 processors when they are running in the power-limited regime. (So far this power limitation only happens when we run LINPACK using more than 1/2 of the cores, but we are on the lookout for other codes that might also run into the power limits.)
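The test code mentioned above is not shown in the thread. As a minimal sketch of the same idea on Linux, the snippet below reads per-CPU frequency and thermal-zone temperature from sysfs. It assumes a cpufreq driver is loaded and that the kernel exposes thermal zones; the function names and the `sysfs_root` parameter are illustrative, not part of any existing tool.

```python
from pathlib import Path


def read_cpu_freqs_mhz(sysfs_root="/sys"):
    """Return {cpu_name: current frequency in MHz} from the cpufreq sysfs files.

    CPUs without a readable scaling_cur_freq file are skipped.
    """
    freqs = {}
    cpu_dir = Path(sysfs_root) / "devices" / "system" / "cpu"
    for path in sorted(cpu_dir.glob("cpu[0-9]*/cpufreq/scaling_cur_freq")):
        try:
            khz = int(path.read_text().strip())  # sysfs reports kHz
        except (OSError, ValueError):
            continue
        freqs[path.parent.parent.name] = khz / 1000.0
    return freqs


def read_temps_celsius(sysfs_root="/sys"):
    """Return {thermal_zone: temperature in degrees C} from /sys/class/thermal."""
    temps = {}
    zone_dir = Path(sysfs_root) / "class" / "thermal"
    for path in sorted(zone_dir.glob("thermal_zone*/temp")):
        try:
            millideg = int(path.read_text().strip())  # sysfs reports millidegrees
        except (OSError, ValueError):
            continue
        temps[path.parent.name] = millideg / 1000.0
    return temps


if __name__ == "__main__":
    print("MHz:", read_cpu_freqs_mhz())
    print("degC:", read_temps_celsius())
```

Sampling these values periodically while a job runs, and comparing the output across nodes, is one way to spot a box that is throttling.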

Bernard · ‎02-26-2015

>>>I know there is a difference between different processor families but with two identical processors I should be getting very similar results.>>>

I presume that the two processors are running in different physical machines. There are many factors which can skew the results. Start by checking the CPU temperature under load, and verify that voltage fluctuation is within the norm. From the software point of view, if the OS is Windows, I would recommend system-wide testing with a tool like Windows Performance Recorder (Xperf). Although VTune can provide similar information, WPR gathers a lot of so-called performance events from various providers (disk I/O, DPCs, ISRs, etc.).

Peter_W_Intel · ‎02-26-2015

>... like Windows Performance Recorder (Xperf) although VTune can provide similar information WPR is gathering a lot of so called performance events from various providers(Disk I/O, DPCs,ISRs etc....).

You can also use VTune to measure wait time and wait counts, to see whether there are I/O waits, pending threads, etc.

CStac · ‎02-28-2015

Greetings,

Just an update from the past 6 hours of testing and running code through VTune on multiple hosts. The short of it is this:

* If I run one process (stand-alone, through the scheduler, or with VTune, it doesn't matter), the time varies pretty wildly. Usually the times on the same host are very similar, but between hosts they can be 6 minutes or 15 minutes.
* If I run multiple processes, not only do all the runs take around 5:30 min, but the time is consistent across every node I sampled.
* I have pored through multiple VTune runs and I can't spot any difference between one that runs in 7 min and one that runs in 15 min (except the obvious time itself). The runs spend equivalent amounts of time in the same functions doing very similar things.

Knowing that I get consistent times if I run multiple jobs tells me that something funny is going on with the Max Turbo Frequency scaling. From what I can tell, I don't see the frequency ever change, but it has to be doing something to produce these crazy times when there is a single thread and consistent times when there are multiple threads.

The real kicker I don't quite understand yet is why I am getting /better/ times with multiple single-core jobs. I would think that the 3.3 GHz Max Turbo would make a single-core job run the fastest, but that isn't what I am seeing. Is it possible I am missing a package for my system (Scientific Linux 6.6, which is a clone of Red Hat EL 6.6)? I thought the kernel handled all of the scaling issues... I need to research this more.

The other thing that I am going to do Monday when I get into the office is double-check every BIOS setting for Turbo/frequency scaling. Maybe there are BIOS differences that I missed.

Thanks!

Bernard · ‎03-01-2015

Peter Wang (Intel) wrote:

>... like Windows Performance Recorder (Xperf) although VTune can provide similar information WPR is gathering a lot of so called performance events from various providers(Disk I/O, DPCs,ISRs etc....).

Also can use VTune : measure wait time / wait counts, if there are some IO waits or threads' pending, etc.

Yes of course. I usually start system wide profiling with the help of WPR and later switch to VTune for in depth application profiling.

CStac · ‎03-02-2015

SOLVED!

As with most things in life, the answer is ridiculously simple. But the journey to get to that answer was a challenging learning process.

$ sudo yum install -y cpuspeed
$ sudo service cpuspeed start

The slowest node, which used to run the job in 15 minutes, now runs it in 3 min 34 sec. I still find it _VERY_ curious that there is so much variance between processors without that package, but with it installed, my testing so far shows that the nodes are now very similar in performance, as they should be. How is that for a simple smack to the back of the head?

Here is the next question for the VTune experts. Knowing that the problem was the Linux kernel not scaling properly due to a missing software package, what should I look for in my VTune runs that might have clued me into this faster? I don't see anything specifically saying "Here is what the average CPU frequency was during your run" or anything remotely similar. Can anyone point out a parameter I should look at? I would greatly appreciate any information or wisdom you might have for my own personal VTune learning experience.

Thank you!
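Outside of VTune, one way this kind of problem might be caught earlier is to dump the cpufreq policy on each node and compare the results. The sketch below is a hedged illustration only: it assumes the standard Linux cpufreq sysfs layout, and the helper names and the `sysfs_root` parameter are hypothetical. A CPU with no `cpufreq` directory usually means no frequency-scaling driver is loaded, which would be consistent with the symptom described here.

```python
from pathlib import Path


def _read_text(path):
    """Return the stripped contents of a sysfs file, or None if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def _khz_to_mhz(value):
    """Convert a kHz string from sysfs to MHz, or None if missing or malformed."""
    try:
        return int(value) / 1000.0
    except (TypeError, ValueError):
        return None


def cpufreq_policy(sysfs_root="/sys"):
    """Map each logical CPU to its governor and min/max scaling frequency.

    A value of None means the CPU has no cpufreq directory at all.
    """
    policy = {}
    cpu_dir = Path(sysfs_root) / "devices" / "system" / "cpu"
    for cpu in sorted(cpu_dir.glob("cpu[0-9]*")):
        freq_dir = cpu / "cpufreq"
        if not freq_dir.is_dir():
            policy[cpu.name] = None
            continue
        policy[cpu.name] = {
            "governor": _read_text(freq_dir / "scaling_governor"),
            "min_mhz": _khz_to_mhz(_read_text(freq_dir / "scaling_min_freq")),
            "max_mhz": _khz_to_mhz(_read_text(freq_dir / "scaling_max_freq")),
        }
    return policy


if __name__ == "__main__":
    for cpu, settings in cpufreq_policy().items():
        print(cpu, settings if settings is not None else "no cpufreq driver")
```

Running this on a fast node and a slow node and diffing the output would show whether the two are really configured identically.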

high_end_compute · ‎09-20-2024

I think the VTune summary tells you the processor frequency, but I always look (on Linux systems) at /proc/cpuinfo whenever running jobs. Some nodes will have fixed frequencies and some will have variable ones (for the latter you may need to warm up if you want to compare peak code performance).
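A minimal sketch of the /proc/cpuinfo check, assuming an x86 Linux system where each logical CPU reports a "cpu MHz" line (other architectures may not have this field). The function name is illustrative.

```python
import re


def parse_cpu_mhz(cpuinfo_text):
    """Extract the 'cpu MHz' value for each logical CPU from /proc/cpuinfo text."""
    pattern = r"^cpu MHz\s*:\s*(\d+(?:\.\d+)?)\s*$"
    return [float(m) for m in re.findall(pattern, cpuinfo_text, flags=re.MULTILINE)]


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as fh:
            mhz = parse_cpu_mhz(fh.read())
    except OSError:
        mhz = []
    if mhz:
        # A wide gap between min and max suggests frequency scaling is active.
        print(f"cores={len(mhz)} min={min(mhz):.0f} MHz max={max(mhz):.0f} MHz")
    else:
        print("no 'cpu MHz' entries found")
```

Running it once while the node is idle and again while the job is running shows whether the clock actually ramps up under load.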