Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Weird CPU usage between x5560 and x5660 systems

sridhar_basam
Beginner
Hey,
First time post here. Hopefully this is the right forum.
We are seeing some very weird CPU usage numbers for the exact same software running on an X5560 and an X5660 system. The server in question is a UDP request/response server. On the older-gen Dell systems with the X5560 processor, we typically see the process using on the order of 4-5% CPU for a unit of load. For the very same unit of load, the new Dell systems with the X5660 processor use 20-25% CPU. I am baffled about why this is happening. I have tried to use oprofile to see if there is a different pattern, but I see a similar pattern between the 2 systems, just that the newer system is using a lot more cycles. Anyone have any ideas on what might be happening?
Sridhar
P.S. Both systems run the exact same version of RHEL 5.6.
Patrick_F_Intel1
Employee
Hello Sridhar,
I know you know this but for others info...
The X5560 system is Nehalem-EP: 4 cores, 2.8 GHz, 8 MB LLC.
The X5660 system is Westmere-EP: 6 cores, 2.8 GHz, 12 MB LLC.
So the Westmere based system should be better.

I've seen some software behave poorly due to the not-power-of-2 number of cores.
Also, now there are 50% more cpus on the system. This can expose poor scaling of locking schemes.
Have you tried booting the Westmere box with just 4 cores per socket?
If the software goes back to the 4-5% CPU usage, then we can probably assume the issue is not the box itself but rather something related to going from 16 CPUs to 24 CPUs.
Also, how are you measuring performance?
How many units of work is each system able to do per second?
When you say "oprofile to see if there is a different pattern but see a similar pattern between the 2 systems, just that the newer system is using a lot more cycles"... this leaves me scratching my head...
Can you expand on 'similar pattern but new system is using a lot more cycles'?
Thanks
Pat
sridhar_basam
Beginner
Thanks for the response Pat.
I already tried booting the Westmere host with 4 cores per socket and it did not help. I also took a look at the BIOS settings to see if by chance the newer system had dynamic power scaling, but they are both similar: they let the OS decide, and we don't do any frequency scaling at the OS layer.
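(For what it's worth, verifying that is just a quick check along these lines; it assumes the usual cpufreq sysfs interface and is only a sketch.)
[bash]
# Quick check that the OS is not doing frequency scaling (sketch; assumes the
# standard cpufreq sysfs interface is present):
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor   # same governor expected on every core
grep "cpu MHz" /proc/cpuinfo                                # cores should sit at (near) nominal frequency
[/bash]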
Both the Nehalem and Westmere systems are behind a load balancer (LB) listening for UDP requests, so they are getting a very similar request load. The LB just distributes the incoming queries in round-robin fashion. Performance is measured in terms of how many UDP responses are successfully sent without dropping packets from the socket queue.
At peak, each system can do 60k pkts/sec, but we don't run the systems that high all the time; typical usage is around 20k pkts/sec. For the same inbound packet rate, the older system shows much lower CPU usage.
Sorry about not being clear with the oprofile output. What I meant was that when one looks at the opreport output for the process, the percentage of time being spent in the top routines is the same, but the total cycle count on the Westmere system is much higher. I can't paste the exact opreport output, but hopefully what I have below illustrates what I was trying to say.
Nehalem:                      Westmere:
10000   10%   symbol1         40000   10%   symbol1
 8000    8%   symbol2         32000    8%   symbol2
I ran opcontrol for the same amount of time on both systems. The number of unhalted instructions on the Westmere box is 3-4 times higher but the percentage of time for each of the functions is similar between the 2 processors.
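(For reference, the collection was a standard opcontrol/opreport flow, roughly like the sketch below; the sampling count, run time, and binary path are placeholders.)
[bash]
# Sketch of the opcontrol/opreport collection (sampling count, run time and
# binary path are placeholders):
opcontrol --init                                   # load the oprofile kernel module
opcontrol --setup --no-vmlinux --event=CPU_CLK_UNHALTED:100000
opcontrol --start
sleep 600                                          # run under the normal packet load
opcontrol --stop
opcontrol --dump
opreport -l /path/to/udp_server                    # per-symbol samples and percentages
opcontrol --shutdown
[/bash]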
thanks,
Sridhar
P.S. It is a single-threaded process, if that helps.
Patrick_F_Intel1
Employee
Hello Sridhar,
Let me see if I understand correctly.
You collected the oprofile data from both systems and:
1) you ran for roughly the same amount of time,
2) you had roughly the same load level (in terms of pkts/sec) for both systems,
So the 2 systems should be doing the same amount of work during the interval
3) the profile of the oprofile data is the same (that is, the relative ratio of time spent in routines is about the same)
I assume the percentages above (like 10% in symbol1) mean that 10% of the total clockticks is in symbol1.
4) the number of clockticks in each routine is about 4x higher on westmere
5) the number of instructions retired in each routine is about 4x higher on westmere

Are these statements more or less right?
If so, it really makes me think that the software is somehow spinning looking for work.
That is, the amount of work getting done (total packets) is the same but it takes 4x more instructions to get the work done.

Would you mind running the Intel PCM utility on both systems under load and sending us the output?
You can send the output by private email if you are worried about proprietary information.
PCM is available at http://software.intel.com/en-us/articles/intel-performance-counter-monitor/ and is easy to build on Linux (just unzip it and do 'make').
To run it, do './pcm.x 1', or, during the load, do './pcm.x 5 > nhm.txt' on one box and './pcm.x 5 > wsm.txt' on the other.
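Roughly, the build-and-run steps look like this (a sketch; the archive and directory names are only illustrative):
[bash]
# Build and smoke-test PCM on Linux (sketch; archive/directory names are illustrative):
unzip IntelPerformanceCounterMonitor.zip
cd IntelPerformanceCounterMonitor
make                              # produces pcm.x
./pcm.x 1                         # print counters once per second to the console
[/bash]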
Pat
sridhar_basam
Beginner
Hey,
Yes for questions 1 through 4. For 5, I didn't monitor the INST_RETIRED event; I monitored the CPU_CLK_UNHALTED event. I can redo the oprofile collection and collect both the instructions-retired and unhalted-cycles events.
I will try to get the PCM output you asked for but will have to run that by my manager.
thanks for your help,
Sridhar
Patrick_F_Intel1
Employee
Thanks for the clarification.
Running PCM would give us a measurement on both systems with the same tool.
It would help us get an overall view of CPU, QPI and memory performance.
The overhead is very low (negligible).
You might need to do a 'modprobe msr' before running pcm.x (if the msr module isn't built into the kernel).
So you could run './pcm.x 1 > box1.txt' on each system when they have about the same load.
Then after 30 seconds or so hit control-c to stop the data collection.
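Put together, one collection run amounts to something like the sketch below (the 30-second window and output file name are just placeholders, and 'kill -INT' is the scripted equivalent of control-c):
[bash]
# Sketch of one collection run (30-second window and file name are placeholders):
modprobe msr                      # load the MSR driver if it is not built into the kernel
./pcm.x 1 > box1.txt &            # 1-second samples, redirected to a file
PCM_PID=$!
sleep 30
kill -INT $PCM_PID                # SIGINT is the scripted equivalent of control-c
[/bash]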
Pat
sridhar_basam
Beginner
I am trying to run this on a production host and am having trouble there; it just hangs after outputting the banner below:
Intel Performance Counter Monitor
Copyright (c) 2009-2011 Intel Corporation
It did work on my lab host though, so I'm not sure what is happening. The first time I ran it, it complained about the PMU being occupied by another process and asked if I wanted to reset it. I answered no; I'm not sure if that caused it to hang the next time I ran it.
I will try running it on another host on Monday and get you some output across the 2 hosts.
thanks,
Sridhar
Roman_D_Intel
Employee
Sridhar,

PCM complained about the PMU because a different tool might have been using it at the same time (or is no longer using it but did not clean up the PMU state on exit). Please check whether oprofile could be the reason for that. You might try rebooting the system without oprofile (and its kernel modules, if possible) and trying PCM again.

Thanks,
Roman
sridhar_basam
Beginner

Sorry it took longer than anticipated. Roman, I had to reboot the hosts to be able to reset the PMU state. After successfully running ./pcm.x a couple of times in succession, it has again stopped working. It prints out the banner up to the Intel copyright notice and then just hangs.

Patrick, I collected the output from 'pcm.x 5' for about 60 seconds across the 2 nodes.

thanks,
Sridhar
Patrick_F_Intel1
Employee
Hello Sridhar,
Thanks for the data.
I extracted just the lines with 'Gticks' (the lines starting with 'Instructions retired:')

Looking at the nhm data below (I shortened some of the strings to make each line fit):

inst_ret: 5490 M ; Active cycles: 21 G ; Time (TSC): 13 Gticks ; non-halted residency: 12.92 %
inst_ret: 5545 M ; Active cycles: 21 G ; Time (TSC): 13 Gticks ; non-halted residency: 13.04 %
inst_ret: 6026 M ; Active cycles: 23 G ; Time (TSC): 13 Gticks ; non-halted residency: 13.74 %
inst_ret: 6836 M ; Active cycles: 30 G ; Time (TSC): 13 Gticks ; non-halted residency: 16.04 %
inst_ret: 5958 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 13.57 %
inst_ret: 6122 M ; Active cycles: 23 G ; Time (TSC): 13 Gticks ; non-halted residency: 13.60 %
inst_ret: 5741 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 13.30 %
inst_ret: 5707 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 13.23 %
inst_ret: 5975 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 13.46 %
inst_ret: 5789 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 13.36 %
inst_ret: 5579 M ; Active cycles: 21 G ; Time (TSC): 13 Gticks ; non-halted residency: 13.15 %
inst_ret: 5634 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 13.18 %
inst_ret: 5617 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 13.15 %
inst_ret: 5660 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 13.29 %

The Westmere data is below.

inst_ret: 5383 M ; Active cycles: 21 G ; Time (TSC): 13 Gticks ; non-halted residency: 8.72 %
inst_ret: 5435 M ; Active cycles: 21 G ; Time (TSC): 13 Gticks ; non-halted residency: 8.75 %
inst_ret: 5483 M ; Active cycles: 21 G ; Time (TSC): 13 Gticks ; non-halted residency: 8.83 %
inst_ret: 6403 M ; Active cycles: 24 G ; Time (TSC): 13 Gticks ; non-halted residency: 9.65 %
inst_ret: 5787 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 9.07 %
inst_ret: 5658 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 8.98 %
inst_ret: 5658 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 8.99 %
inst_ret: 5652 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 8.95 %
inst_ret: 5559 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 8.87 %
inst_ret: 5710 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 9.04 %
inst_ret: 5652 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 8.94 %
inst_ret: 12 G ; Active cycles: 35 G ; Time (TSC): 13 Gticks ; non-halted residency: 12.38 %
inst_ret: 5794 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 9.00 %
inst_ret: 5508 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 8.87 %


The first thing I notice is that the systems are mostly idle (cpus halted) in both cases... only about 13% non-halted on nhm and about 9% on wsm.
The interval is the same at 13 Gticks (gigaticks, 10^9 ticks). This is elapsed TSC, not a sum over all the cpus.
The number of instructions retired per interval is about the same, roughly 5.6 giga-instructions, in both cases.
This makes me think the same workload is applied in both cases... that is good. Assuming the workload is the same for both systems, both are executing about the same number of instructions (5.6 giga-instructions) per interval to get that work done.
The number of 'Active cycles' is also about the same on both systems, at about 22 Gticks.
This translates to about 2 cpus being kept busy (22 Gticks of active cycles over an elapsed TSC interval of 13 Gticks is roughly 1.7 cpus' worth).
The average '% non-halted cycles' per interval is about 13% on nhm and about 9% on wsm.
Since both systems are doing the same work and using the same number of clockticks to do it, I would expect nhm %utilization = wsm %utilization * 24 wsm cpus / 16 nhm cpus = 9% * 24/16 = 13.5%.
So everything is making sense.

So it seems that, based on the pcm.x data, both systems are doing the same amount of work per interval and using the same number of clockticks per interval; the wsm system, which has more cpus, therefore shows lower utilization, about 9% versus the nhm utilization of 13%.

Does this make sense?

The next question is: How are you measuring utilization so that you show more utilization on westmere?

On another note, I'm assuming that something on the systems is kicking in and programming the PMUs, and that is what is messing up pcm.
Some Linux kernels/distributions use the PMU counters themselves; I'm not sure which ones do this.
Pat
Roman_D_Intel
Employee
Sridhar,

regarding the hanging: please try the solution in this thread.

Roman
sridhar_basam
Beginner
Yeah, what you presented makes sense. It shows essentially the same usage across the 2 hosts.
Our monitoring software reads the /proc entries for the PIDs in question to calculate the CPU usage, specifically the counter values for user-mode clock ticks and kernel-mode clock ticks. You observe the same behaviour if you use something like the "top" utility, which again just parses those /proc entries. Based on the above, would it be safe to say the kernel in question somehow isn't properly doing process accounting?
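(Concretely, the collector does roughly what the sketch below does: sample utime and stime from /proc/<pid>/stat twice and turn the delta into a percentage. The pid and interval are placeholders.)
[bash]
# Sketch of what the monitoring does (pid and interval are placeholders):
PID=1234                                    # pid of the udp server process
HZ=$(getconf CLK_TCK)                       # kernel clock ticks per second (usually 100)
read u1 s1 < <(awk '{print $14, $15}' /proc/$PID/stat)   # fields 14/15 = utime, stime (ticks)
sleep 5
read u2 s2 < <(awk '{print $14, $15}' /proc/$PID/stat)
echo "scale=2; ($u2 + $s2 - $u1 - $s1) * 100 / ($HZ * 5)" | bc   # %CPU over the 5-second window
[/bash]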
Roman, i tried the workaround you pointed to and that fixed it.
Thanks to the both of you for all of your help so far.
Sridhar
Patrick_F_Intel1
Employee

I'm not saying that what you are seeing is not real.
I just don't see it in the data sent (which covers just 20 or so seconds for each system).
Do you have the sysstat utilities (sar, pidstat, mpstat, iostat) installed?
See site http://sebastien.godard.pagesperso-orange.fr/.
They are pretty easy to install if they are not already on the box.
These utilities will usually show which process is taking up the time.
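For example, even a plain pidstat run under load usually points at the offender (interval and count below are arbitrary):
[bash]
pidstat -u 5 12     # per-process CPU usage: 5-second samples for one minute
[/bash]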
Pat

sridhar_basam
Beginner
Yeah, we do have sysstat installed on all of our hosts. The processes are bound to specific CPU cores for efficiency. I have pasted the average CPU usage from sar over a 10-minute period for the CPU to which one of these processes is bound.
Fields are:
02:30:01 PM    CPU   %user   %nice   %system   %iowait   %steal   %idle
Nehalem:
02:30:01 PM    2     5.76    0.00    1.02      0.00      0.00     93.21
Westmere:
02:30:01 PM    2     19.82   0.00    6.63      0.00      0.00     73.54
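(The rows above come from the standard sar history; something like the sketch below is how the same view can be pulled, with the core number and pid as placeholders.)
[bash]
# How the per-CPU numbers above can be pulled (core number and pid are placeholders):
taskset -pc 1234                        # confirm which core(s) pid 1234 is bound to
sar -P 2 -f /var/log/sa/sa$(date +%d)   # CPU-2 utilization history from today's sar file
pidstat -u -p 1234 60 10                # ten 1-minute samples for the process itself
[/bash]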
Sridhar
Patrick_F_Intel1
Employee
I use a script like the one below to collect data.
It will collect data for 60 seconds.
If you want to collect the data (and if you can send the *.dat and *.tmp files back in a .tar.gz file), I have another script to post-process the files and chart them in an Excel spreadsheet... assuming you have Excel.
Pat


[bash]
#!/bin/bash
rm -f sar_data.dat
rm -f sar_data.tmp
rm -f iostat_data.dat
rm -f pidstat_data.dat
rm -f vmstat_data.dat

SAMPLING_INTRVAL=1  # collect a sample every x number of seconds
ELAPSED_INTRVAL=60 # interval (in seconds) over which we will collect samples

#sar -bBqrRuvW -o sar_data.dat $SAMPLING_INTRVAL 0 >/dev/null 2>&1 &  # older sar uses this format
sar -bBqrRuvW -m ALL -o sar_data.dat $SAMPLING_INTRVAL >/dev/null 2>&1 &   # record binary sar data in the background
pid_sar=$!
iostat -t -d -k -x $SAMPLING_INTRVAL > iostat_data.dat &                   # extended per-device I/O statistics
pid_ios=$!
pidstat -d -I -u -w -r $SAMPLING_INTRVAL > pidstat_data.dat &              # per-process CPU, I/O, memory and context-switch stats
pid_pid=$!
date +"date: %Y %m %d %H:%M:%S" > vmstat_data.dat
date +"date_sec: %s" >> vmstat_data.dat
echo interval: $SAMPLING_INTRVAL >> vmstat_data.dat
#vmstat -n $SAMPLING_INTRVAL >> vmstat_data.dat &
#pid_vms=$!
date +"EMON_TIME_BEGIN %s"
echo EMON_DATA_BEGIN
sleep $ELAPSED_INTRVAL
date +"EMON_TIME_END %s"
echo EMON_DATA_END
echo __DATA_END__
kill $pid_sar
kill $pid_ios
kill $pid_pid
#kill $pid_vms
sleep 1
sadf -p -P ALL sar_data.dat -- -wBbrRuWq -P ALL -I SUM -n DEV > sar_data.tmp   # convert the binary sar data to parseable text




[/bash]