I know you already know this, but for others' info...
The X5560 system is Nehalem-EP: 4 cores, 2.8 GHz, 8 MB LLC.
The X5660 system is Westmere-EP: 6 cores, 2.8 GHz, 12 MB LLC.
So the Westmere-based system should perform better.
I've seen some software behave poorly when the core count is not a power of 2.
Also, there are now 50% more CPUs on the system, which can expose poor scaling in locking schemes.
Have you tried booting the Westmere box with just 4 cores per socket?
If the software goes back to the 4-5% CPU usage, then we can probably assume the issue is not the box itself but is related to going from 16 CPUs to 24 CPUs.
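If you can't easily change it in the BIOS, here is a rough sketch of one way to try it at runtime; the CPU numbers are only an assumption, so check the topology files first to see which logical CPUs belong to which socket/core.
[bash]# Sketch only: take the "extra" logical CPUs offline so 16 remain.
# The cpu16-cpu23 numbering is an assumption; check
# /sys/devices/system/cpu/cpu*/topology/{physical_package_id,core_id} first.
for cpu in $(seq 16 23); do
    echo 0 > /sys/devices/system/cpu/cpu${cpu}/online
done
# Alternatively, boot with "maxcpus=16" on the kernel command line.
[/bash]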
Also, how are you measuring performance?
How many units of work is each system able to do per second?
When you say "oprofile to see if there is a different pattern but see a similar pattern between the 2 systems, just that the newer system is using a lot more cycles"... this leaves me scratching my head...
Can you expand on 'similar pattern but new system is using a lot more cycles'?
Thanks
Pat
Let me see if I understand correctly.
You collected the oprofile data from both systems and:
1) you ran for roughly the same amount of time,
2) you had roughly the same load level (in terms of pkts/sec) for both systems,
So the 2 systems should be doing the same amount of work during the interval.
3) the profile of the oprofile data is the same (that is, the relative ratio of time spent in routines is about the same)
I assume the percentages above (like 10% in symbol1) mean that 10% of the total clockticks are in symbol1.
4) the number of clockticks in each routine is about 4x higher on westmere
5) the number of instructions retired in each routine is about 4x higher on westmere
Are these statements more or less right?
If so, it really makes me think that the software is somehow spinning looking for work.
That is, the amount of work getting done (total packets) is the same but it takes 4x more instructions to get the work done.
Would you mind running the Intel PCM utility on both systems under load and sending us the output?
You can send the output by private email if you are worried about proprietary information.
PCM is available at http://software.intel.com/en-us/articles/intel-performance-counter-monitor/ and is easy to build on linux (just unzip it and do 'make').
To run it, do './pcm.x 1' or, during the load, do './pcm.x 5 > nhm.txt' on one box and './pcm.x 5 > wsm.txt' on the other.
Pat
Running PCM would give us a measurement on both systems with the same tool.
It would help us get an overall view of CPU, QPI and memory performance.
The overhead is very low (negligible).
You might need to do a 'modprobe msr' before running pcm.x (if the msr module isn't built into the kernel).
So you could run './pcm.x 1 > box1.txt' on each system when they have about the same load.
Then after 30 seconds or so hit control-c to stop the data collection.
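Putting that together, the collection on each box would look something like this (box1.txt is just a placeholder name):
[bash]# Sketch of the collection steps described above; run on each box under load.
modprobe msr            # only needed if the msr module isn't built into the kernel
./pcm.x 1 > box1.txt    # 1-second samples; use a different output file on each box
# Let it run for ~30 seconds, then hit Ctrl-C to stop the collection.
[/bash]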
Pat
PCM complained about the PMU because a different tool might have been using it at the same time (or was no longer using it but did not clean up the PMU state on exit). Please check whether oprofile could be the reason for that. You might try rebooting the system without oprofile (and without its kernel modules, if possible) and then try PCM again.
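If a full reboot is inconvenient, something along these lines might be enough to get oprofile out of the way (depending on the oprofile version installed):
[bash]# Stop the oprofile daemon and unload its kernel module (older opcontrol-based oprofile).
opcontrol --shutdown
opcontrol --deinit      # unloads the oprofile kernel module
[/bash]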
Thanks,
Roman
Sorry it took longer than anticipated. Roman, I had to reboot the hosts to be able to reset the PMU state. After successfully running ./pcm.x a couple of times in succession, it has again stopped working: it prints the banner up to the Intel copyright notice and then just hangs.
Patrick, I collected the output of 'pcm.x 5' for about 60 seconds on both nodes.
thanks,
Sridhar
Thanks for the data.
I extracted just the lines with 'Gticks' (the lines starting with 'Instructions retired:').
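(Roughly what I did to extract them, assuming the two PCM output files are named nhm.txt and wsm.txt:)
[bash]# Keep only the per-interval summary lines from the PCM output.
grep '^Instructions retired:' nhm.txt
grep '^Instructions retired:' wsm.txt
[/bash]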
Here is the nhm data (I shortened some of the strings to make each line fit):
inst_ret: 5490 M ; Active cycles: 21 G ; Time (TSC): 13 Gticks ; non-halted residency: 12.92 %
inst_ret: 5545 M ; Active cycles: 21 G ; Time (TSC): 13 Gticks ; non-halted residency: 13.04 %
inst_ret: 6026 M ; Active cycles: 23 G ; Time (TSC): 13 Gticks ; non-halted residency: 13.74 %
inst_ret: 6836 M ; Active cycles: 30 G ; Time (TSC): 13 Gticks ; non-halted residency: 16.04 %
inst_ret: 5958 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 13.57 %
inst_ret: 6122 M ; Active cycles: 23 G ; Time (TSC): 13 Gticks ; non-halted residency: 13.60 %
inst_ret: 5741 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 13.30 %
inst_ret: 5707 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 13.23 %
inst_ret: 5975 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 13.46 %
inst_ret: 5789 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 13.36 %
inst_ret: 5579 M ; Active cycles: 21 G ; Time (TSC): 13 Gticks ; non-halted residency: 13.15 %
inst_ret: 5634 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 13.18 %
inst_ret: 5617 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 13.15 %
inst_ret: 5660 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 13.29 %
inst_ret: 5383 M ; Active cycles: 21 G ; Time (TSC): 13 Gticks ; non-halted residency: 8.72 %
inst_ret: 5435 M ; Active cycles: 21 G ; Time (TSC): 13 Gticks ; non-halted residency: 8.75 %
inst_ret: 5483 M ; Active cycles: 21 G ; Time (TSC): 13 Gticks ; non-halted residency: 8.83 %
inst_ret: 6403 M ; Active cycles: 24 G ; Time (TSC): 13 Gticks ; non-halted residency: 9.65 %
inst_ret: 5787 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 9.07 %
inst_ret: 5658 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 8.98 %
inst_ret: 5658 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 8.99 %
inst_ret: 5652 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 8.95 %
inst_ret: 5559 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 8.87 %
inst_ret: 5710 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 9.04 %
inst_ret: 5652 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 8.94 %
inst_ret: 12 G ; Active cycles: 35 G ; Time (TSC): 13 Gticks ; non-halted residency: 12.38 %
inst_ret: 5794 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 9.00 %
inst_ret: 5508 M ; Active cycles: 22 G ; Time (TSC): 13 Gticks ; non-halted residency: 8.87 %
The first thing I notice is that both systems are mostly idle (CPUs halted)... only about 9-13% non-halted.
The interval is the same at 13 Gticks (gigaticks, 10^9 ticks). This is an elapsed TSC value, not a sum over all the CPUs.
The number of instructions retired per interval is about the same in both cases, at roughly 5.6 giga-instructions.
This makes me think the same workload is applied in both cases... that is good. Assuming the workload is the same for both systems, we are executing about the same number of instructions (5.6 giga-instructions) per interval to get the work done.
The number of 'Active cycles' is also about the same on both systems, at about 22 Gticks.
This translates to roughly 2 CPUs being kept busy (22 Gticks of active cycles / 13 Gticks of elapsed TSC ≈ 1.7).
The average '%non-halted cycles' per interval is about 13% on nhm and about 9% on wsm.
Since both systems are doing the same work and using the same number of clockticks to do it, I would expect nhm_%utilization = wsm_%utilization * 24_wsm_cpus / 16_nhm_cpus = 9% * 24/16 = 13.5%.
So everything is making sense.
Based on the pcm.x data, both systems are doing the same amount of work per interval and using the same number of clockticks per interval, and the wsm system, with its extra CPUs, shows lower utilization (about 9%) than the nhm system (about 13%).
Does this make sense?
The next question is: how are you measuring utilization such that you see higher utilization on Westmere?
On another note, I'm assuming that something on the systems is kicking in and programming the PMUs, and that is what is messing up pcm.
Some Linux kernels/distributions use the PMU counters themselves; I'm not sure which ones do this.
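One guess (and it is only a guess) is the kernel NMI watchdog, which on some kernels takes over a PMU counter. Something like this would show whether it is enabled and let you turn it off temporarily before retrying pcm.x:
[bash]# Guess only: check whether the NMI watchdog is holding a PMU counter
# (the knob below exists on many, but not all, kernels).
cat /proc/sys/kernel/nmi_watchdog        # 1 = enabled
echo 0 > /proc/sys/kernel/nmi_watchdog   # temporarily disable, then retry ./pcm.x
[/bash]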
Pat
I'm not saying that what you are seeing is not real.
I just don't see it in the data sent (which covers just 20 or so seconds for each system).
Do you have the sysstat utilities (sar, pidstat, mpstat, iostat) installed?
See site http://sebastien.godard.pagesperso-orange.fr/.
They are pretty easy to install if they are not already on the box.
These utilities will usually show which process is taking up the time.
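For example, while the load is running, something like:
[bash]# Quick look at where the time is going (5-second samples, 6 samples each).
mpstat -P ALL 5 6      # per-cpu utilization
pidstat -u 5 6         # per-process cpu usage
sar -u 5 6             # overall cpu utilization for comparison
[/bash]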
Pat
The script below will collect data for 60 seconds.
If you want to collect the data (and can send the *.dat and *.tmp files back in a .tar.gz file), I have another script to post-process the files and chart them in an Excel spreadsheet... assuming you have Excel.
Pat
[bash]#!/bin/bash
rm -f sar_data.dat
rm -f sar_data.tmp
rm -f iostat_data.dat
rm -f pidstat_data.dat
rm -f vmstat_data.dat
SAMPLING_INTRVAL=1    # collect a sample every x number of seconds
ELAPSED_INTRVAL=60    # interval (in seconds) over which we will collect samples
#sar -bBqrRuvW -o sar_data.dat $SAMPLING_INTRVAL 0 >/dev/null 2>&1 &   # older sar uses this format
sar -bBqrRuvW -m ALL -o sar_data.dat $SAMPLING_INTRVAL >/dev/null 2>&1 &
pid_sar=$!
iostat -t -d -k -x $SAMPLING_INTRVAL > iostat_data.dat &
pid_ios=$!
pidstat -d -I -u -w -r $SAMPLING_INTRVAL > pidstat_data.dat &
pid_pid=$!
date +"date: %Y %m %d %H:%M:%S" > vmstat_data.dat
date +"date_sec: %s" >> vmstat_data.dat
echo interval: $SAMPLING_INTRVAL >> vmstat_data.dat
#vmstat -n $SAMPLING_INTRVAL >> vmstat_data.dat &
#pid_vms=$!
date +"EMON_TIME_BEGIN %s"
echo EMON_DATA_BEGIN
sleep $ELAPSED_INTRVAL
date +"EMON_TIME_END %s"
echo EMON_DATA_END
echo __DATA_END__
kill $pid_sar
kill $pid_ios
kill $pid_pid
#kill $pid_vms
sleep 1
sadf -p -P ALL sar_data.dat -- -wBbrRuWq -P ALL -I SUM -n DEV > sar_data.tmp
[/bash]
