Nehalem-EX PCM

Kurt_Miller
Beginner
I'm looking for help interpreting the PCM output for our application running on Linux 2.6.18-194.11.4.el5. The application is multi-threaded. When we run a few copies of the application, each process uses over 100% CPU as reported by top; however, when we run 64 copies, each process runs at < 50% CPU and the total CPU utilization is around 50%. We suspect that we are hitting the memory bandwidth limit or some other hardware resource limit. I'm having difficulty interpreting the PCM output below.

This is a 10 second sample. Am I hitting memory bandwidth or other hardware limits?


Intel Performance Counter Monitor V2.1 (2012-05-31 14:40:57 +0200 ID=2d18fd5)

Copyright (c) 2009-2012 Intel Corporation

Num (logical) cores: 64
Num sockets: 4
Threads per core: 2
Core PMU (perfmon) version: 3
Number of core PMU generic (programmable) counters: 4
Width of generic (programmable) counters: 48 bits
Number of core PMU fixed counters: 3
Width of fixed counters: 48 bits
Nominal core frequency: 2266666661 Hz
Max QPI link speed: 12.8 GBytes/second (6.4 GT/second)

Detected processor(s) with Intel microarchitecture codename Nehalem-EX


EXEC : instructions per nominal CPU cycle
IPC : instructions per CPU cycle
FREQ : relation to nominal CPU frequency='unhalted clock ticks'/'invariant timer ticks' (includes Intel Turbo Boost)
AFREQ : relation to nominal CPU frequency while in active state (not in power-saving C state)='unhalted clock ticks'/'invariant timer ticks while in C0-state' (includes Intel Turbo Boost)
L3MISS: L3 cache misses
L2MISS: L2 cache misses (including other core's L2 cache *hits*)
L3HIT : L3 cache hit ratio (0.00-1.00)
L2HIT : L2 cache hit ratio (0.00-1.00)
L3CLK : ratio of CPU cycles lost due to L3 cache misses (0.00-1.00), in some cases could be >1.0 due to a higher memory latency
L2CLK : ratio of CPU cycles lost due to missing L2 cache but still hitting L3 cache (0.00-1.00)
READ : bytes read from memory controller (in GBytes)
WRITE : bytes written to memory controller (in GBytes)
TEMP : Temperature reading in 1 degree Celsius relative to the TjMax temperature (thermal headroom): 0 corresponds to the max temperature


Core (SKT) | EXEC | IPC | FREQ | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3CLK | L2CLK | READ | WRITE | TEMP

0 0 0.27 0.36 0.74 1.00 29 M 58 M 0.49 0.36 0.32 0.07 N/A N/A 26
1 1 0.76 1.04 0.73 1.00 17 M 26 M 0.33 0.34 0.19 0.02 N/A N/A 18
2 2 0.75 1.01 0.74 1.00 17 M 25 M 0.34 0.34 0.18 0.02 N/A N/A 26
3 3 0.77 1.02 0.76 1.00 17 M 26 M 0.33 0.34 0.18 0.02 N/A N/A 16
4 0 0.54 0.92 0.59 1.00 13 M 21 M 0.38 0.32 0.18 0.02 N/A N/A 20
5 1 0.74 1.00 0.74 1.00 17 M 25 M 0.33 0.34 0.18 0.02 N/A N/A 17
6 2 0.75 1.03 0.73 1.00 17 M 26 M 0.34 0.34 0.19 0.02 N/A N/A 23
7 3 0.76 1.03 0.74 1.00 17 M 25 M 0.33 0.34 0.18 0.02 N/A N/A 18
8 0 0.53 0.89 0.60 1.00 12 M 20 M 0.39 0.35 0.17 0.02 N/A N/A 22
9 1 0.75 1.01 0.74 1.00 17 M 26 M 0.33 0.34 0.18 0.02 N/A N/A 18
10 2 0.75 1.03 0.72 1.00 17 M 25 M 0.34 0.34 0.18 0.02 N/A N/A 25
11 3 0.78 1.04 0.76 1.00 17 M 26 M 0.33 0.34 0.18 0.02 N/A N/A 17
12 0 0.50 0.84 0.59 1.00 13 M 21 M 0.38 0.32 0.18 0.02 N/A N/A 22
13 1 0.78 1.03 0.75 1.00 17 M 26 M 0.33 0.34 0.18 0.02 N/A N/A 18
14 2 0.74 1.01 0.73 1.00 16 M 25 M 0.34 0.34 0.18 0.02 N/A N/A 29
15 3 0.80 1.06 0.75 1.00 17 M 26 M 0.33 0.34 0.19 0.02 N/A N/A 19
16 0 0.49 0.86 0.57 1.00 12 M 20 M 0.38 0.32 0.17 0.02 N/A N/A 22
17 1 0.79 1.03 0.77 1.00 17 M 26 M 0.33 0.35 0.18 0.02 N/A N/A 12
18 2 0.78 1.03 0.75 1.00 17 M 26 M 0.34 0.34 0.19 0.02 N/A N/A 22
19 3 0.79 1.06 0.74 1.00 17 M 26 M 0.34 0.34 0.18 0.02 N/A N/A 11
20 0 0.55 0.92 0.60 1.00 13 M 21 M 0.39 0.34 0.17 0.02 N/A N/A 20
21 1 0.79 1.04 0.76 1.00 17 M 26 M 0.33 0.34 0.18 0.02 N/A N/A 13
22 2 0.74 1.00 0.74 1.00 17 M 26 M 0.34 0.34 0.18 0.02 N/A N/A 21
23 3 0.77 1.03 0.74 1.00 17 M 25 M 0.33 0.34 0.18 0.02 N/A N/A 13
24 0 0.54 0.94 0.58 1.00 13 M 21 M 0.38 0.33 0.18 0.02 N/A N/A 22
25 1 0.75 1.01 0.74 1.00 17 M 25 M 0.34 0.34 0.18 0.02 N/A N/A 16
26 2 0.74 1.02 0.73 1.00 17 M 26 M 0.34 0.34 0.19 0.02 N/A N/A 22
27 3 0.76 1.02 0.74 1.00 17 M 26 M 0.33 0.34 0.19 0.02 N/A N/A 11
28 0 0.55 0.93 0.59 1.00 14 M 22 M 0.38 0.34 0.19 0.02 N/A N/A 24
29 1 0.74 1.00 0.73 1.00 16 M 24 M 0.33 0.34 0.18 0.02 N/A N/A 17
30 2 0.76 1.02 0.75 1.00 17 M 27 M 0.34 0.34 0.19 0.02 N/A N/A 22
31 3 0.86 1.07 0.80 1.00 18 M 27 M 0.34 0.35 0.18 0.02 N/A N/A 12
32 0 0.27 0.36 0.74 1.00 13 M 22 M 0.39 0.32 0.15 0.02 N/A N/A 25
33 1 0.76 1.04 0.73 1.00 17 M 26 M 0.33 0.34 0.19 0.02 N/A N/A 18
34 2 0.77 1.04 0.74 1.00 17 M 26 M 0.34 0.35 0.19 0.02 N/A N/A 27
35 3 0.83 1.09 0.76 1.00 18 M 27 M 0.33 0.34 0.19 0.02 N/A N/A 16
36 0 0.60 1.02 0.59 1.00 14 M 22 M 0.37 0.33 0.19 0.02 N/A N/A 21
37 1 0.80 1.09 0.74 1.00 17 M 26 M 0.33 0.34 0.19 0.02 N/A N/A 17
38 2 0.74 1.02 0.73 1.00 17 M 26 M 0.34 0.33 0.19 0.02 N/A N/A 23
39 3 0.79 1.06 0.74 1.00 17 M 26 M 0.33 0.34 0.18 0.02 N/A N/A 17
40 0 0.63 1.05 0.60 1.00 14 M 24 M 0.40 0.38 0.19 0.03 N/A N/A 22
41 1 0.80 1.07 0.74 1.00 18 M 26 M 0.33 0.34 0.19 0.02 N/A N/A 17
42 2 0.76 1.04 0.72 1.00 17 M 25 M 0.34 0.34 0.19 0.02 N/A N/A 25
43 3 0.79 1.04 0.76 1.00 17 M 26 M 0.33 0.34 0.19 0.02 N/A N/A 17
44 0 0.61 1.02 0.59 1.00 14 M 23 M 0.38 0.35 0.19 0.03 N/A N/A 22
45 1 0.80 1.07 0.75 1.00 17 M 26 M 0.33 0.34 0.19 0.02 N/A N/A 17
46 2 0.78 1.07 0.73 1.00 17 M 26 M 0.34 0.34 0.19 0.02 N/A N/A 29
47 3 0.77 1.02 0.75 1.00 17 M 26 M 0.33 0.34 0.18 0.02 N/A N/A 19
48 0 0.60 1.05 0.57 1.00 14 M 22 M 0.37 0.33 0.20 0.02 N/A N/A 22
49 1 0.86 1.11 0.77 1.00 18 M 27 M 0.34 0.35 0.18 0.02 N/A N/A 13
50 2 0.79 1.05 0.75 1.00 18 M 27 M 0.33 0.33 0.19 0.02 N/A N/A 22
51 3 0.76 1.03 0.74 1.00 17 M 26 M 0.33 0.34 0.19 0.02 N/A N/A 12
52 0 0.62 1.04 0.60 1.00 13 M 22 M 0.38 0.33 0.18 0.02 N/A N/A 21
53 1 0.82 1.09 0.75 1.00 17 M 26 M 0.33 0.34 0.18 0.02 N/A N/A 14
54 2 0.78 1.06 0.74 1.00 17 M 27 M 0.34 0.34 0.19 0.02 N/A N/A 21
55 3 0.80 1.08 0.74 1.00 17 M 26 M 0.33 0.34 0.18 0.02 N/A N/A 12
56 0 0.57 0.99 0.58 1.00 13 M 21 M 0.37 0.32 0.19 0.02 N/A N/A 22
57 1 0.81 1.09 0.74 1.00 17 M 26 M 0.34 0.34 0.19 0.02 N/A N/A 16
58 2 0.75 1.03 0.73 1.00 17 M 26 M 0.34 0.35 0.19 0.02 N/A N/A 22
59 3 0.78 1.05 0.74 1.00 17 M 26 M 0.33 0.34 0.19 0.02 N/A N/A 11
60 0 0.58 0.99 0.59 1.00 14 M 22 M 0.38 0.33 0.19 0.02 N/A N/A 24
61 1 0.81 1.11 0.73 1.00 17 M 26 M 0.33 0.34 0.19 0.02 N/A N/A 17
62 2 0.78 1.05 0.75 1.00 18 M 27 M 0.34 0.34 0.19 0.02 N/A N/A 22
63 3 0.86 1.08 0.80 1.00 18 M 27 M 0.33 0.35 0.18 0.02 N/A N/A 12
-------------------------------------------------------------------------------------------------------------------
SKT 0 0.53 0.87 0.61 1.00 236 M 392 M 0.40 0.34 0.19 0.03 39.83 23.33 N/A
SKT 1 0.79 1.05 0.75 1.00 281 M 422 M 0.33 0.34 0.18 0.02 32.67 20.33 N/A
SKT 2 0.76 1.03 0.74 1.00 280 M 425 M 0.34 0.34 0.19 0.02 31.92 19.51 N/A
SKT 3 0.79 1.05 0.75 1.00 284 M 426 M 0.33 0.34 0.18 0.02 33.07 20.12 N/A
-------------------------------------------------------------------------------------------------------------------
TOTAL * 0.72 1.01 0.71 1.00 1084 M 1669 M 0.35 0.34 0.19 0.02 137.73 83.44 N/A

Instructions retired: 1057 G ; Active cycles: 1049 G ; Time (TSC): 23 Gticks ; C0 (active,non-halted) core residency: 71.08 %

C3 core residency: 0.81 %; C6 core residency: 0.00 %; C7 core residency: 0.00 %
C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %

PHYSICAL CORE IPC : 2.02 => corresponds to 50.39 % utilization for cores in active state
Instructions per nominal CPU cycle: 1.43 => corresponds to 35.82 % core utilization over time interval

Intel QPI data traffic estimation in bytes (data traffic coming to CPU/socket through QPI links):

QPI0 QPI1 QPI2 QPI3 | QPI0 QPI1 QPI2 QPI3
----------------------------------------------------------------------------------------------
SKT 0 18 G 126 M 8047 M 8397 M | 14% 0% 6% 6%
SKT 1 8906 M 9432 M 19 G 9068 M | 6% 7% 15% 6%
SKT 2 56 M 9085 M 8580 M 18 G | 0% 6% 6% 14%
SKT 3 9357 M 8844 M 19 G 8917 M | 7% 6% 15% 6%
----------------------------------------------------------------------------------------------
Total QPI incoming data traffic: 165 G QPI data traffic/Memory controller traffic: 0.70

Intel QPI traffic estimation in bytes (data and non-data traffic outgoing from CPU/socket through QPI links):

QPI0 QPI1 QPI2 QPI3 | QPI0 QPI1 QPI2 QPI3
----------------------------------------------------------------------------------------------
SKT 0 23 G 1100 M 20 G 20 G | 18% 0% 15% 16%
SKT 1 20 G 34 G 5404 M 20 G | 15% 26% 4% 16%
SKT 2 1027 M 20 G 20 G 23 G | 0% 15% 16% 17%
SKT 3 21 G 33 G 5840 M 21 G | 16% 25% 4% 16%
----------------------------------------------------------------------------------------------
Total QPI outgoing data and non-data traffic: 295 G
Roman_D_Intel
Employee
Hi Kurt Miller,
Looking at the memory bandwidth metrics in the output (the READ and WRITE columns), the hardware memory bandwidth limit is not being hit: over the 10-second sample you have at most about 4 GByte/sec of reads and about 2 GByte/sec of writes per socket.
With 64 processes running, top reports < 50% CPU for each process, which indicates that the processes are not using all of the available CPU time. Most likely your 64 processes are waiting on disk I/O or some other external bottleneck (network I/O, locks, etc.). You can try the Locks and Waits analysis in Intel VTune Amplifier XE to find the source of the waiting.
--
Roman
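
For reference, the per-socket rates quoted above can be reproduced from the SKT rows of the PCM output. This is only a minimal sketch: it assumes the READ and WRITE columns are totals for the whole sample and that the sample really lasted 10 seconds, as stated in the question. The function name `bandwidth_gb_per_s` is made up for illustration and is not part of PCM.

```python
# Per-socket READ/WRITE totals in GBytes, copied from the SKT rows of the
# PCM output in the question.
socket_traffic_gb = {
    0: (39.83, 23.33),
    1: (32.67, 20.33),
    2: (31.92, 19.51),
    3: (33.07, 20.12),
}

# Assumption: the sample length is the 10 seconds stated in the question.
SAMPLE_SECONDS = 10.0


def bandwidth_gb_per_s(total_gb, seconds=SAMPLE_SECONDS):
    """Convert a traffic total over the sample into an average rate."""
    return total_gb / seconds


# Average read and write bandwidth per socket.
rates = {
    skt: (bandwidth_gb_per_s(read_gb), bandwidth_gb_per_s(write_gb))
    for skt, (read_gb, write_gb) in socket_traffic_gb.items()
}

for skt, (read_rate, write_rate) in sorted(rates.items()):
    print(f"SKT {skt}: read {read_rate:.2f} GB/s, write {write_rate:.2f} GB/s")

# The busiest socket sets the "max per socket" figures.
max_read = max(read_rate for read_rate, _ in rates.values())
max_write = max(write_rate for _, write_rate in rates.values())
print(f"max per socket: read {max_read:.2f} GB/s, write {max_write:.2f} GB/s")
```

These are averages over the sample, so short bursts above these rates would not show up here.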

Kurt_Miller
Beginner
Hi Roman,

Thank you for your reply. It is good to know we're not hitting the memory bandwidth limit. It means we can make it faster once the bottleneck has been identified.

I have used VTune on the application and it appears that it doesn't capture our I/O in the Locks & Wait analysis. Our I/O is via /dev/sgXX using ioctl(fd, SG_IO, &io_hdr). I think there is a bug in VTune where it doesn't capture wait time in ioctl(). I'm using vtune_amplifier_xe_2011_update9.tar.gz.

Regards,
-Kurt
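
One way to cross-check the wait time independently of the profiler would be to time the blocking calls directly in the application. The following is only a rough Python sketch of that idea, not something taken from VTune or from the application itself. `WaitStats` and `timed_call` are hypothetical names, and `time.sleep` stands in for the blocking ioctl(fd, SG_IO, &io_hdr) call.

```python
import time


class WaitStats:
    """Accumulates wall-clock time spent inside wrapped blocking calls."""

    def __init__(self):
        self.calls = 0
        self.total_seconds = 0.0

    def timed_call(self, fn, *args, **kwargs):
        """Call fn and add the elapsed wall-clock time to the running total."""
        start = time.monotonic()
        try:
            return fn(*args, **kwargs)
        finally:
            # Count the call even if it raised, so failed I/O is included.
            self.total_seconds += time.monotonic() - start
            self.calls += 1


stats = WaitStats()
for _ in range(3):
    # Stand-in for the blocking SG_IO ioctl in the real application.
    stats.timed_call(time.sleep, 0.02)

print(f"{stats.calls} calls, {stats.total_seconds:.3f} s spent waiting")
```

Comparing the accumulated wait time per thread against the wall-clock run time would give a rough idea of how much of the missing CPU time is spent blocked in I/O.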