Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

[PCM] QPI traffic reported all zeros

Thomas_W_Intel
Employee

Zheng L. posted:

Hello everyone. I am trying to get some data from an Intel Xeon E5-2687W using PCM. For our project, we are mainly interested in how multiple threads using QPI to read from a PCI card may affect the system. However, the incoming QPI data traffic is always reported as 0, and the outgoing data traffic is always 0 too. I also get some strange readings. Is it possible that the readings are wrong?

I also have two screenshots, but the forum will not let me upload the pictures. Is there any way I can upload them so someone can help me analyze this?

Thank you very much.

Could it be that you have a second instance of PCM running? It might also be one that was not shut down cleanly.

 

Roman_D_Intel
Employee

A screenshot of the complete output would be very helpful. Alternatively, copy and paste it into the comment text.

--

Roman

McCalpinJohn
Honored Contributor III

If you are running a recent version of Linux and using the "perf" interface to the QPI counters, there is an error in the definition of the predefined QPI events for cacheable and non-cacheable data blocks transferred.  In both cases, the standard distributions fail to set the "extra bit" that is needed for those events.  Fortunately, it can be set manually using the "perf" interface.

The reference for the Linux kernel patch is https://lkml.org/lkml/2013/8/2/482
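As a quick check of whether your kernel already carries the fix, you can inspect the predefined alias in sysfs (I am going from memory on the exact path and output format here):

                 # cat /sys/bus/event_source/devices/uncore_qpi_0/events/drs_data

A fixed kernel should report event=0x102 (i.e., with the extra bit set); if it shows event=0x2, the manual workaround below is needed.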

To set the bit manually, note that the predefined event programmed with the command:
                 # perf stat -a -e "uncore_qpi_0/event=drs_data/"
is the same as
                 # perf stat -a -e "uncore_qpi_0/event=0x02,umask=0x08/"
but it should be
                 # perf stat -a -e "uncore_qpi_0/event=0x102,umask=0x08/"

This last command returns the expected number of data cache lines transferred when I run the STREAM benchmark in a cross-socket configuration.  The same change to the event number causes the "ncb_data" event to return non-zero values as well, but I don't have a test case for that event. 
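For example, a complete measurement of cross-socket data traffic while a memory benchmark runs might look like this (./stream here stands for whatever benchmark you use; the node numbers and the 10-second window are arbitrary):

                 # numactl --cpunodebind=0 --membind=1 ./stream &
                 # perf stat -a -e "uncore_qpi_0/event=0x102,umask=0x08/" -- sleep 10

If I remember correctly, the event counts 8-byte data flits, so a 64-byte cache line transferred across the link contributes 8 counts.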

Zheng_Luo
Beginner

Thank you very much. I have already uploaded the screenshots; I hope those will help.

Zheng_Luo
Beginner

Roman Dementiev (Intel) wrote:

A screenshot of the complete output would be very helpful. Alternatively, copy and paste it into the comment text.

Hello Roman,

I have already uploaded the screenshot; I hope you can help me with my problem.

Zheng_Luo
Beginner

Thomas Willhalm (Intel) wrote:

Could it be that you have a second instance of PCM running? It might also be one that was not shut down cleanly.

Hello, I have already uploaded the screenshot; I hope that helps. I am sure that only one instance of the program was running when I took the screenshot.

Zheng_Luo
Beginner

John D. McCalpin wrote:

If you are running a recent version of Linux and using the "perf" interface to the QPI counters, there is an error in the definition of the predefined QPI events for cacheable and non-cacheable data blocks transferred. [...]

Hello John,

I have not tried perf yet; I have only tried Intel PCM. I will try perf to get some results too.

Roman_D_Intel
Employee

Zheng L.,

could you please post the whole output, including all the messages printed at the beginning right after the program invocation? There should be a couple of diagnostic messages that help in understanding the issue.

Thanks,

Roman

Zheng_Luo
Beginner

Roman Dementiev (Intel) wrote:

could you please post the whole output, including all the messages printed at the beginning right after the program invocation?

[root@cybermech IntelPerformanceCounterMonitorV2.5.1]# ./pcm.x 10

 Intel(r) Performance Counter Monitor V2.5.1 (2013-06-25 13:44:03 +0200 ID=76b6d1f)

 Copyright (c) 2009-2012 Intel Corporation

Num logical cores: 16
Num sockets: 2
Threads per core: 1
Core PMU (perfmon) version: 3
Number of core PMU generic (programmable) counters: 8
Width of generic (programmable) counters: 48 bits
Number of core PMU fixed counters: 3
Width of fixed counters: 48 bits
Nominal core frequency: 3100000000 Hz
Package thermal spec power: 150 Watt; Package minimum power: 65 Watt; Package maximum power: 230 Watt; 
ERROR: Requested bus number 64 is larger than the max bus number 63
Can not access SNB-EP (Jaketown) PCI configuration space. Access to uncore counters (memory and QPI bandwidth) is disabled.
You must be root to access these SNB-EP counters in PCM. 
Number of PCM instances: 2

Detected Intel(R) Xeon(R) CPU E5-2687W 0 @ 3.10GHz "Intel(r) microarchitecture codename Sandy Bridge-EP/Jaketown"

 EXEC  : instructions per nominal CPU cycle
 IPC   : instructions per CPU cycle
 FREQ  : relation to nominal CPU frequency='unhalted clock ticks'/'invariant timer ticks' (includes Intel Turbo Boost)
 AFREQ : relation to nominal CPU frequency while in active state (not in power-saving C state)='unhalted clock ticks'/'invariant timer ticks while in C0-state'  (includes Intel Turbo Boost)
 L3MISS: L3 cache misses 
 L2MISS: L2 cache misses (including other core's L2 cache *hits*) 
 L3HIT : L3 cache hit ratio (0.00-1.00)
 L2HIT : L2 cache hit ratio (0.00-1.00)
 L3CLK : ratio of CPU cycles lost due to L3 cache misses (0.00-1.00), in some cases could be >1.0 due to a higher memory latency
 L2CLK : ratio of CPU cycles lost due to missing L2 cache but still hitting L3 cache (0.00-1.00)
 READ  : bytes read from memory controller (in GBytes)
 WRITE : bytes written to memory controller (in GBytes)
 TEMP  : Temperature reading in 1 degree Celsius relative to the TjMax temperature (thermal headroom): 0 corresponds to the max temperature


 Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3CLK | L2CLK  | READ  | WRITE | TEMP

   0    0     0.12   1.43   0.08    1.00     110 K   4508 K    0.98    0.47    0.01    0.07     N/A     N/A     36
   1    0     0.06   1.29   0.05    1.00      96 K   3233 K    0.97    0.42    0.01    0.08     N/A     N/A     33
   2    0     0.02   1.20   0.01    1.00      22 K    894 K    0.97    0.46    0.01    0.08     N/A     N/A     34
   3    0     0.13   1.78   0.07    1.00      46 K   1806 K    0.97    0.68    0.00    0.03     N/A     N/A     32
   4    0     0.00   0.77   0.00    1.00    4006       55 K    0.93    0.64    0.04    0.10     N/A     N/A     34
   5    0     0.00   1.18   0.00    1.00    3611       68 K    0.95    0.50    0.03    0.10     N/A     N/A     32
   6    0     0.00   1.22   0.00    1.00    2468       54 K    0.95    0.41    0.02    0.08     N/A     N/A     31
   7    0     0.00   0.79   0.00    1.00      16 K    208 K    0.92    0.48    0.05    0.13     N/A     N/A     34
   8    1     0.00   1.18   0.00    1.00      48 K    230 K    0.79    0.47    0.10    0.08     N/A     N/A     22
   9    1     0.01   1.17   0.01    1.00      78 K    453 K    0.83    0.38    0.06    0.06     N/A     N/A     23
  10    1     0.00   1.39   0.00    1.00    7367       45 K    0.84    0.54    0.05    0.05     N/A     N/A     23
  11    1     0.00   0.74   0.00    1.00    1211     7967      0.85    0.34    0.09    0.13     N/A     N/A     22
  12    1     0.00   0.88   0.00    1.00    1002     5663      0.82    0.33    0.11    0.12     N/A     N/A     22
  13    1     0.00   0.96   0.00    1.00     818     4268      0.81    0.35    0.11    0.11     N/A     N/A     22
  14    1     0.00   0.96   0.00    1.00     779     3867      0.80    0.31    0.11    0.11     N/A     N/A     23
  15    1     0.00   1.19   0.00    1.00    7827       25 K    0.69    0.35    0.09    0.05     N/A     N/A     21
-------------------------------------------------------------------------------------------------------------------
 SKT    0     0.04   1.49   0.03    1.00     302 K     10 M    0.97    0.51    0.01    0.06    0.00    0.00     31
 SKT    1     0.00   1.18   0.00    1.00     145 K    776 K    0.81    0.42    0.07    0.06    0.00    0.00     21
-------------------------------------------------------------------------------------------------------------------
 TOTAL  *     0.02   1.47   0.01    1.00     448 K     11 M    0.96    0.50    0.01    0.06    0.00    0.00     N/A

 Instructions retired:   10 G ; Active cycles: 7274 M ; Time (TSC):   30 Gticks ; C0 (active,non-halted) core residency: 1.47 %

 C1 core residency: 98.53 %; C3 core residency: 0.00 %; C6 core residency: 0.00 %; C7 core residency: 0.00 %
 C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %

 PHYSICAL CORE IPC                 : 1.47 => corresponds to 36.86 % utilization for cores in active state
 Instructions per nominal CPU cycle: 0.02 => corresponds to 0.54 % core utilization over time interval

Intel(r) QPI data traffic estimation in bytes (data traffic coming to CPU/socket through QPI links):

               QPI0     QPI1    |  QPI0   QPI1  
----------------------------------------------------------------------------------------------
 SKT    0        0        0     |  -2147483648%   -2147483648%   
 SKT    1        0        0     |  -2147483648%   -2147483648%   
----------------------------------------------------------------------------------------------
Total QPI incoming data traffic:    0       QPI data traffic/Memory controller traffic: -nan

Intel(r) QPI traffic estimation in bytes (data and non-data traffic outgoing from CPU/socket through QPI links):

               QPI0     QPI1    |  QPI0   QPI1  
----------------------------------------------------------------------------------------------
 SKT    0     9223372 T   9223372 T   |  -2147483648%   -2147483648%   
 SKT    1     9223372 T   9223372 T   |  -2147483648%   -2147483648%   
----------------------------------------------------------------------------------------------
Total QPI outgoing data and non-data traffic:    0  

----------------------------------------------------------------------------------------------
 SKT    0 package consumed 549.47 Joules
 SKT    1 package consumed 560.42 Joules
----------------------------------------------------------------------------------------------
 TOTAL:                    1109.89 Joules

----------------------------------------------------------------------------------------------
 SKT    0 DIMMs consumed 0.00 Joules
 SKT    1 DIMMs consumed 0.00 Joules
----------------------------------------------------------------------------------------------
 TOTAL:                  0.00 Joules

Zheng_Luo
Beginner

Roman Dementiev (Intel) wrote:

could you please post the whole output, including all the messages printed at the beginning right after the program invocation?

Hi Roman,

I have copied all the output above; I hope that will be more helpful.

Roman_D_Intel
Employee

Thanks. This output is helpful. Could you please send the output of the lspci command (run as root) for further diagnosis?

--

Roman

Roman_D_Intel
Employee

Also, could you specify the vendor of the system (Dell/HP/etc.) and the BIOS vendor/version (as seen in the output of the Linux "dmidecode" command)?

Thanks,

Roman

Zheng_Luo
Beginner

Roman Dementiev (Intel) wrote:

Thanks. This output is helpful. Could you please send the output of the lspci command (run as root) for further diagnosis?

Hello Roman,

I have put all of the lspci output into the attached lspci file. You can look at it there.

Zheng_Luo
Beginner

Roman Dementiev (Intel) wrote:

Also, could you specify the vendor of the system (Dell/HP/etc.) and the BIOS vendor/version (as seen in the output of the Linux "dmidecode" command)?

Hi Roman,

The system is a Dell, and I have uploaded the output of the dmidecode command as an attachment, dmidecode.txt.

Roman_D_Intel
Employee

Zheng Luo,

thanks a lot for the detailed output. I have developed a patch (attached) that will allow you to see memory bandwidth from the memory controller. Apply it using "patch < bus_dell_patch.txt".
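For example, from the directory where PCM was unpacked (assuming the patch applies at the top level of the source tree; adjust the -p level if patch cannot find the files, and rebuild with make as usual):

                 # cd IntelPerformanceCounterMonitorV2.5.1
                 # patch -p0 < bus_dell_patch.txt
                 # make
                 # ./pcm.x 10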

According to the lspci output, your BIOS hides the QPI performance monitoring devices, therefore QPI statistics cannot be available (you should see a message similar to the one described in this article). Could you please try to find a newer BIOS for your system, install it, and try PCM again? It could be that you need to enable a QPI/perfmon/PCM option (or similar) in the BIOS to unhide the QPI performance monitoring devices. If you can't find the option, you might ask the vendor to provide such an option.

Best regards,

Roman

Zheng_Luo
Beginner

Roman Dementiev (Intel) wrote:

I have developed a patch (attached) that will allow you to see memory bandwidth from the memory controller. [...] Could you please try to find a newer BIOS for your system, install it, and try PCM again?

Hello Roman, 

Thank you very much for the help; I really appreciate it. I updated the BIOS of the Dell T7600 from A5 to A9. You can see the attachment dmidecode_after_BIOS_update.txt. I compared it with the previous one, and it seems there are not too many changes. The attached diff_result.txt shows the difference between the previous dmidecode result and the result after the BIOS update.

I updated PCM using your patch; thank you for it. When I run PCM, the QPI readings still have the problem. I have attached the new readings too (PCM_output.txt).

I checked the settings in the BIOS and did not find anything related to QPI. In our BIOS settings we did disable Hyper-Threading and Intel Turbo Boost to make things easier for our project; I think that will not affect the QPI results of PCM. Am I right?

You said that if I can't find the option I might ask the vendor to provide it, and I will try that. I am really grateful for your help.

Zheng Luo

Roman_D_Intel
Employee

Zheng Luo,

This version of the BIOS seems to be much better. Your BIOS setting changes should not impact QPI perfmon device visibility. Could you please send me the lspci output?

I noticed in your PCM output the line "Number of PCM instances: 2". It may be that you are running two instances of PCM, or that some time ago a PCM instance was killed unexpectedly so that the PCM instance counting has become broken (this may also break the QPI statistics display in this version of PCM; to be fixed in the next release). To clean up the instance counting, please do the following:

1. stop all pcm instances

2. as root: rm -rf /dev/shm/sem.*Intel*

3. start pcm again

Please share the output of pcm started after you have done these operations.
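For example, the whole sequence at a root shell (the exact semaphore file names under /dev/shm may vary):

                 # ls /dev/shm/sem.*Intel*        # check which stale PCM semaphores exist
                 # rm -rf /dev/shm/sem.*Intel*    # remove them
                 # ./pcm.x 10                     # restart; "Number of PCM instances" should now show 1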

Thanks for your cooperation,

Roman

McCalpinJohn
Honored Contributor III

When we got the BIOS update from Dell that supported QPI performance counter access, the default setting was to not enable access.  We had to select an option to enable QPI performance counter access and then reboot.  Fortunately this only needed to be done once.

Zheng_Luo
Beginner

John D. McCalpin wrote:

When we got the BIOS update from Dell that supported QPI performance counter access, the default setting was to not enable access.  We had to select an option to enable QPI performance counter access and then reboot.  Fortunately this only needed to be done once.

Hello John,

Are you using the same computer as me? I am using a Dell T7600; what model are you using? Even after the update, I still cannot see a QPI option. If you are using the same model, can you tell me where in the BIOS I can find the option that turns on QPI counter access? Thank you very much.

Zheng Luo

Zheng_Luo
Beginner

Roman Dementiev (Intel) wrote:

To clean up the instance counting, please do the following: 1. stop all pcm instances; 2. as root: rm -rf /dev/shm/sem.*Intel*; 3. start pcm again.

Thank you very much Roman,

PCM seems to be working now; at least there are no more arbitrary values in the QPI readings. However, I tried to use the commands mentioned in http://software.intel.com/en-us/forums/topic/280235#comment-1755207 to generate some traffic over QPI; the two commands are shown below. The strange thing is that all the data in the QPI fields is still zero, and I don't know why. Is it because those commands do not generate any QPI traffic? If so, how can I generate traffic over QPI?
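These are the two runs, with my understanding of what each binding should do (node 0 and node 1 correspond to the two sockets):

    numactl --cpunodebind=0 --membind=0 ./lat_mem_rd -t 1024    # local: CPU and memory both on node 0, little or no QPI data traffic expected
    numactl --cpunodebind=0 --membind=1 ./lat_mem_rd -t 1024    # remote: CPU on node 0, memory on node 1, accesses must cross QPI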

By the way, how should a PCM instance be closed properly? I just use Ctrl+C to close the program...

Zheng Luo

McCalpinJohn
Honored Contributor III

We are running Dell DCS8000 systems, so the BIOS is likely different from yours. I don't know what command was used to enable the PCI configuration space areas for the QPI counters, and I was not able to find it in the list of options that I checked. It is probably best to follow up with Dell.
