
Xeon Phi KNC performance degradation after re-installation. Why? System or compiling wrong?

user1900
Beginner
1,289 Views

The question is: why is the GFlops/s figure of the hello-flops1.c example almost half of what someone got on this same machine a couple of years ago (presumably with a different installation of the system)?

Do you know what could be wrong with my system? A BIOS setting? A package that should be reinstalled on the Phi or on the host? Wrong modules? Wrong compilation options?

I have the book on Intel Xeon Phi by Jeffers and Reinders (1st edition) at hand, and I am following the hello-flops1.c example from Chapter 2, page 29.

I have a Xeon Phi KNC 7120P.

Reading here:
https://ark.intel.com/products/75799/Intel-Xeon-Phi-Coprocessor-7120P-16GB-1_238-GHz-61-core

- Clock Freq: 1.24 GHz (Max Turbo 1.33 GHz)
- Cores: 61
- Threads: 244
- Memory: 16GB
- Max Memory bandwidth: 352 GB/s

So I have 16 GB of memory compared with 8 GB in the book, and my clock frequency is a bit higher (1.24 GHz base) compared with 1.091 GHz.

When the author runs this example in the book, he gets:
- 17.206 GFlops/s.

When I run the same example on my machine, I get:
- 9.544 GFlops/s.

However, some years ago someone ran the same example on this machine and got a number very close to the one shown in the book (twice my current performance).


The code is exactly this: https://github.com/intel-unesp-mcp/infieri-2017-basic/blob/master/src/hello-flops1.c


I have compiled it with `icc -mmic -O3 hello-flops1.c -o hello-flops1`.
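
For readers without the linked file at hand, the timed part of such a benchmark looks roughly like the following. This is a minimal sketch and not the exact contents of hello-flops1.c: the names `flops_kernel` and `measure_gflops` and their parameters are assumptions made for illustration. The optimization report further down only confirms that a `dtime()` helper wraps `gettimeofday` and that the inner loop is vectorized.

```c
#include <stddef.h>
#include <sys/time.h>

/* Wall-clock time in seconds, built on gettimeofday (as the inlined
   dtime() in the optimization report suggests). */
static double dtime(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}

/* Hypothetical kernel: `iters` passes of fa[k] = a*fa[k] + fb[k]
   over `n` floats. Each element update is one multiply and one add,
   which a compiler may fuse into a single FMA instruction. */
static void flops_kernel(float *fa, const float *fb, float a,
                         int n, long iters)
{
    for (long j = 0; j < iters; j++) {
        for (int k = 0; k < n; k++) {
            fa[k] = a * fa[k] + fb[k];
        }
    }
}

/* Time the kernel and convert to GFlop/s, counting 2 flops per update.
   Returns 0.0 if the run was too short for the timer to resolve. */
static double measure_gflops(float *fa, const float *fb, float a,
                             int n, long iters)
{
    double start = dtime();
    flops_kernel(fa, fb, a, n, iters);
    double elapsed = dtime() - start;
    if (elapsed <= 0.0) {
        return 0.0;
    }
    return 1.0e-9 * 2.0 * (double)n * (double)iters / elapsed;
}
```

Because the inner loop is short and touches the same few elements on every pass, a kernel of this shape measures arithmetic throughput of a single core rather than memory bandwidth.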

I also dumped the optimization report with `icc -qopt-report=3 -mmic -O3`, which shows that the loops were vectorized:

Intel(R) Advisor can now assist with vectorization and show optimization
  report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.

Intel(R) C Intel(R) 64 Compiler for applications running on Intel(R) MIC Architecture, Version 17.0.4.196 Build 20170411

Compiler options: -qopt-report=3 -mmic -O3 -o hello-flops1

    Report from: Interprocedural optimizations [ipo]

INLINING OPTION VALUES:
  -inline-factor: 100
  -inline-min-size: 30
  -inline-max-size: 230
  -inline-max-total-size: 2000
  -inline-max-per-routine: 10000
  -inline-max-per-compile: 500000


Begin optimization report for: main(int, char **)

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (main(int, char **)) [1] hello-flops1.c(42,1)
  -> EXTERN: (53,9) printf(const char *__restrict__, ...)
  -> EXTERN: (60,9) printf(const char *__restrict__, ...)
  -> INLINE: (62,18) dtime()
    -> EXTERN: (23,5) gettimeofday(struct timeval *__restrict__, __timezone_ptr_t)
  -> INLINE: (75,18) dtime()
    -> EXTERN: (23,5) gettimeofday(struct timeval *__restrict__, __timezone_ptr_t)
  -> EXTERN: (88,14) printf(const char *__restrict__, ...)


    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]


LOOP BEGIN at hello-flops1.c(54,9)
   remark #15300: LOOP WAS VECTORIZED
   remark #15467: unmasked aligned streaming stores: 2
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 17
   remark #15477: vector cost: 1.620
   remark #15478: estimated potential speedup: 10.460
   remark #15487: type converts: 6
   remark #15488: --- end vector cost summary ---
   remark #25015: Estimate of max trip count of loop=65536
LOOP END

LOOP BEGIN at hello-flops1.c(64,9)
   remark #15542: loop was not vectorized: inner loop was already vectorized
   remark #25438: unrolled without remainder by 2
   remark #25015: Estimate of max trip count of loop=50000000

   LOOP BEGIN at hello-flops1.c(70,13)
      remark #15300: LOOP WAS VECTORIZED
      remark #15475: --- begin vector cost summary ---
      remark #15476: scalar cost: 8
      remark #15477: vector cost: 0.430
      remark #15478: estimated potential speedup: 18.280
      remark #15488: --- end vector cost summary ---
   LOOP END
LOOP END

    Report from: Code generation optimizations [cg]

hello-flops1.c(42,1):remark #34051: REGISTER ALLOCATION : [main] hello-flops1.c:42

    Hardware registers
        Reserved     :    2[ rsp rip]
        Available    :   63[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm31 k0-k7]
        Callee-save  :    6[ rbx rbp r12-r15]
        Assigned     :   36[ rax rdx rsi rdi zmm0-zmm27 k0-k3]

    Routine temporaries
        Total         :     151
            Global    :      27
            Local     :     124
        Regenerable   :      30
        Spilled       :       1

    Routine stack
        Variables     :      36 bytes*
            Reads     :      10 [4.50e+00 ~ 0.0%]
            Writes    :       2 [2.00e+00 ~ 0.0%]
        Spills        :       8 bytes*
            Reads     :       1 [0.00e+00 ~ 0.0%]
            Writes    :       1 [0.00e+00 ~ 0.0%]

    Notes

        *Non-overlapping variables and spills may share stack space,
         so the total stack size might be less than this.


===========================================================================

Begin optimization report for: dtime()

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (dtime()) [2] hello-flops1.c(20,1)
  -> EXTERN: (23,5) gettimeofday(struct timeval *__restrict__, __timezone_ptr_t)


    Report from: Code generation optimizations [cg]

hello-flops1.c(20,1):remark #34051: REGISTER ALLOCATION : [dtime] hello-flops1.c:20

    Hardware registers
        Reserved     :    2[ rsp rip]
        Available    :   63[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm31 k0-k7]
        Callee-save  :    6[ rbx rbp r12-r15]
        Assigned     :    9[ rax rdx rsi rdi zmm0-zmm3 k1]

    Routine temporaries
        Total         :      20
            Global    :       6
            Local     :      14
        Regenerable   :       4
        Spilled       :       0

    Routine stack
        Variables     :      16 bytes*
            Reads     :       4 [4.00e+00 ~ 17.4%]
            Writes    :       0 [0.00e+00 ~ 0.0%]
        Spills        :       0 bytes*
            Reads     :       0 [0.00e+00 ~ 0.0%]
            Writes    :       0 [0.00e+00 ~ 0.0%]

    Notes

        *Non-overlapping variables and spills may share stack space,
         so the total stack size might be less than this.


===========================================================================

 

Information from my host, including checks of the Phi:

uname -a
Linux batel.atc.unican.es 4.4.4-1.el7.elrepo.x86_64 #1 SMP Fri Mar 4 11:09:10 EST 2016 x86_64 x86_64 x86_64 GNU/Linux

service mpss status
mpss is running

modinfo mic
filename:       /lib/modules/4.4.4-1.el7.elrepo.x86_64/extra/mic.ko
license:        GPL
build_scmver:   e8ef53c4fa26582ac37b5e0101b7451a70263f6c
build_ondate:   2017-11-17 18:51:24 +0100
build_bywhom:   user@domain.es
build_number:   0
license:        GPL
license:        GPL
srcversion:     CD00183CAD76D762A01EBD7
depends:
vermagic:       4.4.4-1.el7.elrepo.x86_64 SMP mod_unload modversions
parm:           vnet:Vnet operating mode, one of: poll intr dma (vnetmode)
parm:           vnet_num_buffers:Number of buffers used by the VNET driver (int)
parm:           vnet_addr:Vnet driver host ring address (ulong)
parm:           ulimit:SCIF ulimit check (bool)
parm:           reg_cache:SCIF registration caching (bool)
parm:           huge_page:SCIF Huge Page Support (bool)
parm:           p2p:SCIF peer-to-peer (bool)
parm:           p2p_proxy:SCIF peer-to-peer proxy DMA support (bool)
parm:           watchdog:SCIF Watchdog (bool)
parm:           watchdog_auto_reboot:SCIF Watchdog auto reboot (bool)
parm:           msi:bool
parm:           mic_msi_enable:To enable MSIx in the driver.
parm:           pm_qos_cpu_dma_lat:int
parm:           mic_pm_qos_cpu_dma_lat:PM QoS CPU DMA latency in usecs.
parm:           ramoops_count:Maximum frame count for the ramoops driver. (int)
parm:           crash_dump:bool
parm:           mic_crash_dump_enabled:MIC Crash Dump enabled.
parm:           psmi:Enable/disable mic psmi (bool)

rpm -qa | grep -e intel-mic -e mpss
mpss-micmgmt-python-3.8.2-1.glibc2.12.x86_64
mpss-license-3.8.2-1.glibc2.12.x86_64
mpss-modules-dev-4.4.4-1.el7.elrepo.x86_64-3.8.2-1.x86_64
mpss-myo-dev-3.8.2-1.glibc2.12.x86_64
mpss-coi-dev-3.8.2-1.glibc2.12.x86_64
mpss-sdk-k1om-3.8.2-1.x86_64
mpss-sysmgmt-micras-3.8.2-1.glibc2.12.x86_64
mpss-myo-doc-3.8.2-1.glibc2.12.x86_64
mpss-daemon-3.8.2-1.glibc2.12.x86_64
mpss-offload-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-mpss-memdiag-kernel-3.8.2-1.glibc2.12.x86_64
mpss-micmgmt-doc-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-mpss-rasmm-kernel-3.8.2-1.glibc2.12.x86_64
mpss-daemon-dev-3.8.2-1.glibc2.12.x86_64
mpss-core-3.8.2-1.glibc2.12.x86_64
mpss-sysmgmt-micdiagnostic-3.8.2-1.glibc2.12.x86_64
mpss-miccheck-3.8.2-1.glibc2.12.x86_64
mpss-sysmgmt-python-3.8.2-1.glibc2.12.x86_64
mpss-micsmc-gui-3.8.2-1.glibc2.12.x86_64
mpss-modules-4.4.4-1.el7.elrepo.x86_64-3.8.2-1.x86_64
mpss-sciftutorials-doc-3.8.2-1.glibc2.12.x86_64
mpss-coi-doc-3.8.2-1.glibc2.12.x86_64
mpss-myo-3.8.2-1.glibc2.12.x86_64
mpss-micmgmt-3.8.2-1.glibc2.12.x86_64
mpss-mpm-3.8.2-1.glibc2.12.x86_64
mpss-offload-dev-3.8.2-1.glibc2.12.x86_64
mpss-sciftutorials-3.8.2-1.glibc2.12.x86_64
mpss-coi-3.8.2-1.glibc2.12.x86_64
mpss-eclipse-cdt-mpm-3.8.2-1.glibc2.12.x86_64
mpss-miccheck-bin-3.8.2-1.glibc2.12.x86_64
mpss-modules-headers-3.8.2-1.glibc2.12.x86_64
mpss-mpm-doc-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-mpss-flash-3.8.2-1.glibc2.12.x86_64
mpss-boot-files-3.8.2-1.glibc2.12.x86_64
mpss-core-dev-3.8.2-1.glibc2.12.x86_64

free -m
              total        used        free      shared  buff/cache   available
Mem:          16009         204       15344           8         460       15670
Swap:          8023           0        8023


lspci
00:00.0 Host bridge: Intel Corporation Xeon E5/Core i7 DMI2 (rev 07)
00:01.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 1a (rev 07)
00:02.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 2a (rev 07)
00:03.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 3a in PCI Express Mode (rev 07)
00:04.0 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 0 (rev 07)
00:04.1 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 1 (rev 07)
00:04.2 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 2 (rev 07)
00:04.3 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 3 (rev 07)
00:04.4 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 4 (rev 07)
00:04.5 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 5 (rev 07)
00:04.6 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 6 (rev 07)
00:04.7 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 7 (rev 07)
00:05.0 System peripheral: Intel Corporation Xeon E5/Core i7 Address Map, VTd_Misc, System Management (rev 07)
00:05.2 System peripheral: Intel Corporation Xeon E5/Core i7 Control Status and Global Errors (rev 07)
00:05.4 PIC: Intel Corporation Xeon E5/Core i7 I/O APIC (rev 07)
00:11.0 PCI bridge: Intel Corporation C600/X79 series chipset PCI Express Virtual Root Port (rev 06)
00:16.0 Communication controller: Intel Corporation C600/X79 series chipset MEI Controller #1 (rev 05)
00:16.1 Communication controller: Intel Corporation C600/X79 series chipset MEI Controller #2 (rev 05)
00:1a.0 USB controller: Intel Corporation C600/X79 series chipset USB2 Enhanced Host Controller #2 (rev 06)
00:1b.0 Audio device: Intel Corporation C600/X79 series chipset High Definition Audio Controller (rev 06)
00:1c.0 PCI bridge: Intel Corporation C600/X79 series chipset PCI Express Root Port 1 (rev b6)
00:1c.7 PCI bridge: Intel Corporation C600/X79 series chipset PCI Express Root Port 8 (rev b6)
00:1d.0 USB controller: Intel Corporation C600/X79 series chipset USB2 Enhanced Host Controller #1 (rev 06)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev a6)
00:1f.0 ISA bridge: Intel Corporation C600/X79 series chipset LPC Controller (rev 06)
00:1f.2 SATA controller: Intel Corporation C600/X79 series chipset 6-Port SATA AHCI Controller (rev 06)
00:1f.3 SMBus: Intel Corporation C600/X79 series chipset SMBus Host Controller (rev 06)
00:1f.6 Signal processing controller: Intel Corporation C600/X79 series chipset Thermal Management Controller (rev 06)
03:00.0 3D controller: NVIDIA Corporation GK110GL [Tesla K20m] (rev a1)
04:00.0 Serial Attached SCSI controller: Intel Corporation C602 chipset 4-Port SATA Storage Control Unit (rev 06)
05:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
05:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
07:00.0 PCI bridge: Renesas Technology Corp. SH7757 PCIe Switch [PS]
08:00.0 PCI bridge: Renesas Technology Corp. SH7757 PCIe Switch [PS]
08:01.0 PCI bridge: Renesas Technology Corp. SH7757 PCIe Switch [PS]
09:00.0 PCI bridge: Renesas Technology Corp. SH7757 PCIe-PCI Bridge [PPB]
0a:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. G200eR2
7f:08.0 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link 0 (rev 07)
7f:08.3 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link Reut 0 (rev 07)
7f:08.4 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link Reut 0 (rev 07)
7f:09.0 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link 1 (rev 07)
7f:09.3 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link Reut 1 (rev 07)
7f:09.4 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link Reut 1 (rev 07)
7f:0a.0 System peripheral: Intel Corporation Xeon E5/Core i7 Power Control Unit 0 (rev 07)
7f:0a.1 System peripheral: Intel Corporation Xeon E5/Core i7 Power Control Unit 1 (rev 07)
7f:0a.2 System peripheral: Intel Corporation Xeon E5/Core i7 Power Control Unit 2 (rev 07)
7f:0a.3 System peripheral: Intel Corporation Xeon E5/Core i7 Power Control Unit 3 (rev 07)
7f:0b.0 System peripheral: Intel Corporation Xeon E5/Core i7 Interrupt Control Registers (rev 07)
7f:0b.3 System peripheral: Intel Corporation Xeon E5/Core i7 Semaphore and Scratchpad Configuration Registers (rev 07)
7f:0c.0 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)
7f:0c.1 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)
7f:0c.2 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)
7f:0c.6 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller System Address Decoder 0 (rev 07)
7f:0c.7 System peripheral: Intel Corporation Xeon E5/Core i7 System Address Decoder (rev 07)
7f:0d.0 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)
7f:0d.1 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)
7f:0d.2 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)
7f:0d.6 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller System Address Decoder 1 (rev 07)
7f:0e.0 System peripheral: Intel Corporation Xeon E5/Core i7 Processor Home Agent (rev 07)
7f:0e.1 Performance counters: Intel Corporation Xeon E5/Core i7 Processor Home Agent Performance Monitoring (rev 07)
7f:0f.0 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Registers (rev 07)
7f:0f.1 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller RAS Registers (rev 07)
7f:0f.2 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 0 (rev 07)
7f:0f.3 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 1 (rev 07)
7f:0f.4 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 2 (rev 07)
7f:0f.5 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 3 (rev 07)
7f:0f.6 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 4 (rev 07)
7f:10.0 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 0 (rev 07)
7f:10.1 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 1 (rev 07)
7f:10.2 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 0 (rev 07)
7f:10.3 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 1 (rev 07)
7f:10.4 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 2 (rev 07)
7f:10.5 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 3 (rev 07)
7f:10.6 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 2 (rev 07)
7f:10.7 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 3 (rev 07)
7f:11.0 System peripheral: Intel Corporation Xeon E5/Core i7 DDRIO (rev 07)
7f:13.0 System peripheral: Intel Corporation Xeon E5/Core i7 R2PCIe (rev 07)
7f:13.1 Performance counters: Intel Corporation Xeon E5/Core i7 Ring to PCI Express Performance Monitor (rev 07)
7f:13.4 Performance counters: Intel Corporation Xeon E5/Core i7 QuickPath Interconnect Agent Ring Registers (rev 07)
7f:13.5 Performance counters: Intel Corporation Xeon E5/Core i7 Ring to QuickPath Interconnect Link 0 Performance Monitor (rev 07)
7f:13.6 System peripheral: Intel Corporation Xeon E5/Core i7 Ring to QuickPath Interconnect Link 1 Performance Monitor (rev 07)
80:00.0 PCI bridge: Intel Corporation Xeon E5/Core i7 DMI2 in PCI Express Mode (rev 07)
80:01.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 1a (rev 07)
80:02.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 2a (rev 07)
80:03.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 3a in PCI Express Mode (rev 07)
80:04.0 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 0 (rev 07)
80:04.1 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 1 (rev 07)
80:04.2 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 2 (rev 07)
80:04.3 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 3 (rev 07)
80:04.4 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 4 (rev 07)
80:04.5 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 5 (rev 07)
80:04.6 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 6 (rev 07)
80:04.7 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 7 (rev 07)
80:05.0 System peripheral: Intel Corporation Xeon E5/Core i7 Address Map, VTd_Misc, System Management (rev 07)
80:05.2 System peripheral: Intel Corporation Xeon E5/Core i7 Control Status and Global Errors (rev 07)
80:05.4 PIC: Intel Corporation Xeon E5/Core i7 I/O APIC (rev 07)
82:00.0 RAID bus controller: Adaptec Series 6 - 6G SAS/PCIe 2 (rev 01)
83:00.0 3D controller: NVIDIA Corporation GK110GL [Tesla K20m] (rev a1)
84:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor SE10/7120 series (rev 20)
ff:08.0 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link 0 (rev 07)
ff:08.3 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link Reut 0 (rev 07)
ff:08.4 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link Reut 0 (rev 07)
ff:09.0 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link 1 (rev 07)
ff:09.3 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link Reut 1 (rev 07)
ff:09.4 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link Reut 1 (rev 07)
ff:0a.0 System peripheral: Intel Corporation Xeon E5/Core i7 Power Control Unit 0 (rev 07)
ff:0a.1 System peripheral: Intel Corporation Xeon E5/Core i7 Power Control Unit 1 (rev 07)
ff:0a.2 System peripheral: Intel Corporation Xeon E5/Core i7 Power Control Unit 2 (rev 07)
ff:0a.3 System peripheral: Intel Corporation Xeon E5/Core i7 Power Control Unit 3 (rev 07)
ff:0b.0 System peripheral: Intel Corporation Xeon E5/Core i7 Interrupt Control Registers (rev 07)
ff:0b.3 System peripheral: Intel Corporation Xeon E5/Core i7 Semaphore and Scratchpad Configuration Registers (rev 07)
ff:0c.0 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)
ff:0c.1 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)
ff:0c.2 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)
ff:0c.6 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller System Address Decoder 0 (rev 07)
ff:0c.7 System peripheral: Intel Corporation Xeon E5/Core i7 System Address Decoder (rev 07)
ff:0d.0 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)
ff:0d.1 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)
ff:0d.2 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)
ff:0d.6 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller System Address Decoder 1 (rev 07)
ff:0e.0 System peripheral: Intel Corporation Xeon E5/Core i7 Processor Home Agent (rev 07)
ff:0e.1 Performance counters: Intel Corporation Xeon E5/Core i7 Processor Home Agent Performance Monitoring (rev 07)
ff:0f.0 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Registers (rev 07)
ff:0f.1 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller RAS Registers (rev 07)
ff:0f.2 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 0 (rev 07)
ff:0f.3 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 1 (rev 07)
ff:0f.4 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 2 (rev 07)
ff:0f.5 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 3 (rev 07)
ff:0f.6 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 4 (rev 07)
ff:10.0 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 0 (rev 07)
ff:10.1 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 1 (rev 07)
ff:10.2 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 0 (rev 07)
ff:10.3 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 1 (rev 07)
ff:10.4 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 2 (rev 07)
ff:10.5 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 3 (rev 07)
ff:10.6 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 2 (rev 07)
ff:10.7 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 3 (rev 07)
ff:11.0 System peripheral: Intel Corporation Xeon E5/Core i7 DDRIO (rev 07)
ff:13.0 System peripheral: Intel Corporation Xeon E5/Core i7 R2PCIe (rev 07)
ff:13.1 Performance counters: Intel Corporation Xeon E5/Core i7 Ring to PCI Express Performance Monitor (rev 07)
ff:13.4 Performance counters: Intel Corporation Xeon E5/Core i7 QuickPath Interconnect Agent Ring Registers (rev 07)
ff:13.5 Performance counters: Intel Corporation Xeon E5/Core i7 Ring to QuickPath Interconnect Link 0 Performance Monitor (rev 07)
ff:13.6 System peripheral: Intel Corporation Xeon E5/Core i7 Ring to QuickPath Interconnect Link 1 Performance Monitor (rev 07)

miccheck

MicCheck 3.8.2-1
Copyright (c) 2016, Intel Corporation.

Executing default tests for host
  Test 0: Check number of devices the OS sees in the system ... pass
  Test 1: Check mic driver is loaded ... pass
  Test 2: Check number of devices driver sees in the system ... pass
  Test 3: Check mpssd daemon is running ... pass
Executing default tests for device: 0
  Test 4 (mic0): Check device is in online state and its postcode is FF ... pass
  Test 5 (mic0): Check ras daemon is available in device ... pass
  Test 6 (mic0): Check running flash version is correct ... pass
  Test 7 (mic0): Check running SMC firmware version is correct ... pass

Status: OK


micinfo

MicInfo Utility Log
Created Fri Nov 17 20:19:41 2017


        System Info
                HOST OS                 : Linux
                OS Version              : 4.4.4-1.el7.elrepo.x86_64
                Driver Version          : 3.8.2-1
                MPSS Version            : 3.8.2

                Host Physical Memory    : 16009 MB

Device No: 0, Device Name: mic0

        Version
                Flash Version            : 2.1.02.0391
                SMC Firmware Version     : 1.17.6900
                SMC Boot Loader Version  : 1.8.4326
                Coprocessor OS Version   : 2.6.38.8+mpss3.8.2
                Device Serial Number     : ADKC32502182

        Board
                Vendor ID                : 0x8086
                Device ID                : 0x225c
                Subsystem ID             : 0x7d95
                Coprocessor Stepping ID  : 2
                PCIe Width               : x16
                PCIe Speed               : 5 GT/s
                PCIe Max payload size    : 256 bytes
                PCIe Max read req size   : 512 bytes
                Coprocessor Model        : 0x01
                Coprocessor Model Ext    : 0x00
                Coprocessor Type         : 0x00
                Coprocessor Family       : 0x0b
                Coprocessor Family Ext   : 0x00
                Coprocessor Stepping     : C0
                Board SKU                : C0PRQ-7120 P/A/X/D
                ECC Mode                 : Enabled
                SMC HW Revision          : Product 300W Passive CS

        Cores
                Total No of Active Cores : 61
                Voltage                  : 994000 uV
                Frequency                : 1238095 kHz

        Thermal
                Fan Speed Control        : N/A
                Fan RPM                  : N/A
                Fan PWM                  : N/A
                Die Temp                 : 46 C

        GDDR
                GDDR Vendor              : Samsung
                GDDR Version             : 0x6
                GDDR Density             : 4096 Mb
                GDDR Size                : 15872 MB
                GDDR Technology          : GDDR5
                GDDR Speed               : 5.500000 GT/s
                GDDR Frequency           : 2750000 kHz
                GDDR Voltage             : 1501000 uV

 

Now, from my Phi:

> reading /proc/cpuinfo

# One example:
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 11
model           : 1
model name      : 0b/01
stepping        : 2
cpu MHz         : 1238.094
cache size      : 512 KB
physical id     : 0
siblings        : 244
core id         : 60
cpu cores       : 61
apicid          : 240
initial apicid  : 240
fpu             : yes
fpu_exception   : yes
cpuid level     : 4
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr mca pat fxsr ht syscall nx lm nopl lahf_lm
bogomips        : 2464.40
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

# Now listing every core, thread:
physical id     : 0
siblings        : 244
core id         : 0
cpu cores       : 61
physical id     : 0
siblings        : 244
core id         : 0
cpu cores       : 61
physical id     : 0
siblings        : 244
core id         : 0
cpu cores       : 61
physical id     : 0
siblings        : 244
core id         : 0
cpu cores       : 61
physical id     : 0
siblings        : 244
core id         : 1
cpu cores       : 61
physical id     : 0
siblings        : 244
core id         : 1
cpu cores       : 61
...
physical id     : 0
siblings        : 244
core id         : 60
cpu cores       : 61
physical id     : 0
siblings        : 244
core id         : 60
cpu cores       : 61

> lscpu
Architecture:          k1om
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                244
On-line CPU(s) list:   0-243
Thread(s) per core:    4
Core(s) per socket:    61
Socket(s):             1
Vendor ID:             GenuineIntel
CPU family:            11
Model:                 1
Stepping:              2
CPU MHz:               1238.094
BogoMIPS:              2482.64
L1d cache:             32K
L1i cache:             32K
L2d cache:             512K

> uname -a
Linux domain.es 2.6.38.8+mpss3.8.2 #1 SMP Mon Apr 24 04:54:20 EDT 2017 k1om GNU/Linux

> free -m
             total       used       free     shared    buffers     cached
Mem:         15513        374      15138          0          0        127
-/+ buffers/cache:        246      15266
Swap:            0          0          0

 

4 Replies
JJK
New Contributor III

I've got similar hardware (7120P) over here and am getting 19.6 GFlops/s with turbo disabled and 21.1 GFlops/s with turbo enabled.

There are two major differences between your setup and the book (and most likely my setup):

  1. you're running EL7/CentOS7 with a 4.4.4 kernel; this is definitely not supported by Intel, but I doubt it affects performance that much in this sample code
  2. you've compiled the code using ICC 2017, whereas the book used ICC 2013; I used ICC 2015 myself to get the above results.

If you want I can send you my binary, or you can send me yours and I will run the code on my box to see how it compares.

user1900
Beginner

Yes please, that would be great. Thank you for your answer.

I have spent the whole weekend running tests, without success.

I am attaching the code and two binaries:

- hello-flops1.mic  (vectorized O3, 2 flops per calc, 1 thread, vec aligned 64)

- hello-flops1.1op.mic  (vectorized O3, 1 flop per calc, ONLY the addition, 1 thread, vec aligned 64)

To create these versions, toggle FLOPSPERCALC between 1 and 2.

It is weird; look at my results. It is as if the FMA were not working, or as if the "null thread" were not behaving as in other cases...

I was also wondering about the compiler; indeed, the book uses the 2013 version. Thanks. Still, I will wait for your answer. Please try to run those 2 versions with my binaries, then recompile my code and run both versions again. Thank you in advance :)  It would be fantastic if you could also attach your assembly, so that I can compare it with my assembly output and see whether the cause is the compiler or the system (kernel 4, as you pointed out).

*Question*: can I switch turbo on/off, or see whether it is enabled, from the OS? I don't currently have physical access to the node (but I will in a few days). In any case, this is an improvement to consider afterwards.

I post here my notes:

Theory:

Clock freq * #Cores * lanes * FMA FLOPs/cycle.

All for floating point single precision. All compiled with O3.

In the book:

1.091 GHz * 61 cores * 16 lanes (512-bit vector / 32bit floats) * 2 (FMA: fused mult add) = 2129.6 GFlops/s.
Per core: 34.912 GFlops/s.

He gets 17.206 GFlops/s on the Xeon Phi (64-byte aligned, vectorized, 2 FlopsPerCalc), while the theory says 34.9 GFlops/s. The coprocessor always schedules a different thread on each clock cycle. If only one thread is running on a core, the scheduler uses a special "null thread" that does nothing, so every other cycle is skipped when only one thread is active (page 32).
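
The "null thread" rule can be modeled in a few lines. The sketch below assumes, as the book describes, that a core never issues from the same hardware thread on two consecutive cycles. The function names are hypothetical, and the model ignores every other pipeline effect.

```c
/* Fraction of cycles on which one core can issue a vector instruction,
   under the assumption that the same hardware thread cannot issue on
   two consecutive cycles. One thread fills every other cycle; two or
   more threads can fill every cycle. */
static double issue_utilization(int threads_on_core)
{
    if (threads_on_core <= 0) {
        return 0.0;
    }
    return threads_on_core == 1 ? 0.5 : 1.0;
}

/* Expected per-core ceiling in GFlop/s for single precision with FMA:
   clock (GHz) * vector lanes * 2 flops per FMA * issue utilization. */
static double core_ceiling_gflops(double clock_ghz, int lanes,
                                  int threads_on_core)
{
    return clock_ghz * (double)lanes * 2.0
           * issue_utilization(threads_on_core);
}
```

With the book's 1.091 GHz clock and 16 lanes, `core_ceiling_gflops(1.091, 16, 1)` gives 17.456, which is close to the 17.206 GFlops/s reported for one thread.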

Without vectorization he gets 0.195 Gflops/s, while the theory says 2.182 (1.091 * 61 * 1 * 2 / 61)

With 2 threads (OpenMP) he gets 34.453 GFlops/s: the coprocessor is working on every cycle.

In my case 7120P:

1.234 GHz * 61 cores * 16 lanes * 2 = 2408.8 GFlops/s.
Per core: 39.488 GFlops/s.

Around 2.468 GFlops/s per core if not vectorized (but we get 0.202).

With 1 thread we get 9.544, which is 24% of the per-core peak (worse than in the book, where he reaches almost 49%).

With 2 or 4 threads we get around 39.3, almost reaching the peak (he reaches 98.7%, we reach 99.5%).
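
The arithmetic in these notes can be checked with a small helper. This is only a sketch of the formula stated under "Theory"; `peak_gflops` and `percent_of_peak` are hypothetical names, and the inputs are the figures quoted in this post.

```c
/* Theoretical single-precision peak in GFlop/s:
   clock (GHz) * cores * vector lanes * flops per lane per cycle. */
static double peak_gflops(double clock_ghz, int cores, int lanes,
                          int flops_per_lane)
{
    return clock_ghz * (double)cores * (double)lanes
           * (double)flops_per_lane;
}

/* Measured throughput as a percentage of a given peak. */
static double percent_of_peak(double measured_gflops, double peak)
{
    return peak > 0.0 ? 100.0 * measured_gflops / peak : 0.0;
}
```

For example, `peak_gflops(1.234, 1, 16, 2)` is the per-core figure of 39.488, and `percent_of_peak(9.544, 39.488)` is about 24.2, which matches the 24% quoted for one thread.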

Empirical results:

| Device         | Aligned | Vectorized | FlopsPerCalc | GFlops/s |
|----------------|---------|------------|--------------|----------|
| Xeon E5-2620   |      64 | vec        |            1 |    9.171 |
| Xeon E5-2620   |      64 | vec        |            2 |   10.925 |
| Xeon E5-2620   |      64 | no-vec     |            1 |    1.643 |
| Xeon E5-2620   |      64 | no-vec     |            2 |    2.303 |
| Xeon Phi 7120P |      64 | vec        |            1 |    9.840 |
| Xeon Phi 7120P |      64 | vec        |            2 |    9.544 |
| Xeon Phi 7120P |      64 | no-vec     |            1 |    0.409 |
| Xeon Phi 7120P |      64 | no-vec     |            2 |    0.202 |

| Device         | Aligned | FlopsPerCalc | GFlops/s |
|----------------|---------|--------------|----------|
| Xeon Phi 7120P |       8 |            1 |    9.804 |
| Xeon Phi 7120P |       8 |            2 |    3.871 |
| Xeon Phi 7120P |      16 |            1 |    9.840 |
| Xeon Phi 7120P |      16 |            2 |    3.874 |
| Xeon Phi 7120P |      32 |            1 |    9.841 |
| Xeon Phi 7120P |      32 |            2 |    3.887 |
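
The "Aligned" column refers to the byte alignment of the arrays. The post does not show how the alignment was changed, so the following is an illustration only: one way to obtain and check a buffer with a chosen alignment in C11. `alloc_aligned_floats` and `is_aligned` are hypothetical helper names; `aligned_alloc` is the standard C11 call.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate room for `count` floats at the requested byte alignment
   (a power of two such as 16 or 64). C11 aligned_alloc expects the size
   to be a multiple of the alignment, so the size is rounded up.
   Returns NULL on failure; release the buffer with free(). */
static float *alloc_aligned_floats(size_t count, size_t alignment)
{
    size_t bytes = count * sizeof(float);
    size_t rounded = (bytes + alignment - 1) / alignment * alignment;
    return aligned_alloc(alignment, rounded);
}

/* True when the pointer sits on an `alignment`-byte boundary. */
static int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

A 64-byte boundary matches the width of one 512-bit vector register, which is why 64 is the natural alignment to test against on this coprocessor.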

| Device         | Aligned | Threads | FlopsPerCalc | GFlops/s |
|----------------|---------|---------|--------------|----------|
| Xeon E5-2620   |      64 |       2 |            1 |    6.503 |
| Xeon E5-2620   |      64 |       2 |            2 |   11.004 |
| Xeon E5-2620   |      64 |       4 |            1 |   10.762 |
| Xeon E5-2620   |      64 |       4 |            2 |   19.521 |
| Xeon Phi 7120P |      64 |       2 |            1 |   31.490 |
| Xeon Phi 7120P |      64 |       2 |            2 |   39.365 |
| Xeon Phi 7120P |      64 |       4 |            1 |   31.459 |
| Xeon Phi 7120P |      64 |       4 |            2 |   39.245 |
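
The multi-threaded rows come from an OpenMP variant of the kernel. A minimal sketch of that idea follows, assuming each thread updates its own slice of the arrays. The function name and the slicing scheme are illustrative and are not taken from the book's code. When the file is built without OpenMP support, the pragma is skipped and the slices run one after another.

```c
/* Hypothetical multi-threaded kernel: `nthreads` slices of `chunk`
   floats each, every slice updated `iters` times by one thread.
   Compile with the compiler's OpenMP flag (for example -qopenmp for
   icc or -fopenmp for gcc) to run the slices in parallel. */
static void flops_kernel_omp(float *fa, const float *fb, float a,
                             int nthreads, int chunk, long iters)
{
#ifdef _OPENMP
    #pragma omp parallel for num_threads(nthreads)
#endif
    for (int t = 0; t < nthreads; t++) {
        /* Each thread works on a private, non-overlapping slice. */
        float *pa = fa + (long)t * chunk;
        const float *pb = fb + (long)t * chunk;
        for (long j = 0; j < iters; j++) {
            for (int k = 0; k < chunk; k++) {
                pa[k] = a * pa[k] + pb[k];
            }
        }
    }
}
```

Where the threads land depends on the affinity settings: the jump from about 9.5 to about 39 GFlops/s in the table is what one would expect if the extra threads share a core and fill the cycles that a single thread leaves idle.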

 

I have looked at the assembly and it does use FMA (`vfmadd213ps` at line 73, where the multiplication and addition occur).

 

 

user1900
Beginner

Meanwhile I am trying to get a download link for the Intel C/C++ Compiler/Composer 2013 or 2015. I couldn't find it on the web, so I have emailed Intel Support.

I am also attaching the assembly here, but I would rather review it myself once you have run the tests I asked for.

 

Thanks again

RN1
New Contributor I

DONE :)

I got the 2015 version, installed it this afternoon, and have been running tests. With the 2015 version I get 19.76 GFlops/s for the initial configuration (64-byte aligned, 1 thread, 2 flops per calc).

Now the question is: why does the 2015 compiler produce faster code than the 2016-2017 versions?
