The question is: why is the GFlops/s figure I get from the hello-flops1.c example almost half of what someone obtained on this same machine a couple of years ago (presumably with a different installation of the system)?
Do you know what could be wrong with my system? A BIOS setting? A package that should be reinstalled on the Phi or on the host? Wrong modules? Wrong compilation options?
I have the book Intel Xeon Phi by Jeffers and Reinders (1st edition) in hand, and I am following the hello-flops1.c example from Chapter 2, on page 29.
I have a Xeon Phi KNC 7120P.
Reading here:
https://ark.intel.com/products/75799/Intel-Xeon-Phi-Coprocessor-7120P-16GB-1_238-GHz-61-core
- Clock Freq: 1.24 GHz (Max Turbo 1.33 GHz)
- Cores: 61
- Threads: 244
- Memory: 16GB
- Max Memory bandwidth: 352 GB/s
So I have 16 GB of memory versus 8 GB in the book, and my base clock is a bit higher (1.24 GHz versus 1.091 GHz).
When the author runs the example in the book, he gets:
- 17.206 GFlops/s.
When I run the same example on my machine, I get:
- 9.544 GFlops/s.
However, some years ago someone ran the same example on this machine and got a number very close to the one in the book (about twice what I get now).
The code is exactly this: https://github.com/intel-unesp-mcp/infieri-2017-basic/blob/master/src/hello-flops1.c
I have compiled it with `icc -mmic -O3 hello-flops1.c -o hello-flops1`.
I also generated the optimization report with `icc -qopt-report=3 -mmic -O3`; it says that the loops were vectorized:
Intel(R) Advisor can now assist with vectorization and show optimization report messages with your source code.
See "https://software.intel.com/en-us/intel-advisor-xe" for details.

Intel(R) C Intel(R) 64 Compiler for applications running on Intel(R) MIC Architecture, Version 17.0.4.196 Build 20170411

Compiler options: -qopt-report=3 -mmic -O3 -o hello-flops1

    Report from: Interprocedural optimizations [ipo]

INLINING OPTION VALUES:
  -inline-factor: 100
  -inline-min-size: 30
  -inline-max-size: 230
  -inline-max-total-size: 2000
  -inline-max-per-routine: 10000
  -inline-max-per-compile: 500000

Begin optimization report for: main(int, char **)

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (main(int, char **)) [1] hello-flops1.c(42,1)
  -> EXTERN: (53,9) printf(const char *__restrict__, ...)
  -> EXTERN: (60,9) printf(const char *__restrict__, ...)
  -> INLINE: (62,18) dtime()
    -> EXTERN: (23,5) gettimeofday(struct timeval *__restrict__, __timezone_ptr_t)
  -> INLINE: (75,18) dtime()
    -> EXTERN: (23,5) gettimeofday(struct timeval *__restrict__, __timezone_ptr_t)
  -> EXTERN: (88,14) printf(const char *__restrict__, ...)

    Report from: Loop nest, Vector & Auto-parallelization optimizations [loop, vec, par]

LOOP BEGIN at hello-flops1.c(54,9)
   remark #15300: LOOP WAS VECTORIZED
   remark #15467: unmasked aligned streaming stores: 2
   remark #15475: --- begin vector cost summary ---
   remark #15476: scalar cost: 17
   remark #15477: vector cost: 1.620
   remark #15478: estimated potential speedup: 10.460
   remark #15487: type converts: 6
   remark #15488: --- end vector cost summary ---
   remark #25015: Estimate of max trip count of loop=65536
LOOP END

LOOP BEGIN at hello-flops1.c(64,9)
   remark #15542: loop was not vectorized: inner loop was already vectorized
   remark #25438: unrolled without remainder by 2
   remark #25015: Estimate of max trip count of loop=50000000

   LOOP BEGIN at hello-flops1.c(70,13)
      remark #15300: LOOP WAS VECTORIZED
      remark #15475: --- begin vector cost summary ---
      remark #15476: scalar cost: 8
      remark #15477: vector cost: 0.430
      remark #15478: estimated potential speedup: 18.280
      remark #15488: --- end vector cost summary ---
   LOOP END
LOOP END

    Report from: Code generation optimizations [cg]

hello-flops1.c(42,1):remark #34051: REGISTER ALLOCATION : [main] hello-flops1.c:42

    Hardware registers
        Reserved     : 2[ rsp rip]
        Available    : 63[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm31 k0-k7]
        Callee-save  : 6[ rbx rbp r12-r15]
        Assigned     : 36[ rax rdx rsi rdi zmm0-zmm27 k0-k3]

    Routine temporaries
        Total        : 151
            Global   : 27
            Local    : 124
        Regenerable  : 30
        Spilled      : 1

    Routine stack
        Variables    : 36 bytes*
            Reads    : 10 [4.50e+00 ~ 0.0%]
            Writes   : 2 [2.00e+00 ~ 0.0%]
        Spills       : 8 bytes*
            Reads    : 1 [0.00e+00 ~ 0.0%]
            Writes   : 1 [0.00e+00 ~ 0.0%]

    Notes
        *Non-overlapping variables and spills may share stack space, so the total stack size might be less than this.

===========================================================================

Begin optimization report for: dtime()

    Report from: Interprocedural optimizations [ipo]

INLINE REPORT: (dtime()) [2] hello-flops1.c(20,1)
  -> EXTERN: (23,5) gettimeofday(struct timeval *__restrict__, __timezone_ptr_t)

    Report from: Code generation optimizations [cg]

hello-flops1.c(20,1):remark #34051: REGISTER ALLOCATION : [dtime] hello-flops1.c:20

    Hardware registers
        Reserved     : 2[ rsp rip]
        Available    : 63[ rax rdx rcx rbx rbp rsi rdi r8-r15 mm0-mm7 zmm0-zmm31 k0-k7]
        Callee-save  : 6[ rbx rbp r12-r15]
        Assigned     : 9[ rax rdx rsi rdi zmm0-zmm3 k1]

    Routine temporaries
        Total        : 20
            Global   : 6
            Local    : 14
        Regenerable  : 4
        Spilled      : 0

    Routine stack
        Variables    : 16 bytes*
            Reads    : 4 [4.00e+00 ~ 17.4%]
            Writes   : 0 [0.00e+00 ~ 0.0%]
        Spills       : 0 bytes*
            Reads    : 0 [0.00e+00 ~ 0.0%]
            Writes   : 0 [0.00e+00 ~ 0.0%]

    Notes
        *Non-overlapping variables and spills may share stack space, so the total stack size might be less than this.

===========================================================================
Information from my host, including the checks of the Phi:

> uname -a
Linux batel.atc.unican.es 4.4.4-1.el7.elrepo.x86_64 #1 SMP Fri Mar 4 11:09:10 EST 2016 x86_64 x86_64 x86_64 GNU/Linux

> service mpss status
mpss is running

> modinfo mic
filename: /lib/modules/4.4.4-1.el7.elrepo.x86_64/extra/mic.ko
license: GPL
build_scmver: e8ef53c4fa26582ac37b5e0101b7451a70263f6c
build_ondate: 2017-11-17 18:51:24 +0100
build_bywhom: user@domain.es
build_number: 0
license: GPL
license: GPL
srcversion: CD00183CAD76D762A01EBD7
depends:
vermagic: 4.4.4-1.el7.elrepo.x86_64 SMP mod_unload modversions
parm: vnet:Vnet operating mode, one of: poll intr dma (vnetmode)
parm: vnet_num_buffers:Number of buffers used by the VNET driver (int)
parm: vnet_addr:Vnet driver host ring address (ulong)
parm: ulimit:SCIF ulimit check (bool)
parm: reg_cache:SCIF registration caching (bool)
parm: huge_page:SCIF Huge Page Support (bool)
parm: p2p:SCIF peer-to-peer (bool)
parm: p2p_proxy:SCIF peer-to-peer proxy DMA support (bool)
parm: watchdog:SCIF Watchdog (bool)
parm: watchdog_auto_reboot:SCIF Watchdog auto reboot (bool)
parm: msi:bool
parm: mic_msi_enable:To enable MSIx in the driver.
parm: pm_qos_cpu_dma_lat:int
parm: mic_pm_qos_cpu_dma_lat:PM QoS CPU DMA latency in usecs.
parm: ramoops_count:Maximum frame count for the ramoops driver. (int)
parm: crash_dump:bool
parm: mic_crash_dump_enabled:MIC Crash Dump enabled.
parm: psmi:Enable/disable mic psmi (bool)

> rpm -qa | grep -e intel-mic -e mpss
mpss-micmgmt-python-3.8.2-1.glibc2.12.x86_64
mpss-license-3.8.2-1.glibc2.12.x86_64
mpss-modules-dev-4.4.4-1.el7.elrepo.x86_64-3.8.2-1.x86_64
mpss-myo-dev-3.8.2-1.glibc2.12.x86_64
mpss-coi-dev-3.8.2-1.glibc2.12.x86_64
mpss-sdk-k1om-3.8.2-1.x86_64
mpss-sysmgmt-micras-3.8.2-1.glibc2.12.x86_64
mpss-myo-doc-3.8.2-1.glibc2.12.x86_64
mpss-daemon-3.8.2-1.glibc2.12.x86_64
mpss-offload-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-mpss-memdiag-kernel-3.8.2-1.glibc2.12.x86_64
mpss-micmgmt-doc-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-mpss-rasmm-kernel-3.8.2-1.glibc2.12.x86_64
mpss-daemon-dev-3.8.2-1.glibc2.12.x86_64
mpss-core-3.8.2-1.glibc2.12.x86_64
mpss-sysmgmt-micdiagnostic-3.8.2-1.glibc2.12.x86_64
mpss-miccheck-3.8.2-1.glibc2.12.x86_64
mpss-sysmgmt-python-3.8.2-1.glibc2.12.x86_64
mpss-micsmc-gui-3.8.2-1.glibc2.12.x86_64
mpss-modules-4.4.4-1.el7.elrepo.x86_64-3.8.2-1.x86_64
mpss-sciftutorials-doc-3.8.2-1.glibc2.12.x86_64
mpss-coi-doc-3.8.2-1.glibc2.12.x86_64
mpss-myo-3.8.2-1.glibc2.12.x86_64
mpss-micmgmt-3.8.2-1.glibc2.12.x86_64
mpss-mpm-3.8.2-1.glibc2.12.x86_64
mpss-offload-dev-3.8.2-1.glibc2.12.x86_64
mpss-sciftutorials-3.8.2-1.glibc2.12.x86_64
mpss-coi-3.8.2-1.glibc2.12.x86_64
mpss-eclipse-cdt-mpm-3.8.2-1.glibc2.12.x86_64
mpss-miccheck-bin-3.8.2-1.glibc2.12.x86_64
mpss-modules-headers-3.8.2-1.glibc2.12.x86_64
mpss-mpm-doc-3.8.2-1.glibc2.12.x86_64
glibc2.12pkg-mpss-flash-3.8.2-1.glibc2.12.x86_64
mpss-boot-files-3.8.2-1.glibc2.12.x86_64
mpss-core-dev-3.8.2-1.glibc2.12.x86_64

> free -m
       total   used   free   shared   buff/cache   available
Mem:   16009   204    15344  8        460          15670
Swap:  8023    0      8023

> lspci
00:00.0 Host bridge: Intel Corporation Xeon E5/Core i7 DMI2 (rev 07)
00:01.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 1a (rev 07)
00:02.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 2a (rev 07)
00:03.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 3a in PCI Express Mode (rev 07)
00:04.0 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 0 (rev 07)
00:04.1 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 1 (rev 07)
00:04.2 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 2 (rev 07)
00:04.3 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 3 (rev 07)
00:04.4 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 4 (rev 07)
00:04.5 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 5 (rev 07)
00:04.6 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 6 (rev 07)
00:04.7 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 7 (rev 07)
00:05.0 System peripheral: Intel Corporation Xeon E5/Core i7 Address Map, VTd_Misc, System Management (rev 07)
00:05.2 System peripheral: Intel Corporation Xeon E5/Core i7 Control Status and Global Errors (rev 07)
00:05.4 PIC: Intel Corporation Xeon E5/Core i7 I/O APIC (rev 07)
00:11.0 PCI bridge: Intel Corporation C600/X79 series chipset PCI Express Virtual Root Port (rev 06)
00:16.0 Communication controller: Intel Corporation C600/X79 series chipset MEI Controller #1 (rev 05)
00:16.1 Communication controller: Intel Corporation C600/X79 series chipset MEI Controller #2 (rev 05)
00:1a.0 USB controller: Intel Corporation C600/X79 series chipset USB2 Enhanced Host Controller #2 (rev 06)
00:1b.0 Audio device: Intel Corporation C600/X79 series chipset High Definition Audio Controller (rev 06)
00:1c.0 PCI bridge: Intel Corporation C600/X79 series chipset PCI Express Root Port 1 (rev b6)
00:1c.7 PCI bridge: Intel Corporation C600/X79 series chipset PCI Express Root Port 8 (rev b6)
00:1d.0 USB controller: Intel Corporation C600/X79 series chipset USB2 Enhanced Host Controller #1 (rev 06)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev a6)
00:1f.0 ISA bridge: Intel Corporation C600/X79 series chipset LPC Controller (rev 06)
00:1f.2 SATA controller: Intel Corporation C600/X79 series chipset 6-Port SATA AHCI Controller (rev 06)
00:1f.3 SMBus: Intel Corporation C600/X79 series chipset SMBus Host Controller (rev 06)
00:1f.6 Signal processing controller: Intel Corporation C600/X79 series chipset Thermal Management Controller (rev 06)
03:00.0 3D controller: NVIDIA Corporation GK110GL [Tesla K20m] (rev a1)
04:00.0 Serial Attached SCSI controller: Intel Corporation C602 chipset 4-Port SATA Storage Control Unit (rev 06)
05:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
05:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
07:00.0 PCI bridge: Renesas Technology Corp. SH7757 PCIe Switch [PS]
08:00.0 PCI bridge: Renesas Technology Corp. SH7757 PCIe Switch [PS]
08:01.0 PCI bridge: Renesas Technology Corp. SH7757 PCIe Switch [PS]
09:00.0 PCI bridge: Renesas Technology Corp. SH7757 PCIe-PCI Bridge [PPB]
0a:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. G200eR2
7f:08.0 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link 0 (rev 07)
7f:08.3 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link Reut 0 (rev 07)
7f:08.4 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link Reut 0 (rev 07)
7f:09.0 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link 1 (rev 07)
7f:09.3 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link Reut 1 (rev 07)
7f:09.4 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link Reut 1 (rev 07)
7f:0a.0 System peripheral: Intel Corporation Xeon E5/Core i7 Power Control Unit 0 (rev 07)
7f:0a.1 System peripheral: Intel Corporation Xeon E5/Core i7 Power Control Unit 1 (rev 07)
7f:0a.2 System peripheral: Intel Corporation Xeon E5/Core i7 Power Control Unit 2 (rev 07)
7f:0a.3 System peripheral: Intel Corporation Xeon E5/Core i7 Power Control Unit 3 (rev 07)
7f:0b.0 System peripheral: Intel Corporation Xeon E5/Core i7 Interrupt Control Registers (rev 07)
7f:0b.3 System peripheral: Intel Corporation Xeon E5/Core i7 Semaphore and Scratchpad Configuration Registers (rev 07)
7f:0c.0 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)
7f:0c.1 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)
7f:0c.2 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)
7f:0c.6 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller System Address Decoder 0 (rev 07)
7f:0c.7 System peripheral: Intel Corporation Xeon E5/Core i7 System Address Decoder (rev 07)
7f:0d.0 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)
7f:0d.1 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)
7f:0d.2 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)
7f:0d.6 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller System Address Decoder 1 (rev 07)
7f:0e.0 System peripheral: Intel Corporation Xeon E5/Core i7 Processor Home Agent (rev 07)
7f:0e.1 Performance counters: Intel Corporation Xeon E5/Core i7 Processor Home Agent Performance Monitoring (rev 07)
7f:0f.0 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Registers (rev 07)
7f:0f.1 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller RAS Registers (rev 07)
7f:0f.2 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 0 (rev 07)
7f:0f.3 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 1 (rev 07)
7f:0f.4 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 2 (rev 07)
7f:0f.5 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 3 (rev 07)
7f:0f.6 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 4 (rev 07)
7f:10.0 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 0 (rev 07)
7f:10.1 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 1 (rev 07)
7f:10.2 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 0 (rev 07)
7f:10.3 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 1 (rev 07)
7f:10.4 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 2 (rev 07)
7f:10.5 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 3 (rev 07)
7f:10.6 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 2 (rev 07)
7f:10.7 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 3 (rev 07)
7f:11.0 System peripheral: Intel Corporation Xeon E5/Core i7 DDRIO (rev 07)
7f:13.0 System peripheral: Intel Corporation Xeon E5/Core i7 R2PCIe (rev 07)
7f:13.1 Performance counters: Intel Corporation Xeon E5/Core i7 Ring to PCI Express Performance Monitor (rev 07)
7f:13.4 Performance counters: Intel Corporation Xeon E5/Core i7 QuickPath Interconnect Agent Ring Registers (rev 07)
7f:13.5 Performance counters: Intel Corporation Xeon E5/Core i7 Ring to QuickPath Interconnect Link 0 Performance Monitor (rev 07)
7f:13.6 System peripheral: Intel Corporation Xeon E5/Core i7 Ring to QuickPath Interconnect Link 1 Performance Monitor (rev 07)
80:00.0 PCI bridge: Intel Corporation Xeon E5/Core i7 DMI2 in PCI Express Mode (rev 07)
80:01.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 1a (rev 07)
80:02.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 2a (rev 07)
80:03.0 PCI bridge: Intel Corporation Xeon E5/Core i7 IIO PCI Express Root Port 3a in PCI Express Mode (rev 07)
80:04.0 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 0 (rev 07)
80:04.1 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 1 (rev 07)
80:04.2 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 2 (rev 07)
80:04.3 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 3 (rev 07)
80:04.4 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 4 (rev 07)
80:04.5 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 5 (rev 07)
80:04.6 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 6 (rev 07)
80:04.7 System peripheral: Intel Corporation Xeon E5/Core i7 DMA Channel 7 (rev 07)
80:05.0 System peripheral: Intel Corporation Xeon E5/Core i7 Address Map, VTd_Misc, System Management (rev 07)
80:05.2 System peripheral: Intel Corporation Xeon E5/Core i7 Control Status and Global Errors (rev 07)
80:05.4 PIC: Intel Corporation Xeon E5/Core i7 I/O APIC (rev 07)
82:00.0 RAID bus controller: Adaptec Series 6 - 6G SAS/PCIe 2 (rev 01)
83:00.0 3D controller: NVIDIA Corporation GK110GL [Tesla K20m] (rev a1)
84:00.0 Co-processor: Intel Corporation Xeon Phi coprocessor SE10/7120 series (rev 20)
ff:08.0 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link 0 (rev 07)
ff:08.3 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link Reut 0 (rev 07)
ff:08.4 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link Reut 0 (rev 07)
ff:09.0 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link 1 (rev 07)
ff:09.3 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link Reut 1 (rev 07)
ff:09.4 System peripheral: Intel Corporation Xeon E5/Core i7 QPI Link Reut 1 (rev 07)
ff:0a.0 System peripheral: Intel Corporation Xeon E5/Core i7 Power Control Unit 0 (rev 07)
ff:0a.1 System peripheral: Intel Corporation Xeon E5/Core i7 Power Control Unit 1 (rev 07)
ff:0a.2 System peripheral: Intel Corporation Xeon E5/Core i7 Power Control Unit 2 (rev 07)
ff:0a.3 System peripheral: Intel Corporation Xeon E5/Core i7 Power Control Unit 3 (rev 07)
ff:0b.0 System peripheral: Intel Corporation Xeon E5/Core i7 Interrupt Control Registers (rev 07)
ff:0b.3 System peripheral: Intel Corporation Xeon E5/Core i7 Semaphore and Scratchpad Configuration Registers (rev 07)
ff:0c.0 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)
ff:0c.1 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)
ff:0c.2 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)
ff:0c.6 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller System Address Decoder 0 (rev 07)
ff:0c.7 System peripheral: Intel Corporation Xeon E5/Core i7 System Address Decoder (rev 07)
ff:0d.0 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)
ff:0d.1 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)
ff:0d.2 System peripheral: Intel Corporation Xeon E5/Core i7 Unicast Register 0 (rev 07)
ff:0d.6 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller System Address Decoder 1 (rev 07)
ff:0e.0 System peripheral: Intel Corporation Xeon E5/Core i7 Processor Home Agent (rev 07)
ff:0e.1 Performance counters: Intel Corporation Xeon E5/Core i7 Processor Home Agent Performance Monitoring (rev 07)
ff:0f.0 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Registers (rev 07)
ff:0f.1 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller RAS Registers (rev 07)
ff:0f.2 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 0 (rev 07)
ff:0f.3 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 1 (rev 07)
ff:0f.4 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 2 (rev 07)
ff:0f.5 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 3 (rev 07)
ff:0f.6 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Target Address Decoder 4 (rev 07)
ff:10.0 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 0 (rev 07)
ff:10.1 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 1 (rev 07)
ff:10.2 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 0 (rev 07)
ff:10.3 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 1 (rev 07)
ff:10.4 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 2 (rev 07)
ff:10.5 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller Channel 0-3 Thermal Control 3 (rev 07)
ff:10.6 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 2 (rev 07)
ff:10.7 System peripheral: Intel Corporation Xeon E5/Core i7 Integrated Memory Controller ERROR Registers 3 (rev 07)
ff:11.0 System peripheral: Intel Corporation Xeon E5/Core i7 DDRIO (rev 07)
ff:13.0 System peripheral: Intel Corporation Xeon E5/Core i7 R2PCIe (rev 07)
ff:13.1 Performance counters: Intel Corporation Xeon E5/Core i7 Ring to PCI Express Performance Monitor (rev 07)
ff:13.4 Performance counters: Intel Corporation Xeon E5/Core i7 QuickPath Interconnect Agent Ring Registers (rev 07)
ff:13.5 Performance counters: Intel Corporation Xeon E5/Core i7 Ring to QuickPath Interconnect Link 0 Performance Monitor (rev 07)
ff:13.6 System peripheral: Intel Corporation Xeon E5/Core i7 Ring to QuickPath Interconnect Link 1 Performance Monitor (rev 07)

> miccheck
MicCheck 3.8.2-1
Copyright (c) 2016, Intel Corporation.
Executing default tests for host
  Test 0: Check number of devices the OS sees in the system ... pass
  Test 1: Check mic driver is loaded ... pass
  Test 2: Check number of devices driver sees in the system ... pass
  Test 3: Check mpssd daemon is running ... pass
Executing default tests for device: 0
  Test 4 (mic0): Check device is in online state and its postcode is FF ... pass
  Test 5 (mic0): Check ras daemon is available in device ... pass
  Test 6 (mic0): Check running flash version is correct ... pass
  Test 7 (mic0): Check running SMC firmware version is correct ... pass
Status: OK

> micinfo
MicInfo Utility Log
Created Fri Nov 17 20:19:41 2017

System Info
  HOST OS : Linux
  OS Version : 4.4.4-1.el7.elrepo.x86_64
  Driver Version : 3.8.2-1
  MPSS Version : 3.8.2
  Host Physical Memory : 16009 MB

Device No: 0, Device Name: mic0

Version
  Flash Version : 2.1.02.0391
  SMC Firmware Version : 1.17.6900
  SMC Boot Loader Version : 1.8.4326
  Coprocessor OS Version : 2.6.38.8+mpss3.8.2
  Device Serial Number : ADKC32502182

Board
  Vendor ID : 0x8086
  Device ID : 0x225c
  Subsystem ID : 0x7d95
  Coprocessor Stepping ID : 2
  PCIe Width : x16
  PCIe Speed : 5 GT/s
  PCIe Max payload size : 256 bytes
  PCIe Max read req size : 512 bytes
  Coprocessor Model : 0x01
  Coprocessor Model Ext : 0x00
  Coprocessor Type : 0x00
  Coprocessor Family : 0x0b
  Coprocessor Family Ext : 0x00
  Coprocessor Stepping : C0
  Board SKU : C0PRQ-7120 P/A/X/D
  ECC Mode : Enabled
  SMC HW Revision : Product 300W Passive CS

Cores
  Total No of Active Cores : 61
  Voltage : 994000 uV
  Frequency : 1238095 kHz

Thermal
  Fan Speed Control : N/A
  Fan RPM : N/A
  Fan PWM : N/A
  Die Temp : 46 C

GDDR
  GDDR Vendor : Samsung
  GDDR Version : 0x6
  GDDR Density : 4096 Mb
  GDDR Size : 15872 MB
  GDDR Technology : GDDR5
  GDDR Speed : 5.500000 GT/s
  GDDR Frequency : 2750000 kHz
  GDDR Voltage : 1501000 uV
Now, from my Phi:
> reading /proc/cpuinfo
# One example:
processor : 0
vendor_id : GenuineIntel
cpu family : 11
model : 1
model name : 0b/01
stepping : 2
cpu MHz : 1238.094
cache size : 512 KB
physical id : 0
siblings : 244
core id : 60
cpu cores : 61
apicid : 240
initial apicid : 240
fpu : yes
fpu_exception : yes
cpuid level : 4
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr mca pat fxsr ht syscall nx lm nopl lahf_lm
bogomips : 2464.40
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
# Now listing every core, thread:
physical id : 0 siblings : 244 core id : 0 cpu cores : 61
physical id : 0 siblings : 244 core id : 0 cpu cores : 61
physical id : 0 siblings : 244 core id : 0 cpu cores : 61
physical id : 0 siblings : 244 core id : 0 cpu cores : 61
physical id : 0 siblings : 244 core id : 1 cpu cores : 61
physical id : 0 siblings : 244 core id : 1 cpu cores : 61
...
physical id : 0 siblings : 244 core id : 60 cpu cores : 61
physical id : 0 siblings : 244 core id : 60 cpu cores : 61

> lscpu
Architecture: k1om
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 244
On-line CPU(s) list: 0-243
Thread(s) per core: 4
Core(s) per socket: 61
Socket(s): 1
Vendor ID: GenuineIntel
CPU family: 11
Model: 1
Stepping: 2
CPU MHz: 1238.094
BogoMIPS: 2482.64
L1d cache: 32K
L1i cache: 32K
L2d cache: 512K

> uname -a
Linux domain.es 2.6.38.8+mpss3.8.2 #1 SMP Mon Apr 24 04:54:20 EDT 2017 k1om GNU/Linux

> free -m
                     total   used   free    shared   buffers   cached
Mem:                 15513   374    15138   0        0         127
-/+ buffers/cache:           246    15266
Swap:                0       0      0
- Tags:
- Cluster Computing
- Enterprise
- Intel® Many Integrated Core Architecture
- Linux*
- Optimization
- Parallel Computing
- Professors
- Unix*
- Vectorization
I've got similar hardware (a 7120P) over here and am getting 19.6 GFlops/s with turbo disabled and 21.1 GFlops/s with turbo enabled.
There are two major differences between your setup and the book (and most likely my setup):
- you're running EL7/CentOS7 with a 4.4.4 kernel; this is definitely not supported by Intel, but I doubt that it affects performance that much in this sample code
- you've compiled the code using ICC 2017, whereas the book used ICC 2013; I've used ICC 2015 myself to get the above results.
If you want I can send you my binary, or you can send me yours and I will run the code on my box to see how it compares.
Yes please, that would be great. Thank you for your answer.
I spent the whole weekend running tests, and nothing.
I am attaching the code and two binaries:
- hello-flops1.mic (vectorized, -O3, 2 flops per calculation, 1 thread, arrays aligned to 64)
- hello-flops1.1op.mic (vectorized, -O3, 1 flop per calculation, ONLY the addition, 1 thread, arrays aligned to 64)
To build either version, toggle the commented FLOPSPERCALC definition between 1 and 2.
It is weird; look at my results. It is as if FMA were not working, or as if the "null thread" did not behave as it does in the other cases...
I had also been wondering about the compiler; indeed, the book uses the 2013 version. Thanks. I will wait for your answer. Please try running those two versions with my binaries, then recompile my code and run the two versions again. Thank you in advance :) It would be fantastic if you could also attach your assembly, so that I can compare my assembly output with yours and see whether the problem is the compiler or the system (the 4.x kernel, as you pointed out).
*Question*: can I switch turbo on or off, or check whether it is enabled, from the OS? I don't currently have physical access to the node (but I will in a few days). In any case, this is an improvement to consider later.
Here are my notes:
Theory:
Peak = clock frequency * #cores * lanes * FMA FLOPs/cycle.
Everything is single-precision floating point, compiled with -O3.
In the book:
1.091 GHz * 61 cores * 16 lanes (512-bit vector / 32-bit floats) * 2 (FMA: fused multiply-add) = 2129.6 GFlops/s.
Per core: 34.912 GFlops/s.
He gets 17.206 GFlops/s on the Xeon Phi (64-byte aligned, vectorized, 2 FlopsPerCalc), while the theoretical per-core peak is 34.9 GFlops/s. The explanation (page 32): the coprocessor always schedules a new thread to execute on each clock cycle. If you run only one thread on a core, the scheduler uses a special "null thread" that does nothing, so every other cycle is skipped when only one thread is active.
Without vectorization he gets 0.195 GFlops/s, while the theory says 2.182 (1.091 * 61 * 1 * 2 / 61).
With 2 threads (OpenMP) he gets 34.453 GFlops/s: the coprocessor is working on every cycle.
In my case (7120P):
1.234 GHz * 16 lanes * 2 = 39.488 GFlops/s per core (* 61 cores = 2408.8 GFlops/s for the whole card).
Around 2.468 GFlops/s per core if not vectorized (but we get 0.202).
With 1 thread we get 9.544 GFlops/s: 24% of the per-core peak (worse than in the book, where he reaches about 49%).
With 2 or 4 threads we get around 39.3 GFlops/s, almost the per-core peak (he reaches 98.7%, we reach 99.5%).
Empirical results:
| Device | Aligned | Vectorized | FlopsPerCalc | GFlops/s |
|----------------|---------|------------|--------------|----------|
| Xeon E5-2620 | 64 | vec | 1 | 9.171 |
| Xeon E5-2620 | 64 | vec | 2 | 10.925 |
| Xeon E5-2620 | 64 | no-vec | 1 | 1.643 |
| Xeon E5-2620 | 64 | no-vec | 2 | 2.303 |
| Xeon Phi 7120P | 64 | vec | 1 | 9.840 |
| Xeon Phi 7120P | 64 | vec | 2 | 9.544 |
| Xeon Phi 7120P | 64 | no-vec | 1 | 0.409 |
| Xeon Phi 7120P | 64 | no-vec | 2 | 0.202 |

| Device | Aligned | FlopsPerCalc | GFlops/s |
|----------------|---------|--------------|----------|
| Xeon Phi 7120P | 8 | 1 | 9.804 |
| Xeon Phi 7120P | 8 | 2 | 3.871 |
| Xeon Phi 7120P | 16 | 1 | 9.840 |
| Xeon Phi 7120P | 16 | 2 | 3.874 |
| Xeon Phi 7120P | 32 | 1 | 9.841 |
| Xeon Phi 7120P | 32 | 2 | 3.887 |

| Device | Aligned | Threads | FlopsPerCalc | GFlops/s |
|----------------|---------|---------|--------------|----------|
| Xeon E5-2620 | 64 | 2 | 1 | 6.503 |
| Xeon E5-2620 | 64 | 2 | 2 | 11.004 |
| Xeon E5-2620 | 64 | 4 | 1 | 10.762 |
| Xeon E5-2620 | 64 | 4 | 2 | 19.521 |
| Xeon Phi 7120P | 64 | 2 | 1 | 31.490 |
| Xeon Phi 7120P | 64 | 2 | 2 | 39.365 |
| Xeon Phi 7120P | 64 | 4 | 1 | 31.459 |
| Xeon Phi 7120P | 64 | 4 | 2 | 39.245 |
I have checked the assembly and it does use FMA (`vfmadd213ps` at line 73, where the multiplication and addition occur).
Meanwhile, I am trying to find a download link for the Intel C/C++ Compiler/Composer 2013 or 2015. I could not find one on the web, so I have emailed Intel Support.
I am also attaching the assembly here, although I would rather review it myself once you have run the tests I asked for.
Thanks again.
DONE :)
I got the 2015 version, installed it this afternoon, and have been running tests. With the 2015 compiler I get 19.76 GFlops/s for the initial version (aligned to 64, 1 thread, 2 flops per calculation).
Now the question is: why does the 2015 compiler produce faster code than the 2016-2017 versions?