Latency and Throughput of Intel CPUs 'clflush' instruction - Page 2

SergeyKostrov · ‎09-23-2016

*** Latency and Throughput of Intel CPUs 'clflush' instruction ***

SergeyKostrov · ‎09-23-2016

[ Flush Cache intrinsics on AMD CPU architectures ] Not reviewed.

SergeyKostrov · ‎09-23-2016

[ Intel References ]

Intel 64 and IA-32 Architectures Optimization Reference Manual
Order Number: 248966-033
June 2016
...
Chapter: INSTRUCTION LATENCY AND THROUGHPUT
Table C-17. General Purpose Instructions ( Page C-17 )
...
CLFLUSH throughputs for different CPUs are ~2 to 50, ~3 to 50, ~3 to 50 and ~5 to 50 clock cycles.
...
...
Note 13 ( Page C-19 ):
...
CLFLUSH throughput is representative from clean cache lines for a range of buffer sizes. CLFLUSH throughput can decrease significantly by factors including: (a) the number of back-to-back CLFLUSH being executed, (b) flushing modified cache lines incurs additional cost than cache lines in other coherent state. See Section 7.4.6.
...
Section 7.4.6 CLFLUSH Instruction ( Page 7-9 ): It provides additional information for the instruction.

Intel 64 and IA-32 Architectures Software Developer's Manual Volume 2 ( 2A, 2B, 2C & 2D ): Instruction Set Reference, A-Z
Order Number: 325383-059US
June 2016
...
Chapter: INSTRUCTION SET REFERENCE, A-L
CLFLUSH - Flush Cache Line ( Page 3-140 vol. 2A )

A very important note is as follows:
... data can be speculatively loaded into a cache line just before, during, or after the execution of a CLFLUSH instruction that references the cache line ...

SergeyKostrov · ‎09-23-2016

[ External References ]

1. Coherence with Cached Memory-Mapped IO
   John D. McCalpin, Ph.D, 2013
   https://sites.utexas.edu/jdm4372/2013/05

SergeyKostrov · ‎09-25-2016

[ Command Line Options of C++ compilers ] Command Line Options of C++ compilers used in these performance evaluations will be provided.

SergeyKostrov · ‎09-25-2016

[ Borland C++ compiler v5.5.1 32-bit ] -d -O2 -w -D_WIN32_BCC -DNDEBUG -5 -nRelease -eBccTestApp.exe -I"C:\WorkLib\MKL\Include" -L"C:\WorkLib\MKL\Lib\Ia32Bcc" -lS:33554432 BccTestApp.cpp HrtALLib.asm

SergeyKostrov · ‎09-25-2016

[ MinGW C++ compiler v6.1.0 32-bit ] MgwTestApp.cpp -DNDEBUG -O3 -msse2 -mprfchw -ffast-math -fpeel-loops -ftree-vectorizer-verbose=0 -ftree-vectorize -fvect-cost-model -fomit-frame-pointer -flto -fwhole-program -fopenmp -w -I "C:/WorkLib/ICC2011/Composer XE/Mkl/Include" -B "../../AppsSca" "C:/WorkLib/ICC2011/Composer XE/Mkl/Lib/Ia32/mkl_rt.lib" -Xlinker --stack=67108864

SergeyKostrov · ‎09-25-2016

[ Microsoft C++ compiler ( VS2005 PE ) 32-bit ] [ Compiler ] /O2 /Ob1 /Oi /Ot /Oy /GL /I "..\..\Include" /D "WIN32" /D "_CONSOLE" /D "NDEBUG" /D "_WIN32_MSC" /D "_VC80_UPGRADE=0x0710" /D "_UNICODE" /D "UNICODE" /GF /Gm /MT /GS- /fp:fast /GR- /openmp /Yu"Stdphf.h" /Fp"Release\MscTestApp.pch" /Fo"Release/" /Fd"Release/" /W4 /nologo /c /Wp64 /Zi /Gd /TP /wd4005 /U "_WINCE_MSC" /U "WIN32_PLATFORM_PSPC" /U "WIN32_PLATFORM_WFSP" /U "WIN32_PLATFORM_WM50" /U "_WIN32_MGW" /U "_WIN32_BCC" /U "_COS16_TCC" /U "_WIN32_ICC" /U "_WIN32_WCC" /errorReport:prompt /arch:SSE2 [ Linker ] /OUT:"Release/MscTestApp.exe" /INCREMENTAL:NO /NOLOGO /MANIFEST /MANIFESTFILE:"Release\MscTestApp.exe.intermediate.manifest" /NODEFAULTLIB:"../../Bin/Release/ScaLib.lib" /SUBSYSTEM:CONSOLE /STACK:268435456 /LARGEADDRESSAWARE /LTCG /MACHINE:X86 /ERRORREPORT:PROMPT kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib "..\..\bin\release\scalib.lib"

SergeyKostrov · ‎09-25-2016

[ Intel C++ compiler v12.1.7 ( u371 ) 32-bit ] [ Compiler ] /c /O3 /Ob1 /Oi /Ot /Oy /Qipo /I "..\..\Include" /D "WIN32" /D "_CONSOLE" /D "NDEBUG" /D "_WIN32_ICC" /D "INTEL_SUITE_VERSION=PE121_300" /D "_VC80_UPGRADE=0x0710" /D "_UNICODE" /D "UNICODE" /GF /MT /GS- /fp:fast=2 /GR- /Yu"Stdphf.h" /Fp"Release\IccTestApp.pch" /Fo"Release/" /W5 /nologo /Wp64 /Zi /Gd /TP /Qdiag-disable:2012 /Qdiag-disable:2013 /Qdiag-disable:2014 /Qdiag-disable:2015 /Qdiag-disable:2017 /Qdiag-disable:2021 /Qdiag-disable:2022 /Qdiag-disable:2304 /U "_WIN32_MSC" /U "_WINCE_MSC" /U "WIN32_PLATFORM_PSPC" /U "WIN32_PLATFORM_WFSP" /U "WIN32_PLATFORM_WM50" /U "_WIN32_MGW" /U "_WIN32_BCC" /U "_COS16_TCC" /U "_WIN32_WCC" /Qopenmp /Qfp-speculation:fast /Qopt-matmul /Qparallel /Qstd=c++0x /Qrestrict /Qdiag-disable:111,673,10121 /Wport /Qeffc++ /QxSSE2 /Qansi-alias /Qvec-report=0 /Qfma /Qunroll:8 /Qunroll-aggressive /Qopt-streaming-stores:always /Qopt-block-factor:128 /Qopt-mem-layout-trans:2 /Wport /Qeffc++ /QxSSE2 /Qansi-alias /Qvec-report=0 /Qfma /Qunroll:8 /Qunroll-aggressive /Qopt-streaming-stores:always /Qopt-block-factor:128 /Qopt-mem-layout-trans:2 [ Linker ] kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib /OUT:"Release/IccTestApp.exe" /INCREMENTAL:NO /nologo /MANIFEST /MANIFESTFILE:"Release\IccTestApp.exe.intermediate.manifest" /NODEFAULTLIB:"../../Bin/Release/ScaLib.lib" /TLBID:1 /SUBSYSTEM:CONSOLE /STACK:268435456 /LARGEADDRESSAWARE /MACHINE:X86 /qdiag-disable:111,673,10121

SergeyKostrov · ‎09-25-2016

[ Watcom C++ compiler v2.0.0 32-bit ] WccTestApp.cpp -5r -fp5 -fpi87 -wx -d0 -s -oabil+mprt -xd -D_WIN32_WCC -DNDEBUG -feWccTestApp.exe -k268435456 -i"C:\WorkLib\ICC2011\Compos~1\Mkl\Include" -"libpath C:\WorkLib\ICC2011\Compos~1\Mkl\Lib\Ia32Wcc" -wcd=007 -wcd=008 -wcd=013 -wcd=014 -wcd=086 -wcd=188 -wcd=367 -wcd=368 -wcd=369 -wcd=387 -wcd=389 -wcd=549 -wcd=601 -wcd=628 -wcd=689 -wcd=716 -wcd=725 -wcd=726 -wcd=735

SergeyKostrov · ‎09-25-2016

Correction for Post #9

>>[ Run-Time testing - Extended Tracing - No ]
>>[ Borland C++ compiler ]
>>
>>...
>>...
>>
>>Note: This is the worst case and related to how CLFLUSH and RDTSC instructions are implemented in software.

The correct note is: This is the worst case and is related to how the CrtClflush and CrtRdtsc C functions are implemented in software.

SergeyKostrov · ‎09-27-2016

[ A workaround for Intel C++ compiler ]

The problem has two parts:
- The RDTSC instruction was not aligned on a 16-byte boundary for Intel C++ compiler
- Pipelining of a series of CLFLUSH instructions is affected when a MOV instruction is inserted after the 1st CLFLUSH instruction

By the way, the code in Watcom C++ compiler's binary is not aligned on a 16-byte boundary and it doesn't have any problems!

So, I decided to use a workaround by forcing an alignment on a 16-byte boundary ( _DEFAULT_CODEALIGN16 is a macro based on the _asm ALIGN 16 assembler directive ).

...
    _DEFAULT_CODEALIGN16;
    RTuint64 uiClock1 = CrtRdtsc();
    CrtClflush( &piAddress[0][0] );
    CrtClflush( &piAddress[1][0] );
    CrtClflush( &piAddress[2][0] );
    CrtClflush( &piAddress[3][0] );
    CrtClflush( &piAddress[4][0] );
    CrtClflush( &piAddress[5][0] );
    CrtClflush( &piAddress[6][0] );
    CrtClflush( &piAddress[7][0] );
    CrtClflush( &piAddress[8][0] );
    CrtClflush( &piAddress[9][0] );
    RTuint64 uiClock2 = CrtRdtsc();
...

Here are the statistics for the memory address of the 1st RDTSC instruction:

MSC - 00244490 % 0x10 = 0 - Aligned on 16-byte boundary? - Yes
ICC - 00403660 % 0x10 = 0 - Aligned on 16-byte boundary? - Yes
MGW - 00402490 % 0x10 = 0 - Aligned on 16-byte boundary? - Yes
BCC - 0040417A % 0x10 = 10 - Aligned on 16-byte boundary? - No
WCC - 00403791 % 0x10 = 11 - Aligned on 16-byte boundary? - No

[ Run-Time testing - Extended Tracing - No ]
[ Intel C++ compiler ]
...
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
[ CrtClflush ] - Executed in 12 clock cycles
...
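To illustrate the alignment check used in the statistics above ( address % 0x10 ), here is a minimal C sketch. The function names align16_remainder and is_aligned16 are hypothetical and are not part of the test application; they only show the arithmetic.

```c
#include <stdint.h>

/* Remainder of an address modulo 16 ( 0x10 ); 0 means 16-byte aligned.
   Hypothetical helper, not taken from the test application. */
unsigned align16_remainder( uintptr_t addr )
{
    return (unsigned)( addr % 0x10u );
}

/* Non-zero when the address is on a 16-byte boundary.
   Testing the low 4 bits is equivalent to addr % 0x10 == 0. */
int is_aligned16( uintptr_t addr )
{
    return ( addr & 0xFu ) == 0;
}
```

For example, align16_remainder( 0x0040417A ) gives 10 ( the BCC case above ) and is_aligned16( 0x00244490 ) is true ( the MSC case above ).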

SergeyKostrov · ‎09-27-2016

[ Intel C++ compiler New Features Request - Re-ordering statistics and control ]

- A Warning Message at /W5 level ( not at other levels ) needs to be displayed when there is a Re-ordering
- Introduce a '#pragma no-reordering' directive for a piece of critical code to prevent any Re-ordering
- A command line compiler option to control Re-ordering of instructions ( similar to Watcom C++ compiler option '-or', Re-order instructions to avoid stalls )

SergeyKostrov · ‎09-27-2016

[ MinGW C++ compiler command line options ]

For example, these are command line options for different types of Re-ordering supported by MinGW C++ compiler:

...
-Wreorder - Warn when the compiler reorders code ( in GCC this covers C++ member initializers listed out of declaration order, not instruction scheduling ).
-freorder-blocks - Reorder basic blocks to improve code placement.
-freorder-blocks-algorithm=[simple|stc] - Set the used basic block reordering algorithm.
-freorder-blocks-and-partition - Reorder basic blocks and partition into hot and cold sections.
-freorder-functions - Reorder functions to improve code placement.
-fprofile-reorder-functions - Enable function reordering that improves code placement.
-ftoplevel-reorder - Reorder top level functions, variables, and asms.
...

SergeyKostrov · ‎09-27-2016

[ A note from Intel software engineer ]

>>...
>>Compiler optimization may re-order instructions based on instruction latency/throughput targeting different micro-architecture.
>>...

I understand that, but my point is: Intel C++ compiler should give us greater control in cases similar to mine. If a Software Engineer has some specs and knows how some processing needs to be done ( its order, number of instructions, estimated number of clock cycles to complete the processing, etc ), then Intel C++ compiler should not interfere with the Software Engineer's code.

Of course, an implementation in assembler solves all these problems, but it is more time consuming to implement and it breaks portability of the C/C++ source code.

SergeyKostrov · ‎09-28-2016

Here are performance results when the Serial-Test-Case was converted to a 10-iteration For-Loop-Test-Case. Performance results from the best to the worst:

[ MinGW C++ compiler ]
...
[ Sub-Test002.21.B - Processing of 10 calls ] - Executed in 120 clock cycles
[ Sub-Test002.21.B - For-Loop Overhead ] - Executed in 84 clock cycles
[ Sub-Test002.21.B - CrtClflush ] - Executed in 3 clock cycles
...

[ Intel C++ compiler ]
...
[ Sub-Test002.21.B - Processing of 10 calls ] - Executed in 196 clock cycles
[ Sub-Test002.21.B - For-Loop Overhead ] - Executed in 152 clock cycles
[ Sub-Test002.21.B - CrtClflush ] - Executed in 4 clock cycles
...

[ Watcom C++ compiler ]
...
[ Sub-Test002.21.B - Processing of 10 calls ] - Executed in 212 clock cycles
[ Sub-Test002.21.B - For-Loop Overhead ] - Executed in 128 clock cycles
[ Sub-Test002.21.B - CrtClflush ] - Executed in 8 clock cycles
...

[ Microsoft C++ compiler ]
...
[ Sub-Test002.21.B - Processing of 10 calls ] - Executed in 192 clock cycles
[ Sub-Test002.21.B - For-Loop Overhead ] - Executed in 88 clock cycles
[ Sub-Test002.21.B - CrtClflush ] - Executed in 10 clock cycles
...

[ Borland C++ compiler ]
...
[ Sub-Test002.21.B - Processing of 10 calls ] - Executed in 964 clock cycles
[ Sub-Test002.21.B - For-Loop Overhead ] - Executed in 264 clock cycles
[ Sub-Test002.21.B - CrtClflush ] - Executed in 70 clock cycles
...

The results are very reproducible, and I see that with the MinGW and Intel C++ compilers I was able to achieve the lower-bound numbers for the CLFLUSH instruction stated by Intel in:

Intel 64 and IA-32 Architectures Optimization Reference Manual
Order Number: 248966-033
June 2016
...
Chapter: INSTRUCTION LATENCY AND THROUGHPUT
Table C-17. General Purpose Instructions ( Page C-17 )
...
CLFLUSH throughputs for different CPUs are ~2 to 50, ~3 to 50, ~3 to 50 and ~5 to 50 clock cycles.
...
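The per-call CrtClflush numbers above are consistent with subtracting the For-Loop Overhead from the total for 10 calls and dividing by 10 with integer truncation. The exact formula used inside the test application is not shown in this thread, so the C sketch below is an assumption, and the function name per_call_cycles is hypothetical.

```c
#include <stdint.h>

/* Per-call cost in clock cycles, assuming:
     ( total for the loop - empty-loop overhead ) / number of calls
   with integer ( truncating ) division.
   Returns 0 for degenerate inputs instead of dividing by zero
   or wrapping around on unsigned subtraction. */
uint32_t per_call_cycles( uint32_t total, uint32_t loop_overhead, uint32_t calls )
{
    if( calls == 0 || total < loop_overhead )
        return 0;
    return ( total - loop_overhead ) / calls;
}
```

With the MinGW numbers, per_call_cycles( 120, 84, 10 ) gives 3, and with the Borland numbers, per_call_cycles( 964, 264, 10 ) gives 70, matching the reported values.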

SergeyKostrov · ‎09-28-2016

The latency of the RDTSC instruction should not be taken into account when two RDTSC instructions are used to bracket a piece of processing. This is because the RDTSC instruction latency is a constant for a Processing Unit: the same number of micro-ops is executed by the Processing Unit in both cases, so the RDTSC latency cancels out. The RDTSC instruction, as you know, simply reads and returns the value of the Time Stamp Counter ( TSC ) of the Processing Unit.

In a general form, if

...
T1 = RDTSC()
Processing...
T2 = RDTSC()
...

then

Processing Completed in Clock Cycles
= ( T2 + RDTSCoverhead ) - ( T1 + RDTSCoverhead )
= ( T2 + RDTSCoverhead - T1 - RDTSCoverhead )
= ( T2 - T1 ).

The latency of the RDTSC instruction is not known since Intel does not release any information about it; take a look at the Intel 64 and IA-32 Architectures Optimization Reference Manual.
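The T1 / Processing / T2 pattern above can be sketched in C as follows. This is only a sketch for x86 targets: it assumes the compiler intrinsics __rdtsc and _mm_clflush ( GCC/Clang via x86intrin.h and emmintrin.h, MSVC via intrin.h ) as stand-ins for the CrtRdtsc and CrtClflush wrappers, whose implementations are not shown in this thread. The names tsc_elapsed and time_clflush_series are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#ifdef _MSC_VER
#include <intrin.h>     /* __rdtsc, _mm_clflush */
#else
#include <x86intrin.h>  /* __rdtsc */
#include <emmintrin.h>  /* _mm_clflush ( SSE2 ) */
#endif

/* Elapsed TSC ticks: ( T2 - T1 ), as in the formula above. */
uint64_t tsc_elapsed( uint64_t t1, uint64_t t2 )
{
    return t2 - t1;
}

/* Reads the TSC, issues 'count' CLFLUSH instructions on addresses
   'stride' bytes apart starting at 'base', reads the TSC again and
   returns the difference. The caller must own at least
   count * stride bytes at 'base'. The result also includes the
   loop overhead, which is not subtracted here. */
uint64_t time_clflush_series( const void *base, size_t count, size_t stride )
{
    const char *p = (const char *)base;
    size_t i;
    uint64_t t1, t2;

    t1 = __rdtsc();
    for( i = 0; i < count; ++i )
        _mm_clflush( p + i * stride );
    t2 = __rdtsc();

    return tsc_elapsed( t1, t2 );
}
```

A call such as time_clflush_series( buf, 10, 64 ) on a 640-byte buffer roughly corresponds to the 10-call test case, assuming 64-byte cache lines.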

SergeyKostrov · ‎09-28-2016

[ MinGW C++ compiler - 64-bit - Ivy Bridge ]
...
[ Sub-Test002.21.A - CrtClflush ] - Executed in 6 clock cycles
[ Sub-Test002.21.B - Processing of 10 calls ] - Executed in 44 clock cycles
[ Sub-Test002.21.B - For-Loop Overhead ] - Executed in 24 clock cycles
[ Sub-Test002.21.B - CrtClflush ] - Executed in 2 clock cycles
...

[ Intel C++ compiler - 64-bit - Ivy Bridge ]
...
[ Sub-Test002.21.A - CrtClflush ] - Executed in 6 clock cycles
[ Sub-Test002.21.B - Processing of 10 calls ] - Executed in 120 clock cycles
[ Sub-Test002.21.B - For-Loop Overhead ] - Executed in 100 clock cycles
[ Sub-Test002.21.B - CrtClflush ] - Executed in 2 clock cycles
...

[ Watcom C++ compiler - 32-bit - Ivy Bridge ]
...
[ Sub-Test002.21.A - CrtClflush ] - Executed in 7 clock cycles
[ Sub-Test002.21.B - Processing of 10 calls ] - Executed in 128 clock cycles
[ Sub-Test002.21.B - For-Loop Overhead ] - Executed in 92 clock cycles
[ Sub-Test002.21.B - CrtClflush ] - Executed in 3 clock cycles
...

[ Microsoft C++ compiler - 64-bit - Ivy Bridge ]
...
[ Sub-Test002.21.A - CrtClflush ] - Executed in 15 clock cycles
[ Sub-Test002.21.B - Processing of 10 calls ] - Executed in 108 clock cycles
[ Sub-Test002.21.B - For-Loop Overhead ] - Executed in 28 clock cycles
[ Sub-Test002.21.B - CrtClflush ] - Executed in 8 clock cycles
...

[ Borland C++ compiler - 64-bit - Ivy Bridge ]
...
[ Sub-Test002.21.A - CrtClflush ] - Executed in 85 clock cycles
[ Sub-Test002.21.B - Processing of 10 calls ] - Executed in 232 clock cycles
[ Sub-Test002.21.B - For-Loop Overhead ] - Executed in 144 clock cycles
[ Sub-Test002.21.B - CrtClflush ] - Executed in 8 clock cycles
...

Zirak · ‎11-13-2017

Hi Sergey,

Thank you for the study (measuring clflush latency). It is really interesting. However, given that HPCs are available, have you tried using other hardware events through PMCs to take this measurement? If so, which events are more effective for measuring clflush latency?