Oversubscription of OpenMP threads for processing small data sets

SergeyKostrov
Valued Contributor II
[ Abstract ] Oversubscription of OpenMP threads is not a new processing technique; it has been known for a long time. Are there any benefits to processing small data sets, which fit into the L3 or L2 cache, with more OpenMP threads than the multi-core CPU has hardware threads?
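To make the question concrete, here is a minimal sketch of what oversubscription looks like in code. It is not the IccTestApp source, which is not shown in this thread; the function names timedMultiply and hardwareThreads, the float element type and the static schedule are all assumptions made for illustration. The loop nest uses the IKJ order and requests a caller-chosen number of OpenMP threads, which may exceed the number of logical CPUs.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Number of logical CPUs visible to the OpenMP runtime
// ( 1 when the code is compiled without OpenMP support ).
int hardwareThreads()
{
#ifdef _OPENMP
    return omp_get_num_procs();
#else
    return 1;
#endif
}

// Hypothetical helper: multiplies two n x n row-major matrices ( IKJ order )
// with the requested number of OpenMP threads and returns elapsed seconds.
// Passing threads > hardwareThreads() oversubscribes the CPU.
double timedMultiply(int n, int threads,
                     const std::vector<float>& a,
                     const std::vector<float>& b,
                     std::vector<float>& c)
{
    const std::size_t count = static_cast<std::size_t>(n) * n;
    assert(n > 0 && threads > 0);
    assert(a.size() == count && b.size() == count && c.size() == count);
    (void)threads;  // unused when OpenMP is disabled

    std::fill(c.begin(), c.end(), 0.0f);
    const auto start = std::chrono::steady_clock::now();

    // Each thread owns a disjoint set of rows of C, so no synchronization
    // is needed inside the loop nest.
    #pragma omp parallel for num_threads(threads) schedule(static)
    for (int i = 0; i < n; ++i) {
        for (int k = 0; k < n; ++k) {
            const float aik = a[i * n + k];
            for (int j = 0; j < n; ++j)
                c[i * n + j] += aik * b[k * n + j];
        }
    }

    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling timedMultiply with n = 128 and threads = 1, 2, 4, 8 and 16 would follow the same pattern as the test cases in this thread; on a CPU with one logical processor every value above 1 is oversubscribed.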
SergeyKostrov
[ Computer Systems used for performance evaluations ]

** Dell Precision Mobile M4700 **
Intel Core i7-3840QM ( 2.80 GHz )
Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/products/70846
32GB RAM
320GB HDD
NVIDIA Quadro K1000M ( 192 CUDA cores / 2GB memory )
Windows 7 Professional 64-bit SP1
Size of L3 Cache = 8MB ( shared between all cores for data & instructions )
Size of L2 Cache = 1MB ( 256KB per core / shared for data & instructions )
Size of L1 Cache = 256KB ( 32KB per core for data & 32KB per core for instructions )
Display resolution: 1366 x 768

** Dell Dimension 4400 **
Intel Pentium 4 ( 1.60 GHz / 1 core )
1GB RAM
Seagate 20GB HDD ( * )
Seagate 3TB HDD ( ** )
EVGA GeForce 6200 512MB DDR2 AGP 8x Video Card
Windows XP Professional 32-bit SP3
Size of L2 Cache = 256KB
Size of L1 Cache = 8KB
Display resolution: 1440 x 990

( * ) Seagate Barracuda 20GB IDE Hard Disk Drive ST320011A 3.5" 7200 RPM 2MB Cache IDE Ultra ATA100 / ATA-iV/6
Average Rotational Latency : 4.17 ms
Average Seek Times Read : 9.0ms
Average Seek Times Write : 10.0ms
Maximum Internal Transfer Rate : 69.4MB/sec
Average External Transfer Rate : 100MB/sec ( Read and Write )
Maximum External Transfer Rate : 150MB/sec ( Read )
Note: Barracuda ATA IV Family

( ** ) Seagate Barracuda 3TB IDE Hard Disk Drive ST3000DM001 3.5" 7200 RPM 64MB Cache SATA III ( 6Gb/sec )
Average Rotational Latency : 4.16 ms
Average Seek Times Read : 8.5ms
Average Seek Times Write : 9.5ms
Maximum Internal Transfer Rate : 268MB/sec
Average External Transfer Rate : 156MB/sec ( Read and Write )
Maximum External Transfer Rate : 210MB/sec ( Read )
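The cache sizes listed above can be related to the matrix sizes used in the test cases ( 128 x 128, 256 x 256 and 512 x 512 ) with a little arithmetic. The sketch below is only an estimate under a stated assumption: the posts do not say which element type the matrices use, so a 4-byte float is assumed here, and the helper names workingSetBytes and fitsInCache are hypothetical. A product C = A * B touches three n x n matrices.

```cpp
#include <cstddef>

constexpr std::size_t KB = 1024;
constexpr std::size_t MB = 1024 * KB;

// Bytes touched by C = A * B for square n x n matrices ( three matrices ).
// elemSize is an ASSUMPTION ( 4 for float, 8 for double ); it is not
// stated anywhere in the thread.
constexpr std::size_t workingSetBytes(std::size_t n, std::size_t elemSize)
{
    return 3 * n * n * elemSize;
}

// True when the whole working set fits into a cache of cacheBytes.
constexpr bool fitsInCache(std::size_t n, std::size_t elemSize,
                           std::size_t cacheBytes)
{
    return workingSetBytes(n, elemSize) <= cacheBytes;
}

// With 4-byte elements:
static_assert(workingSetBytes(128, 4) == 192 * KB, "128 x 128 -> 192KB");
static_assert(workingSetBytes(256, 4) == 768 * KB, "256 x 256 -> 768KB");
static_assert(workingSetBytes(512, 4) == 3 * MB,   "512 x 512 -> 3MB");

// Against the caches listed above:
static_assert(fitsInCache(128, 4, 256 * KB),  "fits a 256KB L2");
static_assert(!fitsInCache(256, 4, 256 * KB), "exceeds a 256KB L2");
static_assert(fitsInCache(512, 4, 8 * MB),    "fits an 8MB L3");
```

If the matrices hold 8-byte doubles instead, every figure doubles, and the 128 x 128 case ( 384KB ) would no longer fit a 256KB L2 cache.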
SergeyKostrov
[ Intel C++ compiler v12.1.7 ( u371 ) 32-bit ]

[ Compiler ]
/c /O3 /Ob1 /Oi /Ot /Oy /Qipo /I "..\..\Include" /D "WIN32" /D "_CONSOLE" /D "NDEBUG" /D "_WIN32_ICC" /D "INTEL_SUITE_VERSION=PE121_300" /D "_VC80_UPGRADE=0x0710" /D "_UNICODE" /D "UNICODE" /GF /MT /GS- /fp:fast=2 /GR- /Yu"Stdphf.h" /Fp"Release\IccTestApp.pch" /Fo"Release/" /W5 /nologo /Wp64 /Zi /Gd /TP /Qdiag-disable:2012 /Qdiag-disable:2013 /Qdiag-disable:2014 /Qdiag-disable:2015 /Qdiag-disable:2017 /Qdiag-disable:2021 /Qdiag-disable:2022 /Qdiag-disable:2304 /U "_WIN32_MSC" /U "_WINCE_MSC" /U "WIN32_PLATFORM_PSPC" /U "WIN32_PLATFORM_WFSP" /U "WIN32_PLATFORM_WM50" /U "_WIN32_MGW" /U "_WIN32_BCC" /U "_COS16_TCC" /U "_WIN32_WCC" /Qopenmp /Qfp-speculation:fast /Qopt-matmul /Qparallel /Qstd=c++0x /Qrestrict /Qdiag-disable:111,673,10121 /Wport /Qeffc++ /QxSSE2 /Qansi-alias /Qvec-report=0 /Qfma /Qunroll:8 /Qunroll-aggressive /Qopt-streaming-stores:always /Qopt-block-factor:128 /Qopt-mem-layout-trans:2 /Wport /Qeffc++ /QxSSE2 /Qansi-alias /Qvec-report=0 /Qfma /Qunroll:8 /Qunroll-aggressive /Qopt-streaming-stores:always /Qopt-block-factor:128 /Qopt-mem-layout-trans:2

[ Linker ]
kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib /OUT:"Release/IccTestApp.exe" /INCREMENTAL:NO /nologo /MANIFEST /MANIFESTFILE:"Release\IccTestApp.exe.intermediate.manifest" /NODEFAULTLIB:"../../Bin/Release/ScaLib.lib" /TLBID:1 /SUBSYSTEM:CONSOLE /STACK:268435456 /LARGEADDRESSAWARE /MACHINE:X86 /qdiag-disable:111,673,10121
SergeyKostrov
[ Intel C++ compiler v13.1.0 ( u149 ) 64-bit ]

[ Compiler ]
/c /O3 /Ob1 /Oi /Ot /Qipo /I "..\..\Include" /I "C:\WorkLib\ICC2013\Composer XE 2013\ipp\include" /D "WIN32" /D "_CONSOLE" /D "NDEBUG" /D "_WIN32_ICC" /D "INTEL_SUITE_VERSION=PE130_149" /D "_IPP_PARALLEL_DYNAMIC" /D "IPP_USE_CUSTOM" /D "_VC80_UPGRADE=0x0710" /D "_UNICODE" /D "UNICODE" /GF /MT /GS- /arch:AVX /fp:fast=2 /GR- /Yu"Stdphf.h" /Fp"x64\Release\IccTestApp64.pch" /Fo"x64/Release/" /Fd"x64/Release/" /W5 /nologo /Wp64 /Zi /TP /U "_WIN32_MSC" /U "_WINCE_MSC" /U "WIN32_PLATFORM_PSPC" /U "WIN32_PLATFORM_WFSP" /U "WIN32_PLATFORM_WM50" /U "_WIN32_MGW" /U "_WIN32_BCC" /U "_COS16_TCC" /U "_WIN32_WCC" /Qopenmp /Qfp-speculation:fast /Qopt-matmul /Qstd=c++0x /Qrestrict /Qansi-alias /Qdiag-disable:111,673,2012,2015,2960,10121 /Wport /Qeffc++ /QxAVX /Qansi-alias /Qvec-report=0 /Qfma /Qunroll /Qunroll-aggressive /Qopt-streaming-stores:always /Qipp /Qipp-link:dynamic /Qmkl

[ Linker ]
kernel32.lib user32.lib gdi32.lib winspool.lib comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib uuid.lib odbc32.lib odbccp32.lib /OUT:"x64\Release/IccTestApp64.exe" /INCREMENTAL:NO /nologo /LIBPATH:"C:\WorkLib\ICC2013\Composer XE 2013\ipp\lib\intel64" /LIBPATH:"C:\WorkLib\ICC2013\Composer XE 2013\compiler\lib\intel64" /MANIFEST /MANIFESTFILE:"x64\Release\IccTestApp64.exe.intermediate.manifest" /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /NODEFAULTLIB:"../../Bin/Release/ScaLib64.lib" /TLBID:1 /SUBSYSTEM:CONSOLE /STACK:1000000000 /LARGEADDRESSAWARE /DYNAMICBASE /NXCOMPAT /MACHINE:X64 /qdiag-disable:111,673,2012,2015,2960,10121 /qdiag-sc-dir:"My Inspector XE Results - IccTestApp"
SergeyKostrov
[ List of Abbreviations ]
MM - Matrix Multiplication
C - Classic
LPS - Loop Processing Schema
1D - One Dimensional Input Matrices
2D - Two Dimensional Input Matrices
LB - Loop Blocking ( OT )
LBOT - Loop Blocking Optimization Technique
F - Fused ( OT )
T - Transposed ( OT )
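For readers unfamiliar with two of the terms above, here is a minimal sketch of a classic matrix multiplication that combines the IKJ Loop Processing Schema with Loop Blocking. It is an illustration only: the MxMult functions of IccTestApp are not shown in this thread, and the name multiplyBlockedIKJ, the float element type and the row-major 1D storage are assumptions. When the block size equals the matrix size, the blocked version degenerates to the plain IKJ loop nest.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: C = A * B for n x n row-major matrices.
// The three outer loops walk over block x block tiles; the three inner
// loops use the IKJ order, so the innermost loop reads B and writes C
// along contiguous rows.
void multiplyBlockedIKJ(int n, int block,
                        const std::vector<float>& a,
                        const std::vector<float>& b,
                        std::vector<float>& c)
{
    const std::size_t count = static_cast<std::size_t>(n) * n;
    assert(n > 0 && block > 0);
    assert(a.size() == count && b.size() == count && c.size() == count);

    std::fill(c.begin(), c.end(), 0.0f);

    for (int ii = 0; ii < n; ii += block) {
        const int iEnd = std::min(ii + block, n);
        for (int kk = 0; kk < n; kk += block) {
            const int kEnd = std::min(kk + block, n);
            for (int jj = 0; jj < n; jj += block) {
                const int jEnd = std::min(jj + block, n);
                // IKJ order inside the current tile.
                for (int i = ii; i < iEnd; ++i) {
                    for (int k = kk; k < kEnd; ++k) {
                        const float aik = a[i * n + k];
                        for (int j = jj; j < jEnd; ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

The Transposed and Fused variants named in the list are not sketched here, because the thread does not describe how they are implemented.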
SergeyKostrov
[ Test Case 1 - 32-bit / Cores: 1 / CPUs: 1 - Number of OpenMP threads - 1 ]
[ Matrix Size: 128 x 128 ]
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test1099 Start <
Matrix A, B and C Sizes : 128 x 128
Loop Processing Schema ( LPS ): IKJ
Loop Blocking Divider : 1
Sub-Test 1.1 - MxMultA1 - Classic 2D - LBOT size: N/A - Completed: 0.00147 secs
Sub-Test 1.2 - MxMultA2 - Classic 2D LBOT - LBOT size: 128x128 elements - Completed: 0.00170 secs
Sub-Test 1.3 - MxMultA3 - Classic 2D Fused - LBOT size: N/A - Completed: 0.00391 secs
Sub-Test 1.4 - MxMultA4 - Classic 2D Fused LBOT - LBOT size: 128x128 elements - Completed: 0.00391 secs
Sub-Test 2.1 - MxMultB1 - Classic 2D Transposed - LBOT size: N/A - Completed: 0.00172 secs
Sub-Test 2.2 - MxMultB2 - Classic 2D Transposed LBOT - LBOT size: 128x128 elements - Completed: 0.00195 secs
Sub-Test 2.3 - MxMultB3 - Classic 2D Fused Transposed - LBOT size: N/A - Completed: 0.00439 secs
Sub-Test 2.4 - MxMultB4 - Classic 2D Fused Transposed LBOT - LBOT size: 128x128 elements - Completed: 0.00416 secs
Sub-Test 5.1 - MxMultD1 - Classic 1D - LBOT size: N/A - Completed: 0.00903 secs
Sub-Test 5.2 - MxMultD2 - Classic 1D LBOT - LBOT size: 128x128 elements - Completed: 0.00269 secs
> Test1099 End <
Tests: Completed
SergeyKostrov
[ Test Case 2 - 32-bit / Cores: 1 / CPUs: 1 - Number of OpenMP threads - 2 ]
[ Matrix Size: 128 x 128 ]
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test1099 Start <
Matrix A, B and C Sizes : 128 x 128
Loop Processing Schema ( LPS ): IKJ
Loop Blocking Divider : 1
Sub-Test 1.1 - MxMultA1 - Classic 2D - LBOT size: N/A - Completed: 0.00147 secs
Sub-Test 1.2 - MxMultA2 - Classic 2D LBOT - LBOT size: 128x128 elements - Completed: 0.00170 secs
Sub-Test 1.3 - MxMultA3 - Classic 2D Fused - LBOT size: N/A - Completed: 0.00391 secs
Sub-Test 1.4 - MxMultA4 - Classic 2D Fused LBOT - LBOT size: 128x128 elements - Completed: 0.00441 secs
Sub-Test 2.1 - MxMultB1 - Classic 2D Transposed - LBOT size: N/A - Completed: 0.00170 secs
Sub-Test 2.2 - MxMultB2 - Classic 2D Transposed LBOT - LBOT size: 128x128 elements - Completed: 0.00195 secs
Sub-Test 2.3 - MxMultB3 - Classic 2D Fused Transposed - LBOT size: N/A - Completed: 0.00439 secs
Sub-Test 2.4 - MxMultB4 - Classic 2D Fused Transposed LBOT - LBOT size: 128x128 elements - Completed: 0.00439 secs
Sub-Test 5.1 - MxMultD1 - Classic 1D - LBOT size: N/A - Completed: 0.00903 secs
Sub-Test 5.2 - MxMultD2 - Classic 1D LBOT - LBOT size: 128x128 elements - Completed: 0.00269 secs
> Test1099 End <
Tests: Completed
SergeyKostrov
[ Test Case 3 - 32-bit / Cores: 1 / CPUs: 1 - Number of OpenMP threads - 4 ]
[ Matrix Size: 128 x 128 ]
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test1099 Start <
Matrix A, B and C Sizes : 128 x 128
Loop Processing Schema ( LPS ): IKJ
Loop Blocking Divider : 1
Sub-Test 1.1 - MxMultA1 - Classic 2D - LBOT size: N/A - Completed: 0.00170 secs
Sub-Test 1.2 - MxMultA2 - Classic 2D LBOT - LBOT size: 128x128 elements - Completed: 0.00172 secs
Sub-Test 1.3 - MxMultA3 - Classic 2D Fused - LBOT size: N/A - Completed: 0.00414 secs
Sub-Test 1.4 - MxMultA4 - Classic 2D Fused LBOT - LBOT size: 128x128 elements - Completed: 0.00391 secs
Sub-Test 2.1 - MxMultB1 - Classic 2D Transposed - LBOT size: N/A - Completed: 0.00170 secs
Sub-Test 2.2 - MxMultB2 - Classic 2D Transposed LBOT - LBOT size: 128x128 elements - Completed: 0.00195 secs
Sub-Test 2.3 - MxMultB3 - Classic 2D Fused Transposed - LBOT size: N/A - Completed: 0.00441 secs
Sub-Test 2.4 - MxMultB4 - Classic 2D Fused Transposed LBOT - LBOT size: 128x128 elements - Completed: 0.00463 secs
Sub-Test 5.1 - MxMultD1 - Classic 1D - LBOT size: N/A - Completed: 0.00903 secs
Sub-Test 5.2 - MxMultD2 - Classic 1D LBOT - LBOT size: 128x128 elements - Completed: 0.00269 secs
> Test1099 End <
Tests: Completed
SergeyKostrov
[ Test Case 4 - 32-bit / Cores: 1 / CPUs: 1 - Number of OpenMP threads - 8 ]
[ Matrix Size: 128 x 128 ]
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test1099 Start <
Matrix A, B and C Sizes : 128 x 128
Loop Processing Schema ( LPS ): IKJ
Loop Blocking Divider : 1
Sub-Test 1.1 - MxMultA1 - Classic 2D - LBOT size: N/A - Completed: 0.00170 secs
Sub-Test 1.2 - MxMultA2 - Classic 2D LBOT - LBOT size: 128x128 elements - Completed: 0.00172 secs
Sub-Test 1.3 - MxMultA3 - Classic 2D Fused - LBOT size: N/A - Completed: 0.00391 secs
Sub-Test 1.4 - MxMultA4 - Classic 2D Fused LBOT - LBOT size: 128x128 elements - Completed: 0.00464 secs
Sub-Test 2.1 - MxMultB1 - Classic 2D Transposed - LBOT size: N/A - Completed: 0.00195 secs
Sub-Test 2.2 - MxMultB2 - Classic 2D Transposed LBOT - LBOT size: 128x128 elements - Completed: 0.00220 secs
Sub-Test 2.3 - MxMultB3 - Classic 2D Fused Transposed - LBOT size: N/A - Completed: 0.00439 secs
Sub-Test 2.4 - MxMultB4 - Classic 2D Fused Transposed LBOT - LBOT size: 128x128 elements - Completed: 0.00464 secs
Sub-Test 5.1 - MxMultD1 - Classic 1D - LBOT size: N/A - Completed: 0.00928 secs
Sub-Test 5.2 - MxMultD2 - Classic 1D LBOT - LBOT size: 128x128 elements - Completed: 0.00269 secs
> Test1099 End <
Tests: Completed
SergeyKostrov
[ Test Case 5 - 32-bit / Cores: 1 / CPUs: 1 - Number of OpenMP threads - 16 ]
[ Matrix Size: 128 x 128 ]
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test1099 Start <
Matrix A, B and C Sizes : 128 x 128
Loop Processing Schema ( LPS ): IKJ
Loop Blocking Divider : 1
Sub-Test 1.1 - MxMultA1 - Classic 2D - LBOT size: N/A - Completed: 0.00184 secs
Sub-Test 1.2 - MxMultA2 - Classic 2D LBOT - LBOT size: 128x128 elements - Completed: 0.00195 secs
Sub-Test 1.3 - MxMultA3 - Classic 2D Fused - LBOT size: N/A - Completed: 0.00537 secs
Sub-Test 1.4 - MxMultA4 - Classic 2D Fused LBOT - LBOT size: 128x128 elements - Completed: 0.00415 secs
Sub-Test 2.1 - MxMultB1 - Classic 2D Transposed - LBOT size: N/A - Completed: 0.00305 secs
Sub-Test 2.2 - MxMultB2 - Classic 2D Transposed LBOT - LBOT size: 128x128 elements - Completed: 0.00220 secs
Sub-Test 2.3 - MxMultB3 - Classic 2D Fused Transposed - LBOT size: N/A - Completed: 0.00488 secs
Sub-Test 2.4 - MxMultB4 - Classic 2D Fused Transposed LBOT - LBOT size: 128x128 elements - Completed: 0.00464 secs
Sub-Test 5.1 - MxMultD1 - Classic 1D - LBOT size: N/A - Completed: 0.00916 secs
Sub-Test 5.2 - MxMultD2 - Classic 1D LBOT - LBOT size: 128x128 elements - Completed: 0.00280 secs
> Test1099 End <
Tests: Completed
SergeyKostrov
[ Test Case 6 - 32-bit / Cores: 1 / CPUs: 1 - Number of OpenMP threads - 1 ]
[ Matrix Size: 256 x 256 ]
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test1099 Start <
Matrix A, B and C Sizes : 256 x 256
Loop Processing Schema ( LPS ): IKJ
Loop Blocking Divider : 1
Sub-Test 1.1 - MxMultA1 - Classic 2D - LBOT size: N/A - Completed: 0.01172 secs
Sub-Test 1.2 - MxMultA2 - Classic 2D LBOT - LBOT size: 256x256 elements - Completed: 0.01172 secs
Sub-Test 1.3 - MxMultA3 - Classic 2D Fused - LBOT size: N/A - Completed: 0.03392 secs
Sub-Test 1.4 - MxMultA4 - Classic 2D Fused LBOT - LBOT size: 256x256 elements - Completed: 0.03589 secs
Sub-Test 2.1 - MxMultB1 - Classic 2D Transposed - LBOT size: N/A - Completed: 0.01197 secs
Sub-Test 2.2 - MxMultB2 - Classic 2D Transposed LBOT - LBOT size: 256x256 elements - Completed: 0.01245 secs
Sub-Test 2.3 - MxMultB3 - Classic 2D Fused Transposed - LBOT size: N/A - Completed: 0.05053 secs
Sub-Test 2.4 - MxMultB4 - Classic 2D Fused Transposed LBOT - LBOT size: 256x256 elements - Completed: 0.05055 secs
Sub-Test 5.1 - MxMultD1 - Classic 1D - LBOT size: N/A - Completed: 0.07006 secs
Sub-Test 5.2 - MxMultD2 - Classic 1D LBOT - LBOT size: 256x256 elements - Completed: 0.02514 secs
> Test1099 End <
Tests: Completed
SergeyKostrov
[ Test Case 7 - 32-bit / Cores: 1 / CPUs: 1 - Number of OpenMP threads - 2 ]
[ Matrix Size: 256 x 256 ]
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test1099 Start <
Matrix A, B and C Sizes : 256 x 256
Loop Processing Schema ( LPS ): IKJ
Loop Blocking Divider : 1
Sub-Test 1.1 - MxMultA1 - Classic 2D - LBOT size: N/A - Completed: 0.01172 secs
Sub-Test 1.2 - MxMultA2 - Classic 2D LBOT - LBOT size: 256x256 elements - Completed: 0.01172 secs
Sub-Test 1.3 - MxMultA3 - Classic 2D Fused - LBOT size: N/A - Completed: 0.03492 secs
Sub-Test 1.4 - MxMultA4 - Classic 2D Fused LBOT - LBOT size: 256x256 elements - Completed: 0.03711 secs
Sub-Test 2.1 - MxMultB1 - Classic 2D Transposed - LBOT size: N/A - Completed: 0.01197 secs
Sub-Test 2.2 - MxMultB2 - Classic 2D Transposed LBOT - LBOT size: 256x256 elements - Completed: 0.01245 secs
Sub-Test 2.3 - MxMultB3 - Classic 2D Fused Transposed - LBOT size: N/A - Completed: 0.05127 secs
Sub-Test 2.4 - MxMultB4 - Classic 2D Fused Transposed LBOT - LBOT size: 256x256 elements - Completed: 0.05127 secs
Sub-Test 5.1 - MxMultD1 - Classic 1D - LBOT size: N/A - Completed: 0.07008 secs
Sub-Test 5.2 - MxMultD2 - Classic 1D LBOT - LBOT size: 256x256 elements - Completed: 0.02514 secs
> Test1099 End <
Tests: Completed
SergeyKostrov
[ Test Case 8 - 32-bit / Cores: 1 / CPUs: 1 - Number of OpenMP threads - 4 ]
[ Matrix Size: 256 x 256 ]
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test1099 Start <
Matrix A, B and C Sizes : 256 x 256
Loop Processing Schema ( LPS ): IKJ
Loop Blocking Divider : 1
Sub-Test 1.1 - MxMultA1 - Classic 2D - LBOT size: N/A - Completed: 0.01197 secs
Sub-Test 1.2 - MxMultA2 - Classic 2D LBOT - LBOT size: 256x256 elements - Completed: 0.01195 secs
Sub-Test 1.3 - MxMultA3 - Classic 2D Fused - LBOT size: N/A - Completed: 0.03516 secs
Sub-Test 1.4 - MxMultA4 - Classic 2D Fused LBOT - LBOT size: 256x256 elements - Completed: 0.03736 secs
Sub-Test 2.1 - MxMultB1 - Classic 2D Transposed - LBOT size: N/A - Completed: 0.01220 secs
Sub-Test 2.2 - MxMultB2 - Classic 2D Transposed LBOT - LBOT size: 256x256 elements - Completed: 0.01270 secs
Sub-Test 2.3 - MxMultB3 - Classic 2D Fused Transposed - LBOT size: N/A - Completed: 0.05127 secs
Sub-Test 2.4 - MxMultB4 - Classic 2D Fused Transposed LBOT - LBOT size: 256x256 elements - Completed: 0.05127 secs
Sub-Test 5.1 - MxMultD1 - Classic 1D - LBOT size: N/A - Completed: 0.07031 secs
Sub-Test 5.2 - MxMultD2 - Classic 1D LBOT - LBOT size: 256x256 elements - Completed: 0.02564 secs
> Test1099 End <
Tests: Completed
SergeyKostrov
[ Test Case 9 - 32-bit / Cores: 1 / CPUs: 1 - Number of OpenMP threads - 8 ]
[ Matrix Size: 256 x 256 ]
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test1099 Start <
Matrix A, B and C Sizes : 256 x 256
Loop Processing Schema ( LPS ): IKJ
Loop Blocking Divider : 1
Sub-Test 1.1 - MxMultA1 - Classic 2D - LBOT size: N/A - Completed: 0.01270 secs
Sub-Test 1.2 - MxMultA2 - Classic 2D LBOT - LBOT size: 256x256 elements - Completed: 0.01245 secs
Sub-Test 1.3 - MxMultA3 - Classic 2D Fused - LBOT size: N/A - Completed: 0.03614 secs
Sub-Test 1.4 - MxMultA4 - Classic 2D Fused LBOT - LBOT size: 256x256 elements - Completed: 0.03784 secs
Sub-Test 2.1 - MxMultB1 - Classic 2D Transposed - LBOT size: N/A - Completed: 0.01270 secs
Sub-Test 2.2 - MxMultB2 - Classic 2D Transposed LBOT - LBOT size: 256x256 elements - Completed: 0.01319 secs
Sub-Test 2.3 - MxMultB3 - Classic 2D Fused Transposed - LBOT size: N/A - Completed: 0.05248 secs
Sub-Test 2.4 - MxMultB4 - Classic 2D Fused Transposed LBOT - LBOT size: 256x256 elements - Completed: 0.05273 secs
Sub-Test 5.1 - MxMultD1 - Classic 1D - LBOT size: N/A - Completed: 0.07006 secs
Sub-Test 5.2 - MxMultD2 - Classic 1D LBOT - LBOT size: 256x256 elements - Completed: 0.02539 secs
> Test1099 End <
Tests: Completed
SergeyKostrov
[ Test Case 10 - 32-bit / Cores: 1 / CPUs: 1 - Number of OpenMP threads - 16 ]
[ Matrix Size: 256 x 256 ]
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test1099 Start <
Matrix A, B and C Sizes : 256 x 256
Loop Processing Schema ( LPS ): IKJ
Loop Blocking Divider : 1
Sub-Test 1.1 - MxMultA1 - Classic 2D - LBOT size: N/A - Completed: 0.01416 secs
Sub-Test 1.2 - MxMultA2 - Classic 2D LBOT - LBOT size: 256x256 elements - Completed: 0.01416 secs
Sub-Test 1.3 - MxMultA3 - Classic 2D Fused - LBOT size: N/A - Completed: 0.03833 secs
Sub-Test 1.4 - MxMultA4 - Classic 2D Fused LBOT - LBOT size: 256x256 elements - Completed: 0.04102 secs
Sub-Test 2.1 - MxMultB1 - Classic 2D Transposed - LBOT size: N/A - Completed: 0.01392 secs
Sub-Test 2.2 - MxMultB2 - Classic 2D Transposed LBOT - LBOT size: 256x256 elements - Completed: 0.01416 secs
Sub-Test 2.3 - MxMultB3 - Classic 2D Fused Transposed - LBOT size: N/A - Completed: 0.05517 secs
Sub-Test 2.4 - MxMultB4 - Classic 2D Fused Transposed LBOT - LBOT size: 256x256 elements - Completed: 0.05519 secs
Sub-Test 5.1 - MxMultD1 - Classic 1D - LBOT size: N/A - Completed: 0.07103 secs
Sub-Test 5.2 - MxMultD2 - Classic 1D LBOT - LBOT size: 256x256 elements - Completed: 0.02613 secs
> Test1099 End <
Tests: Completed
SergeyKostrov
[ Test Case 11 - 32-bit / Cores: 1 / CPUs: 1 - Number of OpenMP threads - 1 ]
[ Matrix Size: 512 x 512 ]
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test1099 Start <
Matrix A, B and C Sizes : 512 x 512
Loop Processing Schema ( LPS ): IKJ
Loop Blocking Divider : 1
Sub-Test 1.1 - MxMultA1 - Classic 2D - LBOT size: N/A - Completed: 0.08741 secs
Sub-Test 1.2 - MxMultA2 - Classic 2D LBOT - LBOT size: 512x512 elements - Completed: 0.08738 secs
Sub-Test 1.3 - MxMultA3 - Classic 2D Fused - LBOT size: N/A - Completed: 0.33644 secs
Sub-Test 1.4 - MxMultA4 - Classic 2D Fused LBOT - LBOT size: 512x512 elements - Completed: 0.34522 secs
Sub-Test 2.1 - MxMultB1 - Classic 2D Transposed - LBOT size: N/A - Completed: 0.13475 secs
Sub-Test 2.2 - MxMultB2 - Classic 2D Transposed LBOT - LBOT size: 512x512 elements - Completed: 0.14013 secs
Sub-Test 2.3 - MxMultB3 - Classic 2D Fused Transposed - LBOT size: N/A - Completed: 0.42044 secs
Sub-Test 2.4 - MxMultB4 - Classic 2D Fused Transposed LBOT - LBOT size: 512x512 elements - Completed: 0.41991 secs
Sub-Test 5.1 - MxMultD1 - Classic 1D - LBOT size: N/A - Completed: 0.56444 secs
Sub-Test 5.2 - MxMultD2 - Classic 1D LBOT - LBOT size: 512x512 elements - Completed: 0.30078 secs
> Test1099 End <
Tests: Completed
SergeyKostrov
[ Test Case 12 - 32-bit / Cores: 1 / CPUs: 1 - Number of OpenMP threads - 2 ]
[ Matrix Size: 512 x 512 ]
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test1099 Start <
Matrix A, B and C Sizes : 512 x 512
Loop Processing Schema ( LPS ): IKJ
Loop Blocking Divider : 1
Sub-Test 1.1 - MxMultA1 - Classic 2D - LBOT size: N/A - Completed: 0.13088 secs
Sub-Test 1.2 - MxMultA2 - Classic 2D LBOT - LBOT size: 512x512 elements - Completed: 0.13084 secs
Sub-Test 1.3 - MxMultA3 - Classic 2D Fused - LBOT size: N/A - Completed: 0.33644 secs
Sub-Test 1.4 - MxMultA4 - Classic 2D Fused LBOT - LBOT size: 512x512 elements - Completed: 0.34472 secs
Sub-Test 2.1 - MxMultB1 - Classic 2D Transposed - LBOT size: N/A - Completed: 0.13475 secs
Sub-Test 2.2 - MxMultB2 - Classic 2D Transposed LBOT - LBOT size: 512x512 elements - Completed: 0.14016 secs
Sub-Test 2.3 - MxMultB3 - Classic 2D Fused Transposed - LBOT size: N/A - Completed: 0.42138 secs
Sub-Test 2.4 - MxMultB4 - Classic 2D Fused Transposed LBOT - LBOT size: 512x512 elements - Completed: 0.42091 secs
Sub-Test 5.1 - MxMultD1 - Classic 1D - LBOT size: N/A - Completed: 0.56444 secs
Sub-Test 5.2 - MxMultD2 - Classic 1D LBOT - LBOT size: 512x512 elements - Completed: 0.30028 secs
> Test1099 End <
Tests: Completed
SergeyKostrov
[ Test Case 13 - 32-bit / Cores: 1 / CPUs: 1 - Number of OpenMP threads - 4 ]
[ Matrix Size: 512 x 512 ]
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test1099 Start <
Matrix A, B and C Sizes : 512 x 512
Loop Processing Schema ( LPS ): IKJ
Loop Blocking Divider : 1
Sub-Test 1.1 - MxMultA1 - Classic 2D - LBOT size: N/A - Completed: 0.13037 secs
Sub-Test 1.2 - MxMultA2 - Classic 2D LBOT - LBOT size: 512x512 elements - Completed: 0.13134 secs
Sub-Test 1.3 - MxMultA3 - Classic 2D Fused - LBOT size: N/A - Completed: 0.33303 secs
Sub-Test 1.4 - MxMultA4 - Classic 2D Fused LBOT - LBOT size: 512x512 elements - Completed: 0.34081 secs
Sub-Test 2.1 - MxMultB1 - Classic 2D Transposed - LBOT size: N/A - Completed: 0.13575 secs
Sub-Test 2.2 - MxMultB2 - Classic 2D Transposed LBOT - LBOT size: 512x512 elements - Completed: 0.14013 secs
Sub-Test 2.3 - MxMultB3 - Classic 2D Fused Transposed - LBOT size: N/A - Completed: 0.41797 secs
Sub-Test 2.4 - MxMultB4 - Classic 2D Fused Transposed LBOT - LBOT size: 512x512 elements - Completed: 0.41797 secs
Sub-Test 5.1 - MxMultD1 - Classic 1D - LBOT size: N/A - Completed: 0.56541 secs
Sub-Test 5.2 - MxMultD2 - Classic 1D LBOT - LBOT size: 512x512 elements - Completed: 0.30078 secs
> Test1099 End <
Tests: Completed
SergeyKostrov
[ Test Case 14 - 32-bit / Cores: 1 / CPUs: 1 - Number of OpenMP threads - 8 ]
[ Matrix Size: 512 x 512 ]
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test1099 Start <
Matrix A, B and C Sizes : 512 x 512
Loop Processing Schema ( LPS ): IKJ
Loop Blocking Divider : 1
Sub-Test 1.1 - MxMultA1 - Classic 2D - LBOT size: N/A - Completed: 0.13328 secs
Sub-Test 1.2 - MxMultA2 - Classic 2D LBOT - LBOT size: 512x512 elements - Completed: 0.13428 secs
Sub-Test 1.3 - MxMultA3 - Classic 2D Fused - LBOT size: N/A - Completed: 0.33447 secs
Sub-Test 1.4 - MxMultA4 - Classic 2D Fused LBOT - LBOT size: 512x512 elements - Completed: 0.34325 secs
Sub-Test 2.1 - MxMultB1 - Classic 2D Transposed - LBOT size: N/A - Completed: 0.13819 secs
Sub-Test 2.2 - MxMultB2 - Classic 2D Transposed LBOT - LBOT size: 512x512 elements - Completed: 0.14306 secs
Sub-Test 2.3 - MxMultB3 - Classic 2D Fused Transposed - LBOT size: N/A - Completed: 0.42041 secs
Sub-Test 2.4 - MxMultB4 - Classic 2D Fused Transposed LBOT - LBOT size: 512x512 elements - Completed: 0.41994 secs
Sub-Test 5.1 - MxMultD1 - Classic 1D - LBOT size: N/A - Completed: 0.56397 secs
Sub-Test 5.2 - MxMultD2 - Classic 1D LBOT - LBOT size: 512x512 elements - Completed: 0.30078 secs
> Test1099 End <
Tests: Completed
SergeyKostrov
[ Test Case 15 - 32-bit / Cores: 1 / CPUs: 1 - Number of OpenMP threads - 16 ]
[ Matrix Size: 512 x 512 ]
Application - IccTestApp - WIN32_ICC ( 32-bit ) - Release
Tests: Start
> Test1099 Start <
Matrix A, B and C Sizes : 512 x 512
Loop Processing Schema ( LPS ): IKJ
Loop Blocking Divider : 1
Sub-Test 1.1 - MxMultA1 - Classic 2D - LBOT size: N/A - Completed: 0.13475 secs
Sub-Test 1.2 - MxMultA2 - Classic 2D LBOT - LBOT size: 512x512 elements - Completed: 0.13575 secs
Sub-Test 1.3 - MxMultA3 - Classic 2D Fused - LBOT size: N/A - Completed: 0.33056 secs
Sub-Test 1.4 - MxMultA4 - Classic 2D Fused LBOT - LBOT size: 512x512 elements - Completed: 0.33788 secs
Sub-Test 2.1 - MxMultB1 - Classic 2D Transposed - LBOT size: N/A - Completed: 0.13869 secs
Sub-Test 2.2 - MxMultB2 - Classic 2D Transposed LBOT - LBOT size: 512x512 elements - Completed: 0.14306 secs
Sub-Test 2.3 - MxMultB3 - Classic 2D Fused Transposed - LBOT size: N/A - Completed: 0.41897 secs
Sub-Test 2.4 - MxMultB4 - Classic 2D Fused Transposed LBOT - LBOT size: 512x512 elements - Completed: 0.41897 secs
Sub-Test 5.1 - MxMultD1 - Classic 1D - LBOT size: N/A - Completed: 0.56447 secs
Sub-Test 5.2 - MxMultD2 - Classic 1D LBOT - LBOT size: 512x512 elements - Completed: 0.30078 secs
> Test1099 End <
Tests: Completed
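One way to read the logs above is to express each timing relative to the 1-thread run of the same sub-test. The sketch below does that for Sub-Test 1.1 ( MxMultA1 - Classic 2D ) at 128 x 128, using the numbers reported in Test Cases 1 to 5. The struct Timing, the table name and the helper relativeTime are hypothetical names introduced only for this illustration, and single runs like these say nothing about run-to-run variation.

```cpp
#include <cstdio>

struct Timing {
    int    threads;  // Number of OpenMP threads
    double seconds;  // "Completed" time from the log
};

// Sub-Test 1.1 ( MxMultA1 - Classic 2D ), 128 x 128,
// copied from Test Cases 1 to 5.
constexpr Timing kSubTest11_128[] = {
    {1, 0.00147}, {2, 0.00147}, {4, 0.00170}, {8, 0.00170}, {16, 0.00184}
};

// Ratio of a timing to the baseline; values above 1.0 mean slower.
constexpr double relativeTime(double baselineSeconds, double seconds)
{
    return seconds / baselineSeconds;
}

// Prints every entry of the table relative to the 1-thread run.
void printRelativeTimes()
{
    const double baseline = kSubTest11_128[0].seconds;
    for (const Timing& t : kSubTest11_128) {
        std::printf("%2d threads: %.5f secs ( x%.2f of the 1-thread time )\n",
                    t.threads, t.seconds, relativeTime(baseline, t.seconds));
    }
}
```

For this sub-test the 16-thread run takes about 1.25 times as long as the 1-thread run ( 0.00184 / 0.00147 ).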