
Optimal Number of OpenMP threads for Memory Bound processing

SergeyKostrov
Valued Contributor II
592 Views
[ Abstract ] Here is a question: how many OpenMP threads should be used for memory-bound processing? I've completed some tests, and it is clear that for memory-bound processing the optimal number of threads should not exceed the number of hardware threads of the CPU, or should be equal to the number of memory channels the CPU supports ( just 2 for most Intel desktop and mobile CPUs ).
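The kernel behind these tests is not shown in this thread, so the following is only a minimal sketch of how such a measurement could be set up. The function name `timed_scale` and the scaling loop are assumptions for illustration, not the ALGORITHM_PREFETCH code: the loop does one load and one store per element with almost no arithmetic, so memory bandwidth is the likely bottleneck.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical streaming kernel (not the ALGORITHM_PREFETCH code from the thread).
// Scales every element in place and returns the elapsed wall-clock time in seconds.
double timed_scale(std::vector<float>& data, float factor, int num_threads)
{
    assert(num_threads >= 1);
    const long long n = static_cast<long long>(data.size());
    float* p = data.data();

    const auto start = std::chrono::steady_clock::now();
    // The pragma is ignored (the loop runs serially) when OpenMP is not enabled.
    #pragma omp parallel for num_threads(num_threads) schedule(static)
    for (long long i = 0; i < n; ++i) {
        p[i] *= factor;
    }
    const auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double>(stop - start).count();
}
```

Calling `timed_scale` on the same large array with 1, 2, 4 and 8 threads, and keeping the best time of several runs for each count, would give a table of the same shape as the results in the replies.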
7 Replies
SergeyKostrov
[ Computer System used for performance evaluations ]
** Dell Precision Mobile M4700 **
Intel Core i7-3840QM ( 2.80 GHz ) Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/products/70846
32GB RAM
320GB HDD
NVIDIA Quadro K1000M ( 192 CUDA cores / 2GB memory )
Windows 7 Professional 64-bit SP1
Size of L3 Cache = 8MB ( shared between all cores for data & instructions )
Size of L2 Cache = 1MB ( 256KB per core / shared for data & instructions )
Size of L1 Cache = 256KB ( 32KB per core for data & 32KB per core for instructions )
Display resolution: 1366 x 768
SergeyKostrov
[ Microsoft C++ compiler ( 64-bit ) ]
Data Set Size : 7,969,177,600 SP elements ( Memory Allocated: 29.6875 GB )
Number of Iterations: 1
Number of Tests : 4
Number of Threads : 1
ALGORITHM_PREFETCH: 3.47475 secs
Number of Threads : 2
ALGORITHM_PREFETCH: 2.78875 secs
Number of Threads : 4
ALGORITHM_PREFETCH: 2.71450 secs
Number of Threads : 8
ALGORITHM_PREFETCH: 2.87825 secs
SergeyKostrov
[ MinGW C++ compiler ( 64-bit ) ]
Data Set Size : 7,969,177,600 SP elements ( Memory Allocated: 29.6875 GB )
Number of Iterations: 1
Number of Tests : 4
Number of Threads : 1
ALGORITHM_PREFETCH: 3.63450 secs
Number of Threads : 2
ALGORITHM_PREFETCH: 2.82350 secs
Number of Threads : 4
ALGORITHM_PREFETCH: 2.71825 secs
Number of Threads : 8
ALGORITHM_PREFETCH: 2.87450 secs
SergeyKostrov
[ Intel C++ compiler ( 64-bit ) ]
Data Set Size : 7,969,177,600 SP elements ( Memory Allocated: 29.6875 GB )
Number of Iterations: 1
Number of Tests : 4
Number of Threads : 1
ALGORITHM_PREFETCH: 3.77525 secs
Number of Threads : 2
ALGORITHM_PREFETCH: 2.84325 secs
Number of Threads : 4
ALGORITHM_PREFETCH: 2.74575 secs
Number of Threads : 8
ALGORITHM_PREFETCH: 2.90950 secs
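The reported numbers can be cross-checked with a little arithmetic. The sketch below uses two hypothetical helpers, `allocated_gib` and `speedup`, that are not part of the test program: the first converts an element count to binary gigabytes ( 2^30 bytes ), the second gives the speedup of an N-thread run relative to the 1-thread run.

```cpp
#include <cassert>
#include <cstdint>

// Memory footprint in binary gigabytes (1 GB = 2^30 bytes) for a given
// element count and element size. Helper name is illustrative only.
double allocated_gib(std::uint64_t elements, std::uint64_t bytes_per_element)
{
    const double bytes = static_cast<double>(elements * bytes_per_element);
    return bytes / (1024.0 * 1024.0 * 1024.0);
}

// Speedup of an N-thread run over the 1-thread run: t(1) / t(N).
double speedup(double t_one_thread, double t_n_threads)
{
    assert(t_one_thread > 0.0 && t_n_threads > 0.0);
    return t_one_thread / t_n_threads;
}
```

With 4-byte SP elements, `allocated_gib(7969177600, 4)` gives 29.6875, which matches the reported allocation. For the Intel C++ timings, `speedup` gives roughly 1.33x at 2 threads, 1.37x at 4 threads and 1.30x at 8 threads, so going from 1 to 2 threads accounts for most of the gain.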
SergeyKostrov
Note 1: SP is the Single Precision floating point data type.
Note 2: CPU utilization for these test cases was as follows:
Number of Threads: 1 - CPU utilization ~12%
Number of Threads: 2 - CPU utilization ~24%
Number of Threads: 4 - CPU utilization ~48%
Number of Threads: 8 - CPU utilization ~98%
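These utilization figures are close to what one would expect if each thread keeps exactly one of the 8 logical CPUs busy. A small sketch of that arithmetic ( the function name is hypothetical ):

```cpp
#include <cassert>

// Expected whole-machine CPU utilization, in percent, when each of
// `busy_threads` threads fully occupies its own logical CPU.
double expected_utilization_percent(int busy_threads, int logical_cpus)
{
    assert(logical_cpus > 0);
    assert(busy_threads >= 0 && busy_threads <= logical_cpus);
    return 100.0 * busy_threads / logical_cpus;
}
```

For 8 logical CPUs this gives 12.5%, 25%, 50% and 100% for 1, 2, 4 and 8 threads. A thread that is stalled waiting on memory still counts as busy, so high utilization here does not mean more useful work is being done.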
SergeyKostrov
Note 3: Measured times are in seconds. They are the best results ( not averages ), taken when peak performance for memory-bound processing was achieved.
Note 4: A portable method of thread-to-CPU binding ( not KMP-based ) was used at run time.
SergeyKostrov
Note 5: Thread-to-CPU binding schemas are as follows:
Number of Threads: 1 - 00->02
Number of Threads: 2 - 00->02, 01->04
Number of Threads: 4 - 00->02, 01->04, 02->06, 03->08
Number of Threads: 8 - 00->01, 01->03, 02->05, 03->07, 04->02, 05->04, 06->06, 07->08
where '00->02' means that OpenMP thread '00' was bound to Logical CPU '02'. Logical CPUs are numbered from '01' to '08' for a CPU with four ( 4 ) Cores and eight ( 8 ) Logical CPUs.
The thread-to-CPU binding schema for four threads matches the following KMP-based binding schema ( KMP_AFFINITY numbers OS procs from 0, so Logical CPUs '02', '04', '06', '08' correspond to OS procs 1, 3, 5, 7 ):
KMP_AFFINITY=granularity=fine,proclist=[1,3,5,7],explicit,verbose
and its run-time output is as follows:
...
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {1}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {3}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {5}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {7}
...
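The portable binding method itself is not shown in this thread. As a rough sketch only, the conversion from the 1-based Logical CPU numbers used above to the 0-based OS proc numbers of a KMP_AFFINITY proclist, and to the single-bit masks that affinity APIs such as SetThreadAffinityMask on Windows accept, could look like this. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Convert a 1-based Logical CPU number ('01'..'08' in the notes above)
// to the 0-based OS proc index used in KMP_AFFINITY proclists.
int os_proc_from_logical_cpu(int logical_cpu)
{
    assert(logical_cpu >= 1 && logical_cpu <= 64);
    return logical_cpu - 1;
}

// Single-bit affinity mask for one Logical CPU (bit 0 = OS proc 0).
std::uint64_t affinity_mask(int logical_cpu)
{
    return std::uint64_t{1} << os_proc_from_logical_cpu(logical_cpu);
}

// Build the "[a,b,c]" proclist text for a list of 1-based Logical CPUs.
std::string kmp_proclist(const std::vector<int>& logical_cpus)
{
    std::string out = "[";
    for (std::size_t i = 0; i < logical_cpus.size(); ++i) {
        if (i != 0) {
            out += ",";
        }
        out += std::to_string(os_proc_from_logical_cpu(logical_cpus[i]));
    }
    out += "]";
    return out;
}
```

For the four-thread schema, `kmp_proclist({2, 4, 6, 8})` yields `[1,3,5,7]`, the proclist in the KMP_AFFINITY setting above, and the corresponding masks are 0x02, 0x08, 0x20 and 0x80.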