*** Optimal Number of OpenMP threads for Memory Bound processing ***
[ Abstract ]
Here is a question: how many OpenMP threads should be used for memory-bound processing?
I've completed some tests, and it is clear that for memory-bound processing the optimal
number of threads should not exceed the number of hardware threads of the CPU, or it should equal
the number of memory channels the CPU supports ( just 2 for most Intel CPUs ).
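
The kind of test described above can be sketched roughly as follows. This is only a minimal illustration: the kernel used in the actual tests ( reported as ALGORITHM_PREFETCH ) is not shown in this thread, so scale_in_place and time_one_pass are hypothetical stand-ins, chosen because they move a lot of data and do almost no arithmetic.

```cpp
#include <chrono>
#include <vector>

// Hypothetical stand-in for a memory-bound kernel: one load and one store
// per element, with a single multiply in between.
// The OpenMP pragma takes effect only when OpenMP is enabled at compile time;
// otherwise the loop simply runs on one thread.
void scale_in_place(std::vector<float>& data, float factor, int threads)
{
    float* p = data.data();
    const long long n = static_cast<long long>(data.size());
    #pragma omp parallel for num_threads(threads) schedule(static)
    for (long long i = 0; i < n; ++i)
        p[i] *= factor;
}

// Wall-clock seconds for one pass of the kernel with the given thread count.
double time_one_pass(std::vector<float>& data, int threads)
{
    const auto start = std::chrono::steady_clock::now();
    scale_in_place(data, 1.0001f, threads);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling time_one_pass with 1, 2, 4 and 8 threads on an array much larger than the L3 cache would give a thread-scaling curve of the same kind as the results in this thread, assuming the stand-in kernel is representative.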
[ Computer System used for performance evaluations ]
** Dell Precision Mobile M4700 **
Intel Core i7-3840QM ( 2.80 GHz )
Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/products/70846
32GB RAM
320GB HDD
NVIDIA Quadro K1000M ( 192 CUDA cores / 2GB memory )
Windows 7 Professional 64-bit SP1
Size of L3 Cache = 8MB ( shared between all cores for data & instructions )
Size of L2 Cache = 1MB ( 256KB per core / shared for data & instructions )
Size of L1 Cache = 256KB ( 32KB per core for data & 32KB per core for instructions )
Display resolution: 1366 x 768
[ Microsoft C++ compiler ( 64-bit ) ]
Data Set Size : 7,969,177,600 SP elements ( Memory Allocated: 29.6875 GB )
Number of Iterations: 1
Number of Tests : 4
Number of Threads : 1
ALGORITHM_PREFETCH: 3.47475 secs
Number of Threads : 2
ALGORITHM_PREFETCH: 2.78875 secs
Number of Threads : 4
ALGORITHM_PREFETCH: 2.71450 secs
Number of Threads : 8
ALGORITHM_PREFETCH: 2.87825 secs
[ MinGW C++ compiler ( 64-bit ) ]
Data Set Size : 7,969,177,600 SP elements ( Memory Allocated: 29.6875 GB )
Number of Iterations: 1
Number of Tests : 4
Number of Threads : 1
ALGORITHM_PREFETCH: 3.63450 secs
Number of Threads : 2
ALGORITHM_PREFETCH: 2.82350 secs
Number of Threads : 4
ALGORITHM_PREFETCH: 2.71825 secs
Number of Threads : 8
ALGORITHM_PREFETCH: 2.87450 secs
[ Intel C++ compiler ( 64-bit ) ]
Data Set Size : 7,969,177,600 SP elements ( Memory Allocated: 29.6875 GB )
Number of Iterations: 1
Number of Tests : 4
Number of Threads : 1
ALGORITHM_PREFETCH: 3.77525 secs
Number of Threads : 2
ALGORITHM_PREFETCH: 2.84325 secs
Number of Threads : 4
ALGORITHM_PREFETCH: 2.74575 secs
Number of Threads : 8
ALGORITHM_PREFETCH: 2.90950 secs
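
The numbers in the three result sets can be cross-checked with a little arithmetic. 7,969,177,600 SP elements at 4 bytes each is 31,876,710,400 bytes, which is exactly 29.6875 when divided by 2^30, so the "GB" in the output is a binary gigabyte. For the Microsoft compiler the speedup over one thread works out to about 1.25x with 2 threads, 1.28x with 4 threads and 1.21x with 8 threads. The helper names below are hypothetical and only illustrate the calculation.

```cpp
#include <cstdint>

// Bytes occupied by 'elements' single-precision values (4 bytes each).
constexpr std::uint64_t sp_bytes(std::uint64_t elements)
{
    return elements * 4u;
}

// Convert bytes to binary gigabytes (1 GiB = 2^30 bytes).
constexpr double to_gib(std::uint64_t bytes)
{
    return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0);
}

// Speedup of an N-thread run relative to the single-thread run.
constexpr double speedup(double seconds_1_thread, double seconds_n_threads)
{
    return seconds_1_thread / seconds_n_threads;
}
```

For example, speedup(3.47475, 2.71450) uses the 1-thread and 4-thread times from the Microsoft compiler results.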
Note 1: SP is a Single Precision floating point data type.
Note 2: CPU utilization for these test cases was as follows:
Number of Threads: 1 - CPU utilization ~12%
Number of Threads: 2 - CPU utilization ~24%
Number of Threads: 4 - CPU utilization ~48%
Number of Threads: 8 - CPU utilization ~98%
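
These figures are close to what one would expect if each OpenMP thread keeps one of the eight logical CPUs fully busy: 12.5%, 25%, 50% and 100%. A minimal sketch of that estimate, with a hypothetical helper name and under that idealised assumption:

```cpp
// Percentage of total CPU capacity in use if each of 'threads' threads
// fully occupies one of 'logical_cpus' logical CPUs (idealised assumption).
constexpr double expected_utilization_percent(int threads, int logical_cpus)
{
    return 100.0 * threads / logical_cpus;
}
```

High utilization here only shows that the threads are occupied, not that they are doing useful work; in a memory-bound loop much of that time may be spent waiting on memory.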
Note 3: Measured time intervals are in seconds. These are the best results ( not averaged ), taken when peak performance for memory-bound processing was achieved.
Note 4: A portable method of thread-to-CPU binding ( not KMP-based ) was used at run time.
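
Reporting the best time rather than the average, as in Note 3, could be done along these lines. The timing harness used for the tests is not shown in this thread, so best_of is a hypothetical helper, and treating "Number of Tests : 4" as the repeat count is an assumption.

```cpp
#include <algorithm>
#include <chrono>
#include <limits>

// Run 'run' the given number of times and return the shortest wall-clock
// time in seconds (best result, not an average).
template <typename Fn>
double best_of(int tests, Fn&& run)
{
    double best = std::numeric_limits<double>::infinity();
    for (int t = 0; t < tests; ++t) {
        const auto start = std::chrono::steady_clock::now();
        run();
        const auto stop = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(stop - start).count());
    }
    return best;
}
```

Taking the minimum filters out runs disturbed by other activity on the system, which is a common choice when the goal is to measure peak achievable performance.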
Note 5: Thread-to-CPU binding schemes are as follows:
Number of Threads: 1 - 00->02
Number of Threads: 2 - 00->02, 01->04
Number of Threads: 4 - 00->02, 01->04, 02->06, 03->08
Number of Threads: 8 - 00->01, 01->03, 02->05, 03->07, 04->02, 05->04, 06->06, 07->08
where '00->02' means that OpenMP thread '00' was bound to logical CPU '02'.
Logical CPUs are numbered from '01' to '08' for a CPU with four ( 4 ) cores and
eight ( 8 ) logical CPUs.
The thread-to-CPU binding scheme for four threads matches the following KMP-based
binding scheme ( KMP_AFFINITY numbers OS procs from 0, so logical CPUs '02', '04', '06'
and '08' correspond to OS procs 1, 3, 5 and 7 ):
KMP_AFFINITY=granularity=fine,proclist=[1,3,5,7],explicit,verbose
and its run-time output is as follows:
...
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {1}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {3}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {5}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {7}
...
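
The binding schemes in Note 5 can be written down as data, together with the conversion from the one-based logical CPU numbers used there to the zero-based OS proc IDs that KMP_AFFINITY reports. This is only a sketch: the portable binding method actually used is not shown in this thread, and all function names below are hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Logical CPU (numbered from 1, as in Note 5) for each OpenMP thread,
// indexed by thread number. Empty for thread counts Note 5 does not list.
std::vector<int> binding_scheme(int threads)
{
    switch (threads) {
        case 1:  return {2};
        case 2:  return {2, 4};
        case 4:  return {2, 4, 6, 8};
        case 8:  return {1, 3, 5, 7, 2, 4, 6, 8};
        default: return {};
    }
}

// One-based logical CPU number to zero-based OS proc ID.
constexpr int os_proc_id(int logical_cpu)
{
    return logical_cpu - 1;
}

// Single-bit affinity mask selecting the given one-based logical CPU.
constexpr std::uint64_t affinity_mask(int logical_cpu)
{
    return std::uint64_t{1} << os_proc_id(logical_cpu);
}
```

On Windows a mask like this could, for example, be passed to SetThreadAffinityMask from inside each OpenMP thread; whether the tests did it that way is an assumption.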
