I am facing a problem while programming in OpenMP. I am running my code on a server with 48 cores across two sockets (24 cores per socket). My program scales well up to 24 OpenMP threads with static scheduling. Beyond 24 threads, however, the running time increases, and it keeps increasing all the way up to 48 threads. I assume this is caused by a memory-bandwidth limitation of the ccNUMA architecture. What is the best way to solve this problem? Thanks in advance.
The running time of my code is O(n^2). I simply put an OpenMP for-loop directive before each loop, like the following:
#pragma omp parallel for schedule(static)
The performance scaling on a NUMA system depends on what is limiting your application performance on a single node.
In most cases, it is critical to apply thread binding to make sense of the performance results.
Some common cases:
- If the code is memory-bandwidth-limited and has good locality, then you just need to ensure that the data structures are initialized using the same thread that is going to be accessing the data most frequently.
- Typically you just need to duplicate your OMP parallel for pragma on the loop that initializes the data arrays. An example is line 267 of http://www.cs.virginia.edu/stream/FTP/Code/stream.c (ignore the comment at line 266, it should be below that loop, not above it).
- On some systems, automatic NUMA page migration will re-map physical addresses so that after some seconds of execution time, data will be moved to become local to the thread that uses it the most.
- If the code is dominated by random or global cache-to-cache transfers, it may be difficult to get performance improvements when using multiple sockets. This is just "physics" -- cross-chip transfers are slower than on-chip transfers.
- If the code is dominated by very small loops, then the overhead of the implicit OpenMP barriers will make scaling challenging, and will cause a drop in performance when the threads spread across multiple sockets. In this case you generally need to try to apply the parallelization at a larger granularity in the code.
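The first case above (initializing data with the same threads that later use it) can be sketched roughly as follows. This is only an illustration, assuming a Linux-style first-touch page placement policy and threads that are bound to cores; the function names `alloc_first_touch` and `scale_and_sum` are hypothetical and not taken from the original code. Both loops use the same static schedule so that each thread revisits the elements it initialized.

```c
#include <stdlib.h>

/* Hypothetical helper: allocate an array and initialize it in parallel.
 * Assumption: under a first-touch policy, each page is placed on the
 * NUMA node of the thread that writes it first, so the initialization
 * loop decides where the data lives. */
double *alloc_first_touch(size_t n)
{
    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return NULL;

    /* Same pragma as the compute loop below. */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        a[i] = 0.0;

    return a;
}

/* Hypothetical compute loop: the identical static schedule means each
 * thread works on the same index range it initialized above, provided
 * the thread count and trip count are unchanged. */
double scale_and_sum(double *a, size_t n, double s)
{
    double sum = 0.0;

    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (size_t i = 0; i < n; i++) {
        a[i] = a[i] * s + 1.0;
        sum += a[i];
    }
    return sum;
}
```

Call `alloc_first_touch` once before the timed region, then call `scale_and_sum` on the returned array. If the initialization were done serially instead, all pages would probably end up on the socket of the initializing thread, which is consistent with scaling that stops at 24 threads.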
I don't recommend including "schedule(static)" unless you have a specific reason for that choice. I typically use the default (no "schedule" clause), or I use "schedule(runtime)" and control the schedule using the environment variable OMP_SCHEDULE.
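As a minimal sketch of the `schedule(runtime)` approach, the loop below defers the scheduling decision to the `OMP_SCHEDULE` environment variable. The function `sum_of_squares` is a hypothetical stand-in for the real loop body.

```c
#include <stddef.h>

/* Hypothetical kernel: the schedule is not fixed at compile time.
 * With schedule(runtime), the OpenMP runtime reads the schedule kind
 * and chunk size from OMP_SCHEDULE when the loop is entered. */
double sum_of_squares(const double *x, size_t n)
{
    double sum = 0.0;

    #pragma omp parallel for schedule(runtime) reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += x[i] * x[i];

    return sum;
}
```

The same binary can then be timed with, for example, `OMP_SCHEDULE=static`, `OMP_SCHEDULE="dynamic,64"`, or `OMP_SCHEDULE=guided`, without recompiling. The chunk size 64 is only an example value.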
To understand the impact of thread placement across sockets, I recommend running your scaling study using OMP_PROC_BIND=spread and again using OMP_PROC_BIND=close.
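When comparing the two binding policies, it can help to confirm where the threads actually land. The sketch below assumes an OpenMP 4.5 (or later) compiler, since it relies on `omp_get_place_num()` and `omp_get_num_places()`; the function name `report_placement` is hypothetical. It prints one line per thread and falls back to a message if the code is built without suitable OpenMP support.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: print the place each thread is bound to and
 * return the number of threads in the team (1 without OpenMP). */
int report_placement(void)
{
    int nthreads = 1;

#if defined(_OPENMP) && _OPENMP >= 201511
    #pragma omp parallel
    {
        /* The critical section keeps output lines from interleaving. */
        #pragma omp critical
        {
            nthreads = omp_get_num_threads();
            /* omp_get_place_num() returns -1 if the thread is unbound. */
            printf("thread %d of %d on place %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads(),
                   omp_get_place_num(), omp_get_num_places());
        }
    }
#else
    printf("built without OpenMP 4.5 support; no placement information\n");
#endif

    return nthreads;
}
```

Calling `report_placement()` at the start of `main` and running once with `OMP_PROC_BIND=spread` and once with `OMP_PROC_BIND=close` (optionally together with `OMP_PLACES=cores`) should show whether the threads are distributed across both sockets or packed onto one.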
To understand the impact of memory placement across sockets, I recommend re-running your scaling study using:
- Memory limited to the local socket: numactl --membind=0 --cpunodebind=0 a.out # with up to 24 threads
- Memory limited to the remote socket: numactl --membind=1 --cpunodebind=0 a.out # with up to 24 threads
- Memory interleaved across the two sockets: numactl --interleave=0,1 a.out # with up to 48 threads
Variations in performance across these cases will tell you a lot about what NUMA characteristics control your performance.