I am facing a problem while programming in OpenMP. I am running my code on a server with 48 cores across two sockets (24 cores per socket). The program scales well up to 24 OpenMP threads with static scheduling, but beyond that the running time increases as I scale toward 48 threads. I suspect this is caused by the bandwidth limitations of the ccNUMA architecture. My question is: how can I mitigate this problem in a comparatively better way? Thanks in advance.
The running time of my code is O(n^2). I simply put an OpenMP for directive before the main loop, like this:
#pragma omp parallel for schedule(static)
The performance scaling on a NUMA system depends on what is limiting your application performance on a single node.
In most cases, it is critical to apply thread binding (pinning); otherwise the operating system may migrate threads between cores and sockets, making the performance results impossible to interpret. Some common steps:
I don't recommend including "schedule(static)" unless you have a specific reason for that choice. I typically use the default (no "schedule" clause), or I use "schedule(runtime)" and control the schedule using the environment variable OMP_SCHEDULE.
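As a sketch of that workflow (the binary name `./app` is a placeholder for your compiled program, which would use `schedule(runtime)` in its pragma):

```shell
# With schedule(runtime), the loop schedule is read from OMP_SCHEDULE
# at program start, so you can compare schedules without recompiling.
export OMP_SCHEDULE="static"      # equivalent to schedule(static)
# ./app
export OMP_SCHEDULE="dynamic,64"  # dynamic scheduling, chunk size 64
# ./app
echo "OMP_SCHEDULE=$OMP_SCHEDULE"
```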
To understand the impact of thread placement across sockets, I recommend running your scaling study using OMP_PROC_BIND=spread and again using OMP_PROC_BIND=close.
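A possible shape for that experiment (again with `./app` standing in for your program; `OMP_PLACES=cores` pins one thread per physical core):

```shell
export OMP_NUM_THREADS=24
export OMP_PLACES=cores

# spread: distribute threads as evenly as possible across both sockets,
# doubling the available memory bandwidth at 24 threads.
export OMP_PROC_BIND=spread
# ./app

# close: pack threads onto adjacent cores, filling one socket first,
# so 24 threads share a single socket's bandwidth.
export OMP_PROC_BIND=close
# ./app
echo "OMP_PROC_BIND=$OMP_PROC_BIND"
```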
To understand the impact of memory placement across sockets, I recommend re-running your scaling study with interleaved memory allocation (e.g. `numactl --interleave=all ./app`) and comparing it against the default first-touch policy.
Variations in performance across these cases will tell you a lot about what NUMA characteristics control your performance.