This is a question received by Intel Software Network Support, followed by the response provided by our Application Engineering team:
Q. I have a problem when I try parallelization with OpenMP on a Quad Core (processor X5355).
When I'm using a single processor the running time is 48 minutes. If I use four processors the running time is 23 minutes, but if I use 5, 6, 7, or 8 the running time is the same: 21 minutes! When I'm using 5 to 8 processors the running time does not decrease. What is the problem?
A. Performance scaling is a complex topic. Much depends on the details of the algorithm.
How much concurrency is exposed by the algorithm? If there are only 2 tasks that execute concurrently, for example, then you won't be able to see any speedup beyond 2. Also, you can't scale beyond the number of cores (unless you get secondary memory effects leading to super-linear speedup). So if you have 4 cores, you won't usually speed up as you add more than 4 threads. In fact, adding more threads than you have cores will often slow you down due to the resulting excess thread scheduling overhead.
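One way to avoid oversubscribing the machine is to cap the requested thread count at what the OpenMP runtime reports. The following is a minimal sketch, not a recommendation for every workload. The helper names capped_threads and limit_to_cores are made up for illustration; omp_get_num_procs and omp_set_num_threads are standard OpenMP runtime calls. Note that omp_get_num_procs counts logical processors, which can exceed the number of physical cores on systems with Hyper-Threading.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: the smaller of the requested thread count
   and the number of cores, never less than 1. */
int capped_threads(int requested, int cores)
{
    if (requested < 1)
        return 1;
    return requested < cores ? requested : cores;
}

/* Hypothetical helper: apply the cap to the OpenMP runtime and
   return the thread count chosen. When the file is compiled without
   OpenMP the program runs serially, so the answer is 1. */
int limit_to_cores(int requested)
{
#ifdef _OPENMP
    int n = capped_threads(requested, omp_get_num_procs());
    omp_set_num_threads(n);   /* affects later parallel regions */
    return n;
#else
    (void)requested;
    return 1;
#endif
}
```

Calling limit_to_cores(8) before the first parallel region on a 4-core machine would leave the runtime at 4 threads.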
But it gets even trickier. Even if you have hundreds of potentially concurrent tasks, you have to worry about how much of the compute time is inherently serial. This is the issue of Amdahl's Law. Amdahl's Law says that the maximum speedup you can see is one over the serial fraction of the problem. For example, if 25% of your runtime is consumed by work that doesn't speed up as cores are added, then the maximum speedup is 1/(0.25) or 4.
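To make the arithmetic concrete, here is a small sketch that evaluates Amdahl's Law and also inverts it to estimate the serial fraction implied by a measured speedup. It assumes the simple Amdahl model is the only effect in play, which is rarely exactly true. The function names are made up for illustration, and the 48 and 23 minute figures are the ones quoted in the question.

```c
#include <stdio.h>

/* Amdahl's Law: predicted speedup on n threads when a fraction s
   of the single-thread runtime is serial. */
double amdahl_speedup(double s, int n)
{
    return 1.0 / (s + (1.0 - s) / n);
}

/* Amdahl's Law solved for s: the serial fraction implied by a
   measured speedup on n threads (n must be greater than 1). */
double implied_serial_fraction(double speedup, int n)
{
    return (1.0 / speedup - 1.0 / n) / (1.0 - 1.0 / n);
}

/* Print the estimate for the timings quoted in the question. */
void report(void)
{
    double t1 = 48.0, t4 = 23.0;   /* minutes on 1 and 4 threads */
    double s = implied_serial_fraction(t1 / t4, 4);

    printf("implied serial fraction: %.3f\n", s);
    printf("speedup ceiling (1/s):   %.2f\n", 1.0 / s);
    printf("predicted time on 8:     %.1f min\n",
           t1 / amdahl_speedup(s, 8));
}
```

Under this model the question's timings imply a serial fraction of roughly 0.31 and a speedup ceiling of about 3.3, so only modest gains beyond 4 threads would be expected even before the other effects discussed here are counted.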
But it gets even trickier still. How much parallel overhead do you have? If I am using locks inefficiently or if my algorithm has many barriers or other constructs that impose a serial order, my scalability will suffer.
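As an illustration of lock overhead, the sketch below sums an array two ways. The function names are made up for the example, and it is not taken from the code in the question. The first version takes a lock on every iteration, so the threads spend most of their time waiting on each other. The second uses an OpenMP reduction, so each thread accumulates privately and the partial sums are combined once at the end.

```c
/* Sum with a critical section: correct, but every iteration
   serializes on the same lock. */
long sum_with_critical(const int *a, int n)
{
    long total = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp critical
        total += a[i];
    }
    return total;
}

/* Same sum with a reduction: each thread keeps a private partial
   sum, and the partial sums are combined after the loop. */
long sum_with_reduction(const int *a, int n)
{
    long total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += a[i];
    return total;
}
```

Both functions return the same result. Timing them against each other at several thread counts is a quick way to see how much a contended lock can cost.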
But that's not all. You have memory movement issues to consider. If the data needed by a thread is in a different cache or in RAM, then you will need to move that data into a local cache. Even if your data is not logically shared, if data used by different threads falls on the same cache line (false sharing), you'll get excess cache line movement that will eat up performance.
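A common way to limit cache line ping-pong between per-thread variables is to pad each one out to its own cache line. The sketch below assumes a 64-byte line size and at most 8 threads; both constants, the struct, and the function name are assumptions made for the example, so check the line size for your processor. When the file is compiled without OpenMP, a serial stand-in for omp_get_thread_num is used.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallback so the file also builds without OpenMP. */
static int omp_get_thread_num(void) { return 0; }
#endif

#define CACHE_LINE  64   /* assumed cache line size in bytes */
#define MAX_THREADS 8    /* assumed upper bound on threads */

/* One counter per thread, padded so that neighbouring counters
   never fall on the same cache line. With a bare array of longs,
   several counters would share one line and every update would
   invalidate the other threads' copies. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

/* Count the even elements of a, with each thread updating only
   its own padded counter. */
long count_even(const int *a, int n)
{
    struct padded_counter c[MAX_THREADS] = {{0}};

    #pragma omp parallel for num_threads(MAX_THREADS)
    for (int i = 0; i < n; i++)
        if (a[i] % 2 == 0)
            c[omp_get_thread_num()].value++;

    /* Combine the per-thread counts serially. */
    long total = 0;
    for (int t = 0; t < MAX_THREADS; t++)
        total += c[t].value;
    return total;
}
```

In real code a reduction clause would usually be the simpler choice for a plain count; the padded struct is shown only to make the cache line layout visible.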
The bottom line is that parallel scalability is complex. In many cases, to understand why your program scales a particular way, you need to profile the multi-threaded execution using Intel's profiling tools.