Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Very Poor Concurrency, Threads Don't Wake Up?

I'm building a real-time simulation system in C++. The architecture was designed to be appropriate for parallel processing, but my concurrency is awful. The basic simulation loop uses one "sync" thread to set up the next frame (16.7 msec) to be simulated, plus several "sim" threads which do the time stepping. Part of the prep phase for each frame is to break all the data into separable chunks, the goal being that each chunk can be processed completely independently of any other chunk. In the simulation phase, the sim threads each take a chunk, process it (with no need to sync between threads), take the next chunk, and so on until all the chunks have been processed. The code architecture is thus something like the following:

while (true) { // Sync Thread
    // ... prepare the next frame: split the data into independent chunks,
    // ... wake the sim threads, wait for all chunks to be processed ...
}

while (true) { // Sim Thread
    // ... wait for the frame to be ready ...
    while (getChunk()) {
        // ... process the chunk; no sync with other sim threads ...
    }
}

Being designed for real-time, this code doesn't hit the disk or network during the simulation phase. I use a Slim Read/Write lock and condition variables to pass control between the sync thread and the multiple sim threads. Since nothing else significant goes on while the sim threads run, I create as many of them as there are logical cores. I've analyzed this code on both a 4-core i7 and a bigger 16-core Xeon server.
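To make the handoff concrete, here is a minimal portable sketch of the structure described above, with std::mutex and std::condition_variable standing in for the Windows SRW lock and condition variables the poster uses; the class name, member names, and the stubbed-out chunk processing are all stand-ins, not the poster's real code:

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Sync thread publishes a frame of N independent chunks; sim threads wake,
// claim chunks with an atomic counter (getChunk()), and signal completion.
class FrameSim {
public:
    std::atomic<long> processedChunks{0};   // total chunks done, all frames

    explicit FrameSim(int nThreads) : nThreads_(nThreads) {
        for (int i = 0; i < nThreads; ++i)
            workers_.emplace_back([this] { simLoop(); });
    }

    ~FrameSim() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        ready_.notify_all();
        for (auto& t : workers_) t.join();
    }

    // Sync thread: publish one frame of nChunks chunks, wake the sim
    // threads, then block until every sim thread has finished the frame.
    void runFrame(int nChunks) {
        std::unique_lock<std::mutex> lk(m_);
        next_.store(0);
        total_ = nChunks;
        finishedCount_ = 0;
        ++frame_;
        ready_.notify_all();
        finished_.wait(lk, [&] { return finishedCount_ == nThreads_; });
    }

private:
    void simLoop() {
        long seen = 0;
        for (;;) {
            {   // wait for the sync thread to publish a new frame
                std::unique_lock<std::mutex> lk(m_);
                ready_.wait(lk, [&] { return frame_ != seen || done_; });
                if (done_) return;
                seen = frame_;
            }
            // getChunk(): atomically claim chunks until the frame is empty
            for (;;) {
                int c = next_.fetch_add(1);
                if (c >= total_) break;
                // ... time-step chunk c; no sync with other sim threads ...
                processedChunks.fetch_add(1);
            }
            std::lock_guard<std::mutex> lk(m_);
            if (++finishedCount_ == nThreads_) finished_.notify_one();
        }
    }

    int nThreads_;
    std::mutex m_;
    std::condition_variable ready_, finished_;
    std::vector<std::thread> workers_;
    std::atomic<int> next_{0};
    int total_ = 0, finishedCount_ = 0;
    long frame_ = 0;
    bool done_ = false;
};
```

With this structure every worker participates in every frame, because the sync thread waits for all of them to check in before publishing the next frame.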

My problem is that on the 4-core machine, I typically see just 2 of the 4 sim-threads actually do any work. The other 2 sim-threads do no processing. As far as I can tell, they don't actually wake up and do any work until all the work is complete. On the 16-core machine, the performance is even worse. I typically see just 3 of the 16 sim-threads doing any work. The remaining 13 again appear to not run until all the work is complete.

I tried using different sync objects - instead of SRW locks, I used Events. There was no difference. I've used Parallel Studio to attempt to better understand what's going on. It indeed shows a lot of time spent by the sim-threads waiting to run, but I guess I don't understand what I'm seeing well enough to learn any more than this.

The reason I say that it appears the sim-threads don't wake up is that in each sim-thread I call QueryPerformanceCounter() immediately after the thread wakes up. The threads that actually do work show 0.01 msec. The threads that don't do any work show 5-10 msec. Is this even an accurate measurement?
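As a sanity check on the measurement itself, the same signal-to-wake timing can be sketched portably with std::chrono::steady_clock in place of QueryPerformanceCounter; the function name and the 10 ms settle delay are stand-ins, not the poster's code:

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Stamp the moment the signaling thread releases the worker, and the moment
// the worker returns from its wait; the difference is the wake-up latency.
double measureWakeLatencyMs() {
    std::mutex m;
    std::condition_variable cv;
    bool go = false;
    std::chrono::steady_clock::time_point signaled, woke;

    std::thread worker([&] {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return go; });
        woke = std::chrono::steady_clock::now();   // first thing after waking
    });

    std::this_thread::sleep_for(std::chrono::milliseconds(10)); // let it block
    {
        std::lock_guard<std::mutex> lk(m);
        go = true;
        signaled = std::chrono::steady_clock::now();
    }
    cv.notify_one();
    worker.join();
    return std::chrono::duration<double, std::milli>(woke - signaled).count();
}
```

On an unloaded machine this typically reports well under a millisecond; values of 5-10 msec would point at scheduler delay rather than a timer artifact.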

Any advice would be greatly appreciated. This seems like a good strategy - why doesn't it work?
9 Replies
The i7 Quad has Hyper-Threading, so it effectively has 8 hardware threads. How do you determine the number of cores, and what thread affinity mask do you start your sim threads with?

Sorry, I misspoke - the laptop I use is based on an i7-2620M, which is a 2-core, 4-thread CPU. I call GetSystemInfo() and use the dwNumberOfProcessors member, so I create the same number of threads as there are logical processors.

Similarly, the server I'm using has two X5570 processors, so GetSystemInfo() reports 2x4x2, or 16, logical processors.

I do not specify any thread affinity mask for either my process or any of my threads. I believe this means that any thread can/will run on any processor?

I have run my tests with Hyper-Threading enabled and disabled in the BIOS. Since my app creates threads based on the number of logical processors, it switches between 2/4 threads on the laptop and 8/16 on the server. In either case, I see the same behavior - many of my worker threads do no work. My conclusion from this is that even though Hyper-Threading doesn't fully duplicate all components of the execution pipeline, it is not the source of my problem.
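For reference, the portable counterpart of the GetSystemInfo()/dwNumberOfProcessors check is std::thread::hardware_concurrency(), which likewise reports logical processors (so the value doubles when Hyper-Threading is enabled); the function name here is a stand-in:

```cpp
#include <thread>

// Returns the number of logical processors, matching what
// dwNumberOfProcessors reports on Windows.
unsigned logicalProcessors() {
    unsigned n = std::thread::hardware_concurrency();
    return n != 0 ? n : 1;   // the call may return 0 if the count is unknown
}
```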

Black Belt
Try double buffering

while (true) { // Sync Thread
    // ... prepare frame N+1 in the back buffer while the sim threads
    // ... are still stepping frame N in the front buffer, then swap ...
}

Jim Dempsey
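
A minimal portable sketch of how the double-buffered variant might be structured: while one thread steps the current frame, the other prepares the next one in the second buffer, so neither side sits idle. Frame, prepFrame, and simFrame are stand-ins, not the poster's real code:

```cpp
#include <cassert>
#include <future>
#include <vector>

struct Frame { std::vector<int> chunks; };

void prepFrame(Frame& f, int n) { f.chunks.assign(8, n); }   // stand-in prep
int  simFrame(const Frame& f) {                              // stand-in step
    int sum = 0;
    for (int c : f.chunks) sum += c;
    return sum;
}

int runDoubleBuffered(int nFrames) {
    Frame buf[2];
    prepFrame(buf[0], 0);                       // prime the first buffer
    int result = 0;
    for (int n = 0; n < nFrames; ++n) {
        Frame& cur  = buf[n & 1];
        Frame& next = buf[(n + 1) & 1];
        // Step the current frame on another thread while this thread
        // prepares the next one - the overlap is the whole point.
        auto stepping = std::async(std::launch::async,
                                   [&cur] { return simFrame(cur); });
        if (n + 1 < nFrames) prepFrame(next, n + 1);
        result += stepping.get();   // frame N finishes before N+2's prep
    }
    return result;
}
```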
Just to make sure you have the threads available, check for the affinity mask:

DWORD_PTR affinityProcess;
DWORD_PTR affinitySystem;
GetProcessAffinityMask(GetCurrentProcess(), &affinityProcess, &affinitySystem);

In order to force it onto all cores, set the affinity mask of each thread:

(n = index of the sim thread, starting from 0)
DWORD_PTR affinity = (DWORD_PTR)1 << n;  // not 1L, which truncates past bit 31 on 64-bit
SetThreadAffinityMask(GetCurrentThread(), affinity);

If it looks fine now, maybe the threads don't have enough work, what is the CPU usage?
Black Belt
Your Majesty -

I don't see anything wrong with your strategy. The double buffering idea from Jim Dempsey might be something to consider so the sim threads don't need to pause after a round of computation. But that's an idea to explore once you've got your concurrency metrics figured out.

My first question for you would be "Are all of the computations getting done?" If threads are sitting idle, there must be something that is not being done (or the computation is taking 5X longer than it should). Can you even know if that's the case? The second question is "Are you getting any speedup and, if so, is it close to the core utilization you're experiencing?"

By computing each round in a jiffy, you're looking for each thread to simulate 60 frames per second. I'm thinking that maybe the problem isn't with your concurrency, but with your measuring tools. If the tools are polling the cores and happen to only look when most threads are idle, your measurements can be skewed.

Have you tried putting a counter increment within each thread's code? Before the actual simulation computation, each thread increments a private counter. That way, once the overall computation is complete, each thread can print its counter to show how many times it woke up and did some processing. If your simulation runs 5 seconds, each thread should process around 300 frames. If you find that some threads have 300 frames and others only have 1 or 2, then there would be something wrong with your threads waking up and getting a task to execute.
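The per-thread counter idea can be sketched like this; the function name is a stand-in, the frame work is elided, and nFrames stands in for a 5-second run at 60 fps (about 300 frames):

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Each thread owns one slot in the counts vector and bumps it once per
// frame it actually works on; the counters are returned when the run ends.
std::vector<long> countFramesPerThread(int nThreads, int nFrames) {
    std::vector<long> counts(nThreads, 0);  // one private slot per thread
    std::vector<std::thread> pool;
    for (int t = 0; t < nThreads; ++t)
        pool.emplace_back([&counts, t, nFrames] {
            for (int f = 0; f < nFrames; ++f) {
                // ... wait for frame f and do this thread's share of work ...
                ++counts[t];                // "I woke and worked this frame"
            }
        });
    for (auto& th : pool) th.join();
    return counts;   // healthy: every entry is (close to) nFrames
}
```

No locking is needed for the counters themselves because each thread writes only its own element.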

Do the computations within each round that is simulated actually run for most of the 16.7 msec? Or are some simulation computations much shorter? In the latter case, the total computation might be done, but since some tasks are quickly disposed of, the measuring tool for concurrency will "see" very little sustained activity.

Just some ideas off the top of my head.

Hi Clay,

Thanks for the advice and suggestions.

I've attached a file containing some data that my instrumented app generated on a 16 (logical) core server. Each row is one "frame" of processing. The first number is the frame number; the subsequent 16 numbers are the number of "clusters" of data that each thread processed during that frame. Each frame, the app works until all clusters are processed. I took this sample after running for about one minute, so it is roughly in steady state. As you can see, the data is definitely split between the various threads; however, I estimate that around 5-6 of the 16 threads are idle each frame. I did as you suggested and added a warning that fires if a thread doesn't run at all. The warning is never triggered, which furthers my belief that all 16 threads run, but several of them do nothing - they don't actually get going until the other threads have done all the work.

I have the ability to run as many worker threads as desired. On the 16-core machine, I ran the same test but with 1, 2, 4, 8, and 16 worker threads. The relative performance times I generated were:
1 - 1.00
2 - 1.79
4 - 2.19
8 - 2.68
16 - 3.69
So, I'm definitely getting benefit from the extra cores, but 3.69x on 16 threads is only about 23% parallel efficiency - nowhere near the utilization I expect.

Running Parallel Amplifier 2011 shows the poor utilization. The following is the thread concurrency and CPU usage stats from a run with 16 worker threads on the 16-core server. Something's definitely messed up.

Thanks again for any advice/insight you can give me,

Thread Concurrency Histogram

Simultaneously Running Threads    Elapsed Time (s)    Utilization
0        0.0960909693    Idle
1       39.1101193839    Poor
2        0.1800625614    Poor
3        0.0587249692    Poor
4        0.0443305135    Poor
5        0.0391724917    Poor
6        0.0524729740    Poor
7        0.0381684657    Poor
8        0.0445166912    Poor
9        0.0329449971    Ok
10       0.0337680921    Ok
11       0.0320512057    Ok
12       0.0293827837    Ok
13       0.0308441983    Ok
14       0.1137116819    Ideal
15       1.2951453943    Ideal
16       0.0873021418    Ideal
17       0.0479819701    Ideal
18       0.0001284343    Ideal
19-24    0               Over

CPU Usage Histogram

Simultaneously Utilized Logical CPUs    Elapsed Time (s)    Utilization
0        4.0318927050    Idle
1        8.2473193015    Poor
2       29.0207201143    Poor
3        0.0521006096    Poor
4        0.0008353482    Poor
5        0.0040066626    Poor
6        0.0008112014    Poor
7        0.0000773402    Poor
8        0.0001683300    Poor
9        0.0000937885    Ok
10       0.0010253752    Ok
11       0.0007870543    Ok
12       0.0022197799    Ok
13       0.0011338621    Ok
14       0.0000314964    Ideal
15       0.0036969500    Ideal
16       0               Ideal

Black Belt
Joe -

It's hard to diagnose problems with just some raw numbers. What kinds of things did Parallel Amplifier show as poor utilization? Was there time spent waiting? Or did excessive synchronization time lead to the slowdowns you are seeing? And the synchronization may not be just locks used, but it could be the amount of time needed to coordinate between smaller and smaller workloads per thread as the number of threads increases.

Your speedup trend seems to indicate the latter. Or is the amount of work fixed per thread and as you use more threads, more work can be done simultaneously? For instance, if I have 1000 items to process independently, it can be done in about half the time by two threads and about 1/4 the time with four threads. When I get to 16 threads, there will only be 63 items per thread and the overhead of distributing those items can start to drag down the overall execution time such that the speedup will begin to flatten out.

Review what data you got from Parallel Amplifier to see if you can pinpoint better the cause of the slowdown from what you had expected. That's about the best advice I can give from not being able to look over your shoulder.

Thanks for the advice and suggestions. I will try it.

I do not think the solution in #2 works well.