Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Linux thread scheduling on Nehalem

matt_garman
Beginner
We have a lot of multi-threaded, event-driven processes that are more sensitive to latency than throughput.

We recently moved some of these processes from a CentOS 4.x server to one running CentOS 5.x (i.e. effectively the same as going from RHEL4 to RHEL5). While both machines are Nehalem-based, the CentOS5 box is newer (faster clock speed, dual hex-core vs dual quad-core).

We actually saw performance decrease.

Upon further investigation, I found that I could fix the performance problem by using taskset to confine the program's threads to a single physical CPU.

In other words, it appears that some Linux kernel change resulted in sub-optimal thread scheduling on machines with multiple physical Nehalem CPUs.
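
(For reference, the in-process equivalent of that taskset workaround is sched_setaffinity(). The sketch below is illustrative only; in particular, the assumption that cores 0-3 sit on one physical package is machine specific and should be checked against "physical id" in /proc/cpuinfo.)

[cpp]// Roughly what taskset does, but from inside the process: restrict the whole
// process to one set of cores before any worker threads are created.
// NOTE: the core numbers are an assumption for illustration only.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE                     // for CPU_ZERO/CPU_SET/sched_setaffinity
#endif
#include <sched.h>
#include <cstdio>

static int pin_to_one_package() {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 4; ++cpu)   // assume cores 0-3 are on package 0
        CPU_SET(cpu, &set);
    // pid 0 = calling thread; threads created afterwards inherit this mask
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

int main() {
    if (pin_to_one_package() != 0) return 1;
    // ... start the real threads here ...
    return 0;
}[/cpp]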

I wrote a simple program that simulates the behavior of the real production program. It has three threads and two shared queues. This is what it does (a condensed code sketch follows the thread descriptions):

Thread 1: in a continuous loop, about every 1--3 milliseconds:
- takes a timestamp using gettimeofday()
- creates an object about 100 bytes in size
- stores the timestamp in the object
- pushes the object into a thread-safe queue, called Queue A

Thread 2: in a continuous loop:
- pulls objects off of Queue A
- does some processing (a basic CRC calculation over fixed, static data; takes about 10 microseconds)
- creates a new object about 200 bytes in size
- copies the timestamp from the dequeued object to the new object
- deletes the dequeued object
- pushes the new object into another thread-safe queue, called Queue B

Thread 3: in a continuous loop:
- pulls objects off of Queue B
- takes a timestamp using gettimeofday()
- writes the delta of the two timestamps to a fixed, pre-allocated array
- deletes the dequeued object
- after a fixed number of objects are processed (usually 25,000), generates statistics on the timestamp deltas
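
Here is the condensed sketch of that test program. It is illustrative only: the names (MsgA, MsgB, TsQueue) and constants are stand-ins, and the real program presumably uses pthreads and its own thread-safe queue rather than C++11 threads.

[cpp]// Condensed, illustrative sketch of the three-thread test program.
#include <sys/time.h>
#include <unistd.h>
#include <algorithm>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

static long long now_us() {                 // gettimeofday(), as in the original
    timeval tv;
    gettimeofday(&tv, 0);
    return (long long)tv.tv_sec * 1000000LL + tv.tv_usec;
}

struct MsgA { long long ts_us; char payload[100]; };   // ~100-byte object
struct MsgB { long long ts_us; char payload[200]; };   // ~200-byte object

template <class T>
struct TsQueue {                            // simple mutex/condvar protected queue
    std::queue<T*> q;
    std::mutex m;
    std::condition_variable cv;
    void push(T* p) { { std::lock_guard<std::mutex> l(m); q.push(p); } cv.notify_one(); }
    T* pop() {
        std::unique_lock<std::mutex> l(m);
        while (q.empty()) cv.wait(l);
        T* p = q.front(); q.pop(); return p;
    }
};

static TsQueue<MsgA> queueA;
static TsQueue<MsgB> queueB;
static const int N_SAMPLES = 25000;

static void thread1() {                     // timestamp + enqueue every ~1-3 ms
    for (int i = 0; i < N_SAMPLES; ++i) {
        MsgA* a = new MsgA;
        a->ts_us = now_us();
        queueA.push(a);
        usleep(2000);                       // stand-in for the 1-3 ms arrival rate
    }
}

static void thread2() {                     // ~10 us of work, repackage, forward
    for (int i = 0; i < N_SAMPLES; ++i) {
        MsgA* a = queueA.pop();
        volatile unsigned crc = 0;          // rough stand-in for the CRC over static
        for (int k = 0; k < 2000; ++k)      // data (loop count not calibrated)
            crc += k;
        MsgB* b = new MsgB;
        b->ts_us = a->ts_us;
        delete a;
        queueB.push(b);
    }
}

static void thread3() {                     // measure end-to-end latency
    std::vector<long long> delta(N_SAMPLES);
    for (int i = 0; i < N_SAMPLES; ++i) {
        MsgB* b = queueB.pop();
        delta[i] = now_us() - b->ts_us;
        delete b;
    }
    std::sort(delta.begin(), delta.end());
    std::printf("min %lld us  median %lld us  max %lld us\n",
                delta.front(), delta[N_SAMPLES / 2], delta.back());
}

int main() {                                // build with: g++ -O2 -pthread sketch.cpp
    std::thread t1(thread1), t2(thread2), t3(thread3);
    t1.join(); t2.join(); t3.join();
    return 0;
}[/cpp]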

I ran this program on four different configurations:
(1) Core2-based system running CentOS 4
(2) Core2-based system running CentOS 5
(3) Nehalem-based system running CentOS 4
(4) Nehalem-based system running CentOS 5

All systems have dual physical CPU packages. In all but the last case (CentOS5 + Nehalem), the average timestamp delta was about 20 microseconds, with low variance.

On the CentOS5+Nehalem machine, however, the average timestamp delta was anywhere from 50 to 125 microseconds, with very high variance. As stated above, though, I could use taskset to force this test program onto a single physical CPU and get the results back in line with the other three configurations.

Note that for all testing, the machines were otherwise unloaded.

Does anyone have any thoughts on what might cause this discrepancy? Short of manually configuring all production processes with taskset, is there some kernel setting or system setting that can improve on this behavior?

Thank you!
Matt
Dmitry_Vyukov
Valued Contributor I
Quoting matt_garman

I was thinking more along the lines of a system-level configuration setting, like the configurable IO schedulers. Something like a sysctl that instructs the kernel to assume threads should be "densely" scheduled.

IMHO it would make sense.

In one program I used OpenMP to start threads with the code:

putenv("KMP_AFFINITY=scatter");
# pragma omp parallel for schedule(static, 1) num_threads(thread_count)
for (size_t thread_idx = 0; thread_idx < thread_count; thread_idx += 1)
...

and then in another phase of the program I have:

putenv("KMP_AFFINITY=compact");
# pragma omp parallel for schedule(static, 1) num_threads(thread_count)
for (size_t thread_idx = 0; thread_idx < thread_count; thread_idx += 1)
...

jimdempseyatthecove
Honored Contributor III
Matt,

To expand on Dmitriy's post...

The code as you laid it out is called a pipeline. However, it should be noted that although each pipe (stage) of your pipeline can run in parallel with the other pipes of your pipeline, the design does not permit an individual pipe to run in parallel with itself.

There is a different way to program this problem called a parallel pipeline. Both TBB and QuickThread have parallel_pipeline capability and you can do this with pthreads with a little more work.
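
For concreteness, here is a rough sketch of how your three stages might map onto TBB's parallel_pipeline (classic TBB 3.x API from tbb/pipeline.h; oneTBB renames some of these, and the lambdas are just shorthand for functors). MsgA, MsgB, next_message, process, and record are stand-ins, not code from this thread.

[cpp]// Illustrative sketch: three-stage parallel_pipeline with a parallel middle stage.
#include "tbb/pipeline.h"
#include <cstddef>

struct MsgA { long long ts_us; };
struct MsgB { long long ts_us; };

static MsgA* next_message() { return NULL; }   // stub: NULL ends the pipeline
static MsgB* process(MsgA* a) {                // stub for the CRC + repackaging
    MsgB* b = new MsgB; b->ts_us = a->ts_us; return b;
}
static void record(MsgB*) {}                   // stub for the latency bookkeeping

void run_pipeline() {
    tbb::parallel_pipeline(
        8,  // max messages ("tokens") in flight at once; tune experimentally
        // Stage 1: serial_in_order -- one instance at a time, tags arrival order.
        tbb::make_filter<void, MsgA*>(tbb::filter::serial_in_order,
            [](tbb::flow_control& fc) -> MsgA* {
                MsgA* a = next_message();
                if (!a) { fc.stop(); return NULL; }
                return a;
            }) &
        // Stage 2: parallel -- several messages can be processed concurrently.
        tbb::make_filter<MsgA*, MsgB*>(tbb::filter::parallel,
            [](MsgA* a) -> MsgB* { MsgB* b = process(a); delete a; return b; }) &
        // Stage 3: serial_in_order -- items leave in the same order they arrived.
        tbb::make_filter<MsgB*, void>(tbb::filter::serial_in_order,
            [](MsgB* b) { record(b); delete b; }));
}

int main() { run_pipeline(); return 0; }[/cpp]

The serial_in_order first and last stages preserve arrival order end to end, while the middle stage fans out across the worker threads.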

Diagrams of your current pipeline code and parallel_pipeline

[bash]PThread pipeline as written Low message rate:
---- time -----}
---thread 1---} (1) one message percolating through queue
---thread 2---} (1) one message percolating through queue
---thread 3---}



High Message Rate:


---thread 1---}
(n) n messages percolating through queue
---thread 2---}
(m) m messages percolating through queue
---thread 3---}


===============



Parallel Pipeline Low Message Rate:
---task 1---}
(1) one message percolating through queue
---task 2---}
(1) one message percolating through queue
---task 3---}


Slightly longer than low message rate for pthread pipeline
Parallel Pipeline at faster than low message rate:
(latency for condition variable and thread suspend/start removed/reduced)
---task 1---}
(1) one message percolating through queue
---task 2---}
(1) one message percolating through queue
---task 3---}
Depending on O/S and thread library this may be faster than your current pthread implementation.


Parallel Pipeline High Message Rate:
---task 1---}
(n-x) n messages percolating through queue
---task 2---} (first of several threads)
---task 2---} (second of several threads)
...
---task 2---} (x'th thread of several threads)
(m-y) m messages percolating through queue
---task 3---> (first of several threads)
---task 3---> (second of several threads)
...
---task 3---> (y'th thread of several threads) [/bash]
Where:

task 1 is a serialized task performing the blocking I/O (simulated in your test program)
task 2 code under moderate to high loads can run multiple instances (tasks) concurrently on multiple threads
task 3 code under moderate to high loads can run multiple instances (tasks) concurrently on multiple threads

I took the liberty of writing "task n" instead of "task n code" in the diagrams, since the shorter form keeps them readable.

Task 3 code, as diagrammed, is a compute-bound task. If task 3 were performing I/O, then it might be necessary to serialize that code (as was done for the first stage of the pipeline).

While I cannot attest to the internals of the TBB parallel_pipeline, I can attest to the internals of the QuickThread parallel_pipeline.

At very low messaging rates, for both TBB and QuickThread, the task pool threads would likely be suspended/resumed, and thus when a message comes in and is enqueued, the program would suffer a latency of:

<...> + <...> + <...>

Whereas the traditional pthread pipeline, as originally coded, would only have the kernel latency to wake up the thread.
Therefore, for low messaging rates, a parallel_pipeline has unnecessary overhead.

Let's look at the latencies for moderate message rates with parallel_pipeline:

<...> + <...>

Under moderate messaging rates, the kernel latency to wake up threads is eliminated. Whereas in your current code, you still suffer the <kernel latency to wake up the thread>.

As to whether the remaining latencies are longer or shorter than <...>, I cannot say. And in any event, this would depend on the particular version of the O/S, the processor architecture, and socket juxtapositioning.

Now let's look at the high messaging rates for parallel_pipeline:

TBB: I cannot attest to the internals of TBB. It may or may not contain:

<...> + <...>

QuickThread has a highly optimized parallel_pipeline:

<...> + <...>

Where each latency is:

<...> + <InterlockedIncrement + write to ring buffer>

The above latency, when the pipeline is not stalled, is very short.

The interesting thing about parallel_pipelines (both TBB and QuickThread) is that as throughput steps up, average latency reduces. Also, when your I/O stage exceeds the serial processing time of any one pipe (stage) of the pipeline, then the individual pipes can run in parallel - meaning little or no throttling of I/O stage.

The QuickThread threading library has two classes of threads, compute and I/O, whereas TBB has but one class. Additionally, with QuickThread, you can conditionally precondition the pipeline to use a specific socket(s) or have NUMA node sensitivity.

Jim Dempsey
jimdempseyatthecove
Honored Contributor III
Sorry for the formatting mess in my prior post. This forum's message editor has problems with text containing angle brackets. I know this is an HTML thing, but this is also a programming forum. Angle brackets in the body text should be treated as plain text by default; if you want HTML tags, that should be done via an optional button, e.g. a button to insert an HTML tag at the cursor. HTML during composition should be the exception, not the rule. grumble, grumble, grumble

Jim Dempsey
matt_garman
Beginner
Quoting jimdempseyatthecove

There is a different way to program this problem called a parallel pipeline. Both TBB and QuickThread have parallel_pipeline capability and you can do this with pthreads with a little more work.

Diagrams of your current pipeline code and parallel_pipeline

[ ... snip ... ]

The interesting thing about parallel_pipelines (both TBB and QuickThread) is that as throughput steps up, average latency reduces. Also, when your I/O stage exceeds the serial processing time of any one pipe (stage) of the pipeline, then the individual pipes can run in parallel - meaning little or no throttling of I/O stage.

The QuickThread threading library has two classes of threads, compute and I/O, whereas TBB has but one class. Additionally, with QuickThread, you can conditionally precondition the pipeline to use a specific socket(s) or have NUMA node sensitivity.

Jim Dempsey

Hi Jim,

Thank you for that detailed explanation/suggestion.

Is it fair to say that the gist of the parallel_pipeline is that stages of the pipeline themselves are multi-threaded?

The whole design, as originally posted, is based on the idea that the processing in Thread 2 might sometimes be slower than the message rate (i.e. the arrival rate in Thread 1). This made me think about a couple of things: for the case where it's known that the arrival rate will always be less than or equal to the processing rate, the queues are unnecessary; the whole process could be serialized for simplicity. But the low message rate case isn't really interesting. :)

For higher message rates, I'm not sure a parallel_pipeline would always work in our case. In my simplified test program, it would. However, in real life, the ordering of the messages must be maintained. Generally they come into the process correctly ordered, and they must absolutely leave the process correctly ordered. So while it seems that parallel_pipeline would reduce latency and increase throughput at the pipeline level, there would be some additional overhead to ensure correct output ordering. Perhaps the ordering process itself could be another stage in the pipeline? That may or may not be a deal-breaker; I think it would have to be experimentally determined.

That is, assuming my understanding of parallel_pipeline is correct! :)
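
For what it's worth, a rough sketch of that extra ordering stage might look like the following (untested; the names Msg, Reorderer, and emit are made up): messages carry a sequence number, and a serial final stage parks out-of-order arrivals until they can be released in sequence.

[cpp]// Rough sketch of a re-ordering stage: out-of-order completions are parked
// until they can be released strictly in sequence-number order.
#include <cstdio>
#include <map>

struct Msg { unsigned long seq; };

static void emit(Msg* m) {                    // stand-in for "send downstream"
    std::printf("released %lu\n", m->seq);
    delete m;
}

class Reorderer {
    std::map<unsigned long, Msg*> pending;    // finished but not yet releasable
    unsigned long next;                       // next sequence number to release
public:
    Reorderer() : next(0) {}
    void push(Msg* m) {                       // called from the serial final stage
        pending[m->seq] = m;
        std::map<unsigned long, Msg*>::iterator it;
        while ((it = pending.find(next)) != pending.end()) {
            emit(it->second);
            pending.erase(it);
            ++next;
        }
    }
};

int main() {                                  // toy demo: arrive 2,0,1 -> leave 0,1,2
    Reorderer r;
    unsigned long order[3] = { 2, 0, 1 };
    for (int i = 0; i < 3; ++i) { Msg* m = new Msg; m->seq = order[i]; r.push(m); }
    return 0;
}[/cpp]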

Thanks again!

Matt
