Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Linux thread scheduling on Nehalem

matt_garman
Beginner
We have a lot of multi-threaded, event-driven processes that are more sensitive to latency than throughput.

We recently moved some of these processes from a CentOS 4.x server to one running CentOS 5.x (i.e. effectively the same as going from RHEL4 to RHEL5). While both machines are Nehalem-based, the CentOS5 box is newer (faster clock speed, dual hex-core vs dual quad-core).

We actually saw performance decrease.

Upon further investigation, I found that I could fix the performance problem by using taskset to force all of the program's threads to run on the same physical CPU package.
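
For example, on our boxes the even-numbered CPUs all sit on one physical package, so something like

taskset -c 0,2,4,6 ./test_program

pins every thread of the process to that package (illustrative only; the exact core list is machine-specific).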

In other words, it appears that some Linux kernel change resulted in sub-optimal thread scheduling on machines with multiple physical Nehalem CPUs.

I wrote a simple program that simulates the behavior of the real production program. It has three threads and two shared queues. This is what it does:

Thread 1: in a continuous loop, about every 1--3 milliseconds:
- takes a timestamp using gettimeofday()
- creates an object about 100 bytes in size
- stores the timestamp in the object
- pushes the object into a thread-safe queue, called Queue A

Thread 2: in a continuous loop:
- pulls objects off of Queue A
- does some processing (a basic CRC calculation; works on fixed, static data; takes about 10 microseconds)
- creates a new object about 200 bytes in size
- copies the timestamp from the dequeued object to the new object
- deletes the dequeued object
- pushes the new object into another thread-safe queue, called Queue B

Thread 3: in a continuous loop:
- pulls an object off of Queue B
- takes a timestamp using gettimeofday()
- writes the delta of the two timestamps to a fixed, pre-allocated array
- deletes the dequeued object
- after a fixed number of objects has been processed (usually 25,000), generates statistics on the timestamp deltas

I ran this program on four different configurations:
(1) Core2-based system running CentOS 4
(2) Core2-based system running CentOS 5
(3) Nehalem-based system running CentOS 4
(4) Nehalem-based system running CentOS 5

All systems have dual physical CPU packages. In all but the last case (CentOS5 + Nehalem), the average timestamp delta was about 20 microseconds, with low variance.

On the CentOS5+Nehalem machine, however, the average timestamp delta was anywhere from 50 to 125 microseconds, with very high variance. As stated above, though, I could use taskset to force this test program onto a single physical CPU and get the results back in line with the other three configurations.

Note that for all testing, the machines were otherwise unloaded.

Does anyone have any thoughts on what might cause this discrepancy? Short of manually configuring all production processes with taskset, is there some kernel setting or system setting that can improve on this behavior?

Thank you!
Matt
jimdempseyatthecove
Honored Contributor III

The Linux kernel may be configured differently: one system may place threads compactly and the other sparsely.

From the description of your application, it would seem that compact placement would be better. Assuming you are using OpenMP:
KMP_AFFINITY=granularity=fine,compact

Pthreads or a Linux configuration utility may have similarly purposed settings.
--------------
Your thread 1 comments state 1--3 milliseconds.
Is there a timed wait, or is this as fast as it goes?

Can you post your simple test program?

Jim Dempsey

matt_garman
Beginner
I'm not using OpenMP, I'm using pthreads. I'm not aware of a KMP_AFFINITY-type setting for pthreads (although that doesn't mean one doesn't exist!). I can use taskset from the command line to restrict the cores my program can use, or, within the program itself, I can use sched_setaffinity() to bind the process (or even individual threads) to one or more cores.
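
For reference, a minimal sketch of the kind of per-thread pinning I mean (assumes glibc/NPTL; error handling omitted):

#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to one logical CPU. Under NPTL,
   sched_setaffinity() with pid 0 affects only the calling thread,
   so each thread can pin itself at startup. */
static int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

Each thread calls, say, pin_self_to_cpu(0) at startup; this is the per-thread equivalent of what taskset does for the whole process.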

The 1--3 ms in thread 1 is a timed wait. The real program receives data from the network at roughly that rate. So here I am trying to simulate that.

Unfortunately, I cannot post the test program, but I can describe it in more detail, if necessary.

-------------

Here are some more interesting observations I've made after further testing.

I modified the test program to allow CPU affinitization on a per-thread basis. My goal was to force the CentOS5 behavior to appear on CentOS4. I was able to worsen the performance on CentOS4 by about 50% (from a 20 usec average service time to 30 usec). But I could not get the performance to degrade as much as it does on CentOS5 (service times around 100 usec, varying from 50 to 150).

On both machines (CentOS4 and CentOS5), odd-numbered CPUs are physical package 0, and even-numbered CPUs are physical package 1.

So I ran the program with thread1 on cpu0, thread2 on cpu1, and thread3 on cpu2. In other words, adjacent threads in the pipeline were forced onto separate physical packages. Doing this on CentOS4 caused the 50% performance penalty; doing it on CentOS5 caused the >100% penalty.

Thanks again!
Matt
TimP
Honored Contributor III
The OpenMP library developers have said it is possible to invoke KMP_AFFINITY by making a single OpenMP function call, such as omp_get_num_threads(), which will set affinity for the pthreads (if running Linux).
We've heard that RHEL4 (even the most recent updates) didn't have full support for Nehalem platforms to the extent that RHEL5.3 and later do, so I'm not surprised you saw annoying changes.
I'm surprised that you had a machine with even numbered cores on one CPU and odd on the other. It's a good point that you must look for variations in the setup; inexplicable variations do occur. Much pre-existing software gets confused by HyperThreading, for example.
Dmitry_Vyukov
Valued Contributor I
> I'm surprised that you had a machine with even numbered cores on one CPU and odd on the other

That's the behavior (scatter numbering) I always observe on Linux boxes. Windows, meanwhile, seems to always choose compact numbering; I guess it's now a kind of implicit contract for Windows (processors 0 and 1 are the physically closest). I can't say for Linux, perhaps there are some boxes out there with compact numbering. But the conclusion I've drawn is that on Linux it's always better to consult /proc/cpuinfo regarding processor numbering.
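
For example:

grep -E 'processor|physical id|core id' /proc/cpuinfo

shows which logical processor numbers belong to which physical package (and core).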

Dmitry_Vyukov
Valued Contributor I
Quoting matt_garman

Upon further investigation, I found that I could fix the performance problem by using taskset to force all of the program's threads to run on the same physical CPU package.

In other words, it appears that some Linux kernel change resulted in sub-optimal thread scheduling on machines with multiple physical Nehalem CPUs.

There is no such thing as "optimal scheduling" at the OS level. There is only reasonable scheduling for an average program with typical patterns. Some scheduling decisions will speed up some programs while slowing down others, and if you try to incorporate some nontrivial intelligence into an OS, it will slow down all programs.

In general, modern OSes prefer independent threads (which is better for scaling anyway), i.e. they try to distribute threads as far from each other as possible (this gives more resources to each thread: more execution units, more cache, more memory bandwidth). It seems your program's threads are not so independent (constant communication), so if you need optimal scheduling you have to take control of thread placement yourself.

Dmitry_Vyukov
Valued Contributor I
Quoting matt_garman
Thread 1: in a continuous loop, about every 1--3 milliseconds:
- takes a timestamp using gettimeofday()
- creates an object about 100 bytes in size
- stores the timestamp in the object
- pushes the object into a thread-safe queue, called Queue A

Thread 2: in a continuous loop:
- pulls objects off of Queue A
- does some processing (a basic CRC calculation; works on fixed, static data; takes about 10 microseconds)
- creates a new object about 200 bytes in size
- copies the timestamp from the dequeued object to the new object
- deletes the dequeued object
- pushes the new object into another thread-safe queue, called Queue B

Thread 3: in a continuous loop:
- pulls an object off of Queue B
- takes a timestamp using gettimeofday()
- writes the delta of the two timestamps to a fixed, pre-allocated array
- deletes the dequeued object
- after a fixed number of objects has been processed (usually 25,000), generates statistics on the timestamp deltas

So where is thread scheduling involved here? How are the consumer threads polling the queue (blocking, active spinning, passive spinning, hybrid spinning, spinning then blocking)? If you are using mutex+condvar, then the change in behavior is potentially due to a change in the pthread implementation.

If you care about latency (HFT?) you must use active spinning, and then the OS with its scheduling is out of the game. But then, once again, you need to pin your threads to the same CPU to decrease communication latencies.
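
A minimal sketch of what I mean by active spinning (try_pop() and process() are hypothetical application hooks):

#include <xmmintrin.h>   /* _mm_pause() */

struct queue;                           /* opaque; implementation elsewhere */
extern void *try_pop(struct queue *q);  /* hypothetical nonblocking dequeue */
extern void process(void *msg);         /* hypothetical message handler */

/* The consumer burns its core instead of sleeping on a condvar, so
   there is no scheduler wakeup on the critical path. */
void consumer_loop(struct queue *q)
{
    for (;;) {
        void *msg;
        while ((msg = try_pop(q)) == NULL)
            _mm_pause();   /* keep the spin friendly to the HT sibling */
        process(msg);
    }
}

The cost is one core pegged at 100% per spinning consumer, which is usually acceptable in latency-critical setups.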

Dmitry_Vyukov
Valued Contributor I

So where is thread scheduling involved here? How are the consumer threads polling the queue (blocking, active spinning, passive spinning, hybrid spinning, spinning then blocking)? If you are using mutex+condvar, then the change in behavior is potentially due to a change in the pthread implementation.

If you care about latency (HFT?) you must use active spinning, and then the OS with its scheduling is out of the game. But then, once again, you need to pin your threads to the same CPU to decrease communication latencies.


Btw, a pipeline is an anti-pattern for low latency. If you want low latency, why in the world would you require a message to traverse several threads?

Once a thread wakes up on IO, it reads the message and processes it to completion (that's the lowest possible latency... provided, of course, that you can't do intra-message parallelization). Meanwhile, another thread blocks on IO to read and process the subsequent message. And so on.
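
A sketch of that shape (recv_message() and handle() are hypothetical; N identical threads share the IO source, with no inter-thread queues):

extern void *recv_message(void);  /* hypothetical blocking read (socket, etc.) */
extern void handle(void *m);      /* hypothetical: full processing, then free */

/* Each worker blocks on IO, then runs the message to completion.
   No handoff between threads, hence no wakeup latency on the path. */
void *worker(void *arg)
{
    (void)arg;
    for (;;)
        handle(recv_message());
}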

jimdempseyatthecove
Honored Contributor III
Matt,

The reason I asked for the test program was to investigate some of the details that are hard to convey via a forum post. Maybe we can take this offline via email.

Part of your observed performance penalty may have nothing to do with the caching system; rather, it could be due to thread scheduling overhead as it applies to the various sockets, differing between CentOS4 and CentOS5.

As expressed in your earlier post, you have:

thread 1
timed wait
read timer
allocate message block type 1
insert timer value into message block type 1
enqueue to thread 2 *note 1
loop

Thread 2
wait for message from thread 1 *note1
read message
doWork2()
allocate message block type 2
copy data from message type 1 to message type 2
enqueue to thread 3 *note2
delete message type 1
loop

Thread 3
wait for message type 2 *note2
read message type 2
doWork3()
read time value from message type 2
delete message type 2
read time
compute delta time
save to global buffer
loop

Note 1

It is unknown whether threads 2 and 3 are in spin waits or waiting on a condition variable. Since thread 1 contains a timed wait, I will assume threads 2 and 3 are waiting on condition variables. If this assumption is true, then thread 2 will experience a latency between the time thread 1 signals the condition variable and the time thread 2 executes the first instruction following its wait. This latency, call it latency 1, appears to vary considerably with the placement of thread 1 vs thread 2, and with the socket of thread 1 vs the socket of thread 2 (vs the socket of the kernel service?).

Note 2

... (see Note 1) ... This latency, call it latency 2, appears to vary considerably with the placement of thread 2 vs thread 3, and with the socket of thread 2 vs the socket of thread 3 (vs the socket of the kernel service?).

An additional latency, assuming the above sketch is correct, is in the allocation/deallocation of memory for your messages. Some of this overhead can be eliminated by use of a scalable allocator or private message pools. Further latency can be reclaimed by reordering your statements (again assuming the sketch is correct):


thread 1
allocate message block type 1 *** preallocate
timed wait
read timer
insert timer value into message block type 1
enqueue to thread 2 *note 1
loop

Thread 2
allocate message block type 2 *** preallocate
wait for message from thread 1 *note1
read message
doWork2()
copy data from message type 1 to message type 2
enqueue to thread 3 *note2
delete message type 1
loop

Thread 3
wait for message type 2 *note2
read message type 2
doWork3()
read time value from message type 2
read time *** task is done prior to delete
delete message type 2
compute delta time
save to global buffer
loop


As to if/why latencies for condition variables (or the general heap) differ between CentOS4 and CentOS5, that is beyond the scope of this forum.

Jim Dempsey
matt_garman
Beginner
Quoting matt_garman
Thread 1: in a continuous loop, about every 1--3 milliseconds:
- takes a timestamp using gettimeofday()
- creates an object about 100 bytes in size
- stores the timestamp in the object
- pushes the object into a thread-safe queue, called Queue A

[ ... snipped ... ]

So where is thread scheduling involved here? How are the consumer threads polling the queue (blocking, active spinning, passive spinning, hybrid spinning, spinning then blocking)? If you are using mutex+condvar, then the change in behavior is potentially due to a change in the pthread implementation.

If you care about latency (HFT?) you must use active spinning, and then the OS with its scheduling is out of the game. But then, once again, you need to pin your threads to the same CPU to decrease communication latencies.

The consumer threads are using the mutex+condition variable approach.
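
For reference, the queues look roughly like this (a minimal sketch, not the actual code, which I can't post):

#include <pthread.h>
#include <stddef.h>

/* Unbounded FIFO guarded by one mutex + condvar; fields are assumed
   initialized with PTHREAD_MUTEX_INITIALIZER / PTHREAD_COND_INITIALIZER. */
struct node  { struct node *next; void *payload; };
struct queue {
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
    struct node *head, *tail;
};

void queue_push(struct queue *q, struct node *n)
{
    n->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail) q->tail->next = n; else q->head = n;
    q->tail = n;
    pthread_cond_signal(&q->nonempty);  /* the wakeup whose latency we measure */
    pthread_mutex_unlock(&q->lock);
}

struct node *queue_pop(struct queue *q)
{
    pthread_mutex_lock(&q->lock);
    while (q->head == NULL)
        pthread_cond_wait(&q->nonempty, &q->lock);
    struct node *n = q->head;
    q->head = n->next;
    if (q->head == NULL)
        q->tail = NULL;
    pthread_mutex_unlock(&q->lock);
    return n;
}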

It hadn't occurred to me that a pthread implementation change could be the cause of what I'm seeing. But now that you mention it, it makes sense.

And that's really what I'm looking for: exactly what changes to libpthread, the kernel, etc. are affecting our particular case? I was biased towards the kernel being the culprit, given that RHEL5 introduced Nehalem support, plus the fact that RHEL5+Core2 doesn't exhibit this behavior. Is libpthread aware of what CPU it's running on?

The comments in your previous post regarding "optimal" scheduling are interesting as well. I was admittedly too narrowly focused on our particular use case, and assumed optimal=what works best for us! :) But your point makes sense. My question then, is that really considered the typical case, i.e. that inter-thread communication is atypical, and therefore it makes the most sense to spread threads as far apart as possible? (Honest question, not trying to argue.)

TimP
Honored Contributor III
Quoting Dmitry_Vyukov
> I'm surprised that you had a machine with even numbered cores on one CPU and odd on the other

That's the behavior (scatter numbering) I always observe on Linux boxes. Windows, meanwhile, seems to always choose compact numbering; I guess it's now a kind of implicit contract for Windows (processors 0 and 1 are the physically closest). I can't say for Linux, perhaps there are some boxes out there with compact numbering. But the conclusion I've drawn is that on Linux it's always better to consult /proc/cpuinfo regarding processor numbering.

While it's difficult to observe the BIOS numbering of the logical processors on Windows without a tool such as comes with Intel MPI or openmpi (if that one works on Windows), or Intel OpenMP, this is (unfortunately?) not under the control of the OS. The last dual-socket platform we saw which typically had a full scatter numbering, with even cores on CPU 0, was the Harpertown Xeon 54xx. There were two variations on 54xx which couldn't be distinguished by reading /proc/cpuinfo. The full table is present in current Linux distros and can be viewed e.g. by 'irqbalance -debug'. Supposedly, that setup should have made it easier for Windows schedulers to optimize scheduling of 2 or 4 threads, but, evidently, it was more trouble than benefit.
Prototype Nehalem Xeon 55xx platforms had variations in how the sibling logicals were numbered, but most production EP platforms are numbered with 0 and 1 being siblings on core 0, and the cores on CPU 0 listed ahead of the cores on CPU 1, such that KMP_AFFINITY=compact,1 will give one logical per core, in physical order, followed by the list of the second logicals (when HyperThreading is enabled).
Also, unfortunately, while it is usual for Core i7 desktops and Nehalem 55xx or Westmere 56xx EP servers to compact the list when HyperThreading is disabled, making the numbering consistent with CPUs from other vendors, there are other products where the numbering of active cores doesn't change when HyperThreading is disabled, leaving gaps in the sequence.
As far as I know, there is no table which tells which cores on Xeon 56xx share paths to cache, but that's normally cores [0,1], [2,3], [6,7], [8,9] (no scattering).
Dmitry_Vyukov
Valued Contributor I
Quoting tim18
While it's difficult to observe the BIOS numbering of the logical processors on Windows... this is (unfortunately?) not under the control of the OS.

So numbering can be either way (compact/scatter) on both Windows and Linux, right?

Dmitry_Vyukov
Valued Contributor I

> And that's really what I'm looking for: exactly what changes to libpthread, the kernel, etc. are affecting our particular case? I was biased towards the kernel being the culprit, given that RHEL5 introduced Nehalem support, plus the fact that RHEL5+Core2 doesn't exhibit this behavior. Is libpthread aware of what CPU it's running on?

I dunno.


> My question then, is that really considered the typical case, i.e. that inter-thread communication is atypical, and therefore it makes the most sense to spread threads as far apart as possible?


AFAIK, Windows, Linux and Solaris have all switched to "distribute threads as far as possible" scheduling.
I think both patterns are rather typical. However, an OS has to make scheduling decisions without any information about the application: its communication patterns, which threads communicate with which, phases of work, etc. In that light, it's better to distribute threads, I think.

matt_garman
Beginner
Matt,

The reason I asked for the test program was to investigate some of the details that are hard to convey via a forum post. Maybe we can take this offline via email.

Part of your observed performance penalty may have nothing to do with the caching system; rather, it could be due to thread scheduling overhead as it applies to the various sockets, differing between CentOS4 and CentOS5.

[ ... snip ... ]

It is unknown whether threads 2 and 3 are in spin waits or waiting on a condition variable. Since thread 1 contains a timed wait, I will assume threads 2 and 3 are waiting on condition variables. If this assumption is true, then thread 2 will experience a latency between the time thread 1 signals the condition variable and the time thread 2 executes the first instruction following its wait. This latency, call it latency 1, appears to vary considerably with the placement of thread 1 vs thread 2, and with the socket of thread 1 vs the socket of thread 2 (vs the socket of the kernel service?).

Note 2

... (see Note 1) ... This latency, call it latency 2, appears to vary considerably with the placement of thread 2 vs thread 3, and with the socket of thread 2 vs the socket of thread 3 (vs the socket of the kernel service?).

An additional latency, assuming the above sketch is correct, is in the allocation/deallocation of memory for your messages. Some of this overhead can be eliminated by use of a scalable allocator or private message pools. Further latency can be reclaimed by reordering your statements (again assuming the sketch is correct):


[ ... snip ... ]

As to if/why latencies for condition variables (or the general heap) differ between CentOS4 and CentOS5, that is beyond the scope of this forum.

Jim Dempsey

One comment: note that thread 3 doesn't have any kind of doWork() functionality. Only thread 2 does "real" work. The other two basically just shuffle data around.

When you use the word "sockets" in the above post, are you using it in the general sense, or talking specifically about network sockets? I think you are using it in the abstract sense, but, just to be clear, the test program doesn't have any network sockets.

Your assumption about condition variables is true. There are no spin locks in this program.

I'm afraid I'm creeping towards your last statement, although I would modify it to say, "it appears that latencies for condition variables and/or heap operations are conditionally different between CentOS4 and CentOS5." In other words, depending on which cores the threads execute on, latencies can be dramatically different. The problem is, I'm having trouble finding detailed change information. My initial post was a bit of a long shot, hoping a Linux kernel expert hangs out here.

Anyway, I can certainly modify the program according to your suggestions to see if that causes any interesting changes (and perhaps eliminate heap operations as a possible culprit).

Thank you for all your help!

Matt

TimP
Honored Contributor III
The compact and scatter options of Intel KMP_AFFINITY have to unscramble whatever happens at BIOS and OS boot-up. I've heard that a change in numbering has been observed on prototype hardware when rebooting; if so, that would detract from my argument that the BIOS alone determines this.
As I said, you expect to see the change from scatter order on Xeon 54xx to linear on 55xx. Among possible reasons for the change would be that applications using 1 or 2 threads would not normally run out of memory bandwidth on Xeon 55xx; besides, Windows 7 had committed to improve scheduling by that time.
matt_garman
Beginner

AFAIK, Windows, Linux and Solaris have all switched to "distribute threads as far as possible" scheduling.
I think both patterns are rather typical. However, an OS has to make scheduling decisions without any information about the application: its communication patterns, which threads communicate with which, phases of work, etc. In that light, it's better to distribute threads, I think.

If both patterns are typical, I think Linux should provide a tunable/configuration option that allows the sysadmin to specify the thread scheduling policy. Maybe it does? I know the I/O scheduler is configurable.

Like you said, the OS doesn't know anything about the applications. But if the application programmer and/or sysadmin can give the OS some info on typical application behavior, it could make better scheduling decisions.

Dmitry_Vyukov
Valued Contributor I
Quoting matt_garman
When you use the word "sockets" in the above post, are you using it in the general sense, or talking specifically about network sockets? I think you are using it in the abstract sense, but, just to be clear, the test program doesn't have any network sockets.

I suspect by "sockets" he means processors (i.e. physical processor packages).

Dmitry_Vyukov
Valued Contributor I
Quoting tim18
The compact and scatter options of Intel KMP_AFFINITY have to unscramble whatever happens at BIOS and OS boot-up. I've heard that a change in numbering has been observed on prototype hardware when rebooting; if so, that would detract from my argument that the BIOS alone determines this.
As I said, you expect to see the change from scatter order on Xeon 54xx to linear on 55xx. Among possible reasons for the change would be that applications using 1 or 2 threads would not normally run out of memory bandwidth on Xeon 55xx; besides, Windows 7 had committed to improve scheduling by that time.

Well, but an OS can reorder the numbers as it wants... so it should depend not only on the processor model but also on the OS.

I expect to see the change from scatter order on Xeon 54xx to linear on 55xx... on what OS? On all OSes?

Dmitry_Vyukov
Valued Contributor I
Quoting matt_garman

Like you said, the OS doesn't know anything about the applications. But if the application programmer and/or sysadmin can give the OS some info on typical application behavior, it could make better scheduling decisions.

Isn't that what you have done with taskset? ;)

And for finer-grained control, a programmer can manage thread affinities manually: bind threads that communicate to the same socket, bind independent threads to different processors or even NUMA nodes; some tight communication will even warrant binding to HT-sibling threads, etc.

TimP
Honored Contributor III
In order to use taskset, or the similar interfaces provided by GOMP_CPU_AFFINITY, KMP_AFFINITY="proclist=....", I_MPI_PIN_PROCS, etc., you must know the numbering scheme as seen by Linux. So it is not possible at this level to make it portable across the schemes adopted by various BIOS writers. Once you have discovered the scheme on your platform, you can adjust your numbers accordingly.
Each new platform (more cores, ....) tends to break the automatic options as well as possibly requiring kernel tuning.
matt_garman
Beginner
Quoting matt_garman

Like you said, the OS doesn't know anything about the applications. But if the application programmer and/or sysadmin can give the OS some info on typical application behavior, it could make better scheduling decisions.

Isn't that what you have done with taskset? ;)

And for finer-grained control, a programmer can manage thread affinities manually: bind threads that communicate to the same socket, bind independent threads to different processors or even NUMA nodes; some tight communication will even warrant binding to HT-sibling threads, etc.


Yeah, taskset/sched_setaffinity() does what I need. :)

I was thinking more along the lines of a system-level configuration setting, like the configurable I/O schedulers. Something like a sysctl that instructs the kernel to assume threads should be "densely" scheduled.
