The Linux kernel may be configured differently on the two systems. One places the threads compactly and the other sparsely.
From the description of your application, it would seem that compact placement would be better.
Assuming you are using OpenMP, the runtime's affinity settings can control this.
Pthreads or Linux configuration tools may have settings that serve a similar purpose.
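For example (assuming the GNU or Intel OpenMP runtimes; the file name here is hypothetical), a small probe like this shows where each thread lands, and the placement can then be steered from the environment:

```c
/* Hypothetical probe: print where each OpenMP thread lands.
 * Build (assumed file name):  gcc -fopenmp placement.c -o placement
 * Compact placement, e.g.:
 *   GOMP_CPU_AFFINITY="0 1 2 3" ./placement   (GNU libgomp)
 *   KMP_AFFINITY=compact ./placement          (Intel OpenMP runtime)
 */
#define _GNU_SOURCE   /* for sched_getcpu() */
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    {
        printf("thread %d running on cpu %d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}
```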
Your thread 1 comments state 1-3 milliseconds.
Is there a timed wait or is this as fast as it goes?
Can you post your simple test program?
There is no such thing as 'optimal scheduling' at the OS level. There is just some reasonable scheduling for an average program with typical patterns. Some scheduling decisions will speed up some programs while slowing down others, and if you try to incorporate some nontrivial intelligence into an OS, it will slow down all programs.
In general, modern OSes prefer independent threads (which is better for scaling anyway), i.e. they try to distribute threads as far from each other as possible (this gives more resources to each thread: more execution units, more cache, more memory bandwidth). It seems that your program's threads are not so independent (constant communication), so if you need optimal scheduling you have to take over control of thread placement yourself.
So where does thread scheduling come into play here? How are the consumer threads polling the queue (blocking, active spinning, passive spinning, hybrid spinning, spinning then blocking)? If you are using mutex+condvar, then the change in behavior is potentially due to a change in the pthread implementation.
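For reference, the blocking mutex+condvar variant I mean looks roughly like this sketch (queue_empty() and queue_pop() are hypothetical placeholders for the real queue):

```c
/* Sketch of a blocking mutex+condvar consumer: sleeps in
 * pthread_cond_wait() until a producer signals the queue non-empty. */
#include <pthread.h>

typedef struct msg msg_t;        /* message type: assumed defined elsewhere */
extern int    queue_empty(void); /* hypothetical */
extern msg_t *queue_pop(void);   /* hypothetical */

pthread_mutex_t q_lock     = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  q_nonempty = PTHREAD_COND_INITIALIZER;

msg_t *consume_blocking(void)
{
    pthread_mutex_lock(&q_lock);
    while (queue_empty())                        /* guards against spurious wakeups */
        pthread_cond_wait(&q_nonempty, &q_lock);
    msg_t *m = queue_pop();
    pthread_mutex_unlock(&q_lock);
    return m;                                    /* process outside the lock */
}
```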
If you care about latency (HFT?) you must use active spinning, and then the OS with its scheduling is out of the game. But then, once again, you need to pin your threads onto the same CPU to decrease communication latencies.
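A bare-bones sketch of active spinning, with a hypothetical ready flag standing in for your real queue state:

```c
/* Active spinning: the consumer never sleeps in the kernel, so
 * scheduler wakeup latency drops out entirely. */
#include <emmintrin.h>           /* _mm_pause(): spin-loop hint (SSE2) */

extern volatile int msg_ready;   /* hypothetical flag, set to 1 by the producer */

void consume_spinning(void)
{
    while (!msg_ready)
        _mm_pause();             /* eases pressure on the HT sibling */
    /* ... dequeue and process the message ... */
}
```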
Btw, a pipeline is an anti-pattern for low latency. If you want low latency, why in the world would you require a message to traverse several threads?
Once a thread wakes up on IO, it reads the message and processes it to completion (that's the lowest possible latency... assuming, of course, that you can't do intra-message parallelization). Simultaneously, another thread blocks on IO to read and process the subsequent message. And so on.
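In code, that single-stage design is roughly this (process() is a stand-in for the application logic):

```c
/* Single-stage alternative to a pipeline: each worker blocks on the fd
 * itself, then carries the message to completion with no thread
 * hand-offs in between. */
#include <unistd.h>

extern void process(const char *buf, ssize_t len);  /* hypothetical */

void worker_loop(int fd)
{
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);  /* block on IO, not on a queue */
        if (n <= 0)
            break;
        process(buf, n);                        /* run to completion here */
    }
}
```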
The consumer threads are using the mutex+condition variable approach.
I never thought that the pthread implementation change could be the cause of what I'm seeing. But now that you mention it, it makes sense.
And that's really what I'm looking for: exactly what were the changes to libpthread, kernel, etc., that are affecting our particular case? I was biased towards the kernel being the culprit, given that RHEL5 introduced Nehalem support; plus the fact that RHEL5+Core2 doesn't exhibit this behavior. Is libpthread aware of what CPU it's running on?
The comments in your previous post regarding "optimal" scheduling are interesting as well. I was admittedly too narrowly focused on our particular use case, and assumed optimal=what works best for us! :) But your point makes sense. My question, then: is that really the typical case, i.e. is inter-thread communication atypical enough that it makes the most sense to spread threads as far apart as possible? (Honest question, not trying to argue.)
One comment: note that thread 3 doesn't have any kind of doWork() functionality. Only thread 2 does "real" work. The other two basically just shuffle data around.
When you use the word "sockets" in the above post, are you using it in the general sense, or talking specifically about network sockets? I think you are using it in the abstract sense, but, just to be clear, the test program doesn't have any network sockets.
Your assumption about condition variables is correct. There are no spin locks in this program.
I'm afraid I'm creeping towards your last statement. Although I would modify it to say, "it appears that latencies for condition variables and/or heap operations are conditionally different between CentOS4 and CentOS5." In other words, depending on what cores the threads execute on, latencies can be dramatically different. The problem is, I'm having trouble finding out detailed change information. My initial post was a bit of a long shot, hoping a Linux kernel expert hangs out here.
Anyway, I can certainly modify the program according to your suggestions to see if that causes any interesting changes (perhaps eliminate heap operations as a possible culprit).
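For instance, one change I could try, sketched here with made-up names and sizes, is replacing per-message malloc/free with a preallocated pool:

```c
/* Hypothetical fixed pool: messages are recycled through a free list so
 * the hot path never touches malloc/free. The caller is assumed to hold
 * the lock that already guards the queue. */
#include <stddef.h>

#define POOL_SIZE 1024

typedef struct msg {
    struct msg *next;
    char        payload[256];
} msg_t;

static msg_t  pool[POOL_SIZE];
static msg_t *free_list;

void pool_init(void)
{
    for (size_t i = 0; i + 1 < POOL_SIZE; i++)
        pool[i].next = &pool[i + 1];
    pool[POOL_SIZE - 1].next = NULL;
    free_list = &pool[0];
}

msg_t *msg_alloc(void)          /* returns NULL when the pool is exhausted */
{
    msg_t *m = free_list;
    if (m)
        free_list = m->next;
    return m;
}

void msg_free(msg_t *m)
{
    m->next = free_list;
    free_list = m;
}
```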
Thank you for all your help!
If both patterns are typical, I think Linux should provide a tunable/configuration option that allows the sysadmin to specify the thread scheduling policy. Maybe it does? I know the I/O scheduler is configurable.
Like you said, the OS doesn't know anything about the applications. But if the application programmer and/or sysadmin can give the OS some info on typical application behavior, it could make better scheduling decisions.
Well, but an OS can renumber the logical CPUs as it wants... so it should depend not only on the processor model but also on the OS.
You expect to see the change from scatter order on Xeon 54xx to linear on 55xx... on which OS? On all OSes?
Isn't it what you have done with taskset? ;)
And as for finer-grained control, a programmer can manually set thread affinities: bind threads that communicate to the same socket, bind independent threads to different processors or even different NUMA nodes; some tight communication will require binding to HT sibling threads, etc.
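A sketch of that manual placement via pthread_setaffinity_np (a GNU extension); it assumes logical CPUs 0 and 1 share a socket on the machine in question, which should be checked against the actual topology first:

```c
/* Pin two communicating threads onto specific logical CPUs. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void pin_thread(pthread_t t, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(t, sizeof(cpu_set_t), &set);
}

/* After creating the threads, e.g.:
 *   pin_thread(producer_tid, 0);
 *   pin_thread(consumer_tid, 1);
 */
```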
Yeah, taskset/sched_setaffinity() does what I need. :)
I was thinking more along the lines of a system-level configuration setting, like the configurable IO schedulers. Something like a sysctl that instructs the kernel to assume threads should be "densely" scheduled.