I have a parallel algorithm that runs very well on a 2x Xeon Duo (Core 2) if both threads are placed on the same die, and terribly if they're placed on seperate dies. Unfortunately, Windows wants to place them on seperate dies, probably due to load balancing or something. I notice that this issue was recently discussed in a hypothetical way on this forum by Tim Mattson (http://softwareforums.intel.com/en-us/forums//topic/51654#35284):
By the way, the technical issues behind your question are a frequent topic of conversation among designers of multi-threaded languages.... The problem is with the caches. If Ive gone to great lengths to fill my caches with the data they need and then the OS migrates my threads to improve the load balance, my performance could suffer due to all the extra cache misses. This will be even more important on systems with complex cache hierarchies (such as a multiple socket system with multi-core processors in each socket). Hence, you may someday see changes in multithreading APIs to address this issue and somehow lock threads down to processors. This is a controversial topic, however, and it will take a while to work out how APIs need to change (if at all) to address this problem.
Incidentally, I recently read a paper ("Architectural considerations for efficient software execution on parallel microprocessors" - try googling it) that quantifies this effect - it is considerable.
So it's nice to know that people are considering this problem. Are there any solutions I can use now to ensure that my two threads share a cache? I believe that, in theory, I could use the CPUID instruction to retrieve the processor topology and that would tell me whether I need to lock threads down at all. However, I don't know of any way to say "you may schedule these threads anywhere you like as long as they're together." Does such an instruction exist?
Does anyone know if Linux has any better solution to this problem? Or is it a problem at all (ie, does it recognize shared caches and reschedule threads accordingly)? I have not yet had the time to test myself or go looking through the man pages for possible solutions.
Now that Vista is out the door, does anyone know when the next opportunity will be to start introducing cache-aware thread scheduling? I assume it can only be done at the OS level. I know NUMA was retroactively added to Windows XP in SP2 so perhaps there's hope either the scheduler can be updated or an API can be added to help this problem as well.
You could write a cache layoutdiscovery routine and assume something about the intentions of the O/S, and lock your app to the favorable processors.
For eachserver your application will run on then it might be easiest to set the affinity mask by trial and error. The affinity mask is to be passed to the application by an environment variable or a registry location.If you use the registry location then the application could periodically consult the location, and if changed, migrate to the new set of processors.
An alternative route would be to add code to dynamically discover favorable affinity settings for the system it is running on. (someone has probably filed aPatent on this).
a) The application starts without any affinity restrictions.
b) The application contains performance counters for each thread and the counters are updated at a convenient synchronization point. Without significant overhead to the app.
c) In addition to this each thread statistically samples which processor it is running on.
d) After a warming up period a determination is made as to which processor was used most for the application. This will indicate which processor was available most for the application.
e) Determine the application thread with the most processing time and set affinity to that processor. Note, if that thread is not one of the threads that is co-dependent on sharing data then you may want to pick a thread with co-dependency to place in this processor.
f) Using knowledge of the application as to which threads benefit from sharing the same cache, pick the next thread that is a partner of the first thread chosen in e)
g) Use a hunting strategy for the most favorable other processor. e.g. from the list of available processors, exclude the chosen processors (only 1 chosen at this time), pick the nearest available processor to the left (affinity bit wise).
h) Run for a while, collecting elapse time for the application.
i) nearest available processor to the right (affinity bit wise).
j) Run for a while, collecting elapse time for the application.
k) Continue this process until either you see a dramatic difference in elapse time, and take an early exit. Or have tried all otherprocessor combinations and pick the one with the lowest runtime.
l) Repeat f) through k) for additional threads.
Once the application has assigned all such threads you will have favorable collection of processors at this particular moment. Store this information for future reference.
As the application runs, periodically monitor the performance of the application. If it diminishes beyond a threshold you choose then repeat the affinity search routine. As you discover the favorable affinity settings for the moment keep track of what the affinity settings were and the frequency each setting occurred. Then for this system this table can be used to in lieu of the search technique.
Note, the above algorithm is a starting point. It would need to detect the condition of the most available processor being least desirable for cache sharing (my not have other processors sharing or other cache sharing processors affinity locked to other application).
The dynamic approach has advantages:
1) If some other application, or the OS, decides to lock on to one of the processors in your set, your application can move to a different set of most favorable
2) When you application changes or the hardware platform changes you do not need to revisit how to discover the next gen cache layout.
3) Your application as it sits now, and with the data set it uses now, may find it favorable to share the same cache for a given set of threads. Should your data set grow or change in layout or change in algorithm then cache sharing may no longerbe favorable.
I do not personally use the dynamic approach since I have a dedicated server. Some of my applicationrun times have been several 100's of hours. The overhead of the monitoring and migration code would be negligible. Essentially you would be adding QoS (quality of service). If the application detects a drop in service it can spend a little time determining the most favorable conditions.
For you current case of wanting to run on your current system. 2x Dual Core and where you may want to run two instances of the application. The startup of the application can spend a little time (say 1ms) determining which processor it runs most on. Then from this information, select the cache sharing partner that you know in advance. Either (0/1, 2/3) or (0/2, 1/3). Set the affinity to the partners of which your initialization code ran on. Note, should another instance of your application be running then the initialization code will likely be running on one of the processors not in the set of the other application. Now the two instances of the application run on different pairs of processors. Also, three instances will run on the most favorable sets, and four instances most favorable sets...
What you wouldn't have though without a little more coding is when applications retire, the set of favorable processor pairs may change.
That's a lot to think about, thanks - I'll look into some of those techniques. Pity the OS doesn't handle all that yet but I guess we'll just need to wait a while until it does...
Certain OpenMP implementations include means for specifying a preference for sharing cache (as must be done to use all cores of current CPUs) or different cache. Improved performance is likely when adjacent threads share a cache. On multiple socket multiple core platforms, the advantage of using all paths to memory would normally outweigh advantage of sharing cache, unless thereis false sharing.
On current linux distros, taskset is available to set cache affinity outside the program. On Windows, if you want to avoid use of the 2nd core or logical HT processor, without rebooting, youcould usethe affinity check boxes.