This was posted on the Premier section. I think an additional post here is warranted.
When running multiple threads with IVF, whether by way of OpenMP or optimizations, on a system with a Hyperthreading-capable processor, an application suffers a significant performance hit if the placement of the thread stack pointers causes excessive numbers of cache conflicts. In C/C++ a workable hack is to use alloca to consume some stack space in an effort to reduce the cache conflicts. Unfortunately alloca is not available to FORTRAN users, and attempts in FORTRAN to distribute the stacks add unnecessary code. It is my observation that the stack alignment could be made part of the OpenMP implementation.
Determining whether Hyperthreading is in effect, and how best to align the stack, is best done from within the code that spawns the threads. Performing this in application code is difficult and ties the application to a platform.
The internal design of the cache system complicates this further. Determining the optimal actions is best done in the startup code and then integrated into the thread management code of OpenMP.
Looking at the preamble code of subroutines, one of the current optimization techniques is to align esp to a paragraph boundary prior to allocating stack space for local variables. When running under OpenMP on a Hyperthreading processor, a few instructions added to this esp alignment code are all that is necessary to assure optimal stack alignment.
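For readers who have not seen the C/C++ hack mentioned above, here is a minimal sketch of it. The names stagger_bytes, run_staggered and count_call, and the 1024-byte step, are made up for illustration; the right step depends on the cache design of the processor, and I am not claiming this is what any OpenMP runtime does internally.

```c
#include <assert.h>
#include <stddef.h>

#if defined(_MSC_VER)
#include <malloc.h>            /* _alloca on Microsoft compilers */
#define STACK_PAD(n) _alloca(n)
#else
#include <alloca.h>            /* alloca on glibc and similar systems */
#define STACK_PAD(n) alloca(n)
#endif

/* Hypothetical stagger step; an arbitrary illustration value. */
#define STACK_STAGGER_BYTES 1024u

/* Number of stack bytes thread `thread_index` consumes before doing work. */
size_t stagger_bytes(unsigned thread_index)
{
    return ((size_t)thread_index + 1u) * STACK_STAGGER_BYTES;
}

/* Consume a thread-dependent amount of stack, then call the real work
 * routine, so that each thread's frames sit at a different offset. */
void run_staggered(unsigned thread_index, void (*work)(void *), void *arg)
{
    assert(work != NULL);
    volatile char *pad = (volatile char *)STACK_PAD(stagger_bytes(thread_index));
    pad[0] = 0;   /* touch the padding so the allocation is not optimized away */
    work(arg);    /* locals of work() now live below the padding */
}

/* Example work routine: counts how many times it has been called. */
void count_call(void *arg)
{
    ++*(int *)arg;
}
```

Inside an OpenMP parallel region each thread would call run_staggered(omp_get_thread_num(), work, arg). This is exactly the kind of platform-specific clutter that has no FORTRAN equivalent.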
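The arithmetic behind such a preamble adjustment can be sketched in C as below. This is only an illustration under assumptions: the function names, the stagger value and the alias span are all hypothetical, addresses are modelled as plain integers, and a real implementation would be a few instructions emitted by the compiler or the runtime rather than C functions.

```c
#include <assert.h>
#include <stdint.h>

#define PARAGRAPH 16u   /* paragraph boundary used by the existing esp alignment */

/* Round a stack address down to a power-of-two alignment. */
uintptr_t align_down(uintptr_t sp, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return sp & ~(alignment - 1);
}

/* Move the stack pointer down by a per-thread stagger (stacks grow
 * downward on x86), then apply the usual paragraph alignment. */
uintptr_t staggered_sp(uintptr_t sp, unsigned thread_index, uintptr_t stagger)
{
    return align_down(sp - (uintptr_t)thread_index * stagger, PARAGRAPH);
}

/* Simplified aliasing model (an assumption): two addresses alias when
 * they have the same offset within a power-of-two alias span. */
int addresses_alias(uintptr_t a, uintptr_t b, uintptr_t alias_span)
{
    assert(alias_span != 0 && (alias_span & (alias_span - 1)) == 0);
    return ((a ^ b) & (alias_span - 1)) == 0;
}
```

With two stack bases at 0x00400000 and 0x00800000 and a 4 MB span, addresses_alias reports a conflict. After staggered_sp moves the second thread down by 1024 bytes, the two no longer share an offset.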
The dual core processors have the cache aliasing boundary moved out to where even the Windows default thread stack placement supposedly doesn't create a problem, up to 4 threads. As you say, Intel OpenMP adjusts stack bases automatically so that stack placement is not a problem, even on early Xeon models, and it really is not practical if the library doesn't take care of this. I think you'll find general agreement that the application can't be expected to detect the platform details and manage stack placement in the case where 8 or more threads are requested of it. So it's difficult for me to understand whether you are advocating a change in attitude, a change in OpenMP implementations, or abandoning all current multi-threading systems.
I am suggesting that the OpenMP implementation be sensitive to the platform it is running on, to the extent that if stack alignment makes a difference then it should also have a means (e.g. an environment or compile-time option) to enable stack alignment when necessary.
Note that a Dual Core with Hyperthreading, e.g. the 840 Extreme Edition, will have potential stack problems when running 4 threads. That is, if you have an application that runs slower with 2 threads on a single core HT processor than with a single thread on that same processor, then it would be expected to run slower with 4 threads on a Dual Core HT processor than with 2 threads when HT is disabled or bypassed.
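To make the environment option idea concrete, here is a rough sketch in C of how a runtime could read such a setting at startup. The variable name DEMO_STACK_STAGGER_BYTES and both function names are invented for this post; no real OpenMP implementation is claimed to recognize them.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Parse a decimal byte count; return `fallback` for missing or
 * malformed text, so that a bad setting leaves the feature off. */
unsigned long parse_stagger(const char *text, unsigned long fallback)
{
    if (text == NULL || text[0] < '0' || text[0] > '9')
        return fallback;                 /* unset, empty, signed or non-numeric */

    char *end = NULL;
    errno = 0;
    unsigned long value = strtoul(text, &end, 10);
    if (errno != 0 || *end != '\0')
        return fallback;                 /* overflow or trailing garbage */
    return value;
}

/* Read the stagger from a named environment variable, e.g. the
 * hypothetical "DEMO_STACK_STAGGER_BYTES"; 0 means "no stagger". */
unsigned long stagger_from_environment(const char *name)
{
    assert(name != NULL);
    return parse_stagger(getenv(name), 0);
}
```

A value of 0 keeps the current behavior, so the option would cost nothing on platforms where stack placement makes no difference.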
Well, I suppose that if you find Intel OpenMP provoking 4M aliasing by accidental non-optimum stack placement, you are welcome to file a premier.intel.com issue with a reproducer demonstrating it.
Your last statement is probably 95% true. Most of the problems which would prevent HT from gaining performance when running a single core package would become worse on a dual core package. An exception: disabling HT on the single core processor doubles the size of ITLB available to a single thread.