08-27-2009 02:21 AM
The -openmp compiler option enables recognition of OpenMP directives and generation of multithreaded code. The option also triggers the parallel optimization phase in the compiler.
08-27-2009 09:40 AM
By "one core", do you mean a single core in a processor (i.e., 1/8, 1/4, or 1/2 of the cores, depending on the Intel processor)?
If so, the purpose of OpenMP is to perform COARSE-GRAINED parallelism across the cores of a processor in an SMP system. If you only wish to use a single core, I would not suggest OpenMP; instead, perform VECTORIZATION or FINE-GRAINED PARALLELISM to make better use of the SSE registers rather than the x87 stack, which gives you effective data-parallelism.
Restricting execution to a single core in an SMP system, leaving the other cores unused, can be done through the THREAD-AFFINITY facility of the operating system.
08-27-2009 09:44 AM
As quoted by you: "compiler utilises in some way the information that iterations of the outer loop can be executed in parallel?"
Could you share the complete compiler command used to build this code?
08-27-2009 11:24 AM
Pushing optimization issues aside:
A potential reason for a difference (though not in your circumstance) arises when a loop is forced to run with more threads than there are hardware threads (e.g., two threads on a system with one hardware thread): the outer loop is cut in half and then the inner loop(s) run on each half. Depending on cache locality, running portions of the larger loop on a single hardware thread may yield better cache locality of the data.
In your case, I do not believe this to be so. Your outer loop is likely run through the OpenMP version of the code as one piece (the totality) of the loop. Placement of the code may have more to do with the difference (~1%).
08-27-2009 02:26 PM
I am aware that OpenMP is not suited for use with one thread/core (a single core in a processor). However, during speedup measurements (for one and more cores), I observed this strange behavior (superlinear speedup on one core), and I am wondering what the reason is.
Here it is, but I am not sure if this is what you mean:
Could you expand on this statement? What do you mean by "placement of the code"?
08-27-2009 04:52 PM
08-27-2009 10:35 PM
"placement of code" - well code and/or data placement matters
The cache system typically reads and writes data in lines. Cache lines are (today) 64 bytes long and are transferred singly or in pairs (depending on the system). If an often-used small structure is aligned such that it crosses cache lines, reading and writing will touch two cache lines and performance is diminished. A similar thing happens when data accesses cross page boundaries: the system has a very small number of TLBs (translation lookaside buffers). Often a small code change (adding/removing a line or two) can cause a +/- 1% performance change, sometimes much larger.