What generates the question is the following: I was using OpenMP to thread a program, so I used a non-threaded version as a reference. Nothing unusual there. Then I happened to switch on parallelization in the compiler.
Run time without parallelization is 174 seconds for my test.
Run time with the compiler parallelization on is 201 seconds. OK, it is costing me more than it's worth.
So I turn parallelization back off, enable the !$omp directives in the main subroutine that costs me the run time, and turn on the OpenMP compiler switch. The compiler output now shows all the subroutines being parallelized/vectorized. How do I keep the compiler from vectorizing all the other subroutines that don't have !$omp directives? While there is a huge amount of documentation, the interconnections that appear to exist between some settings aren't clearly captured in an obvious place.
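For what it's worth, here is a minimal sketch of the two usual ways to suppress vectorization, assuming the Intel Fortran compiler (the `-no-vec` option and the `!DIR$ NOVECTOR` directive are Intel-specific; other compilers have their own equivalents):

```fortran
! Sketch, assuming Intel Fortran (ifort):
!   ifort -openmp -no-vec mycode.f90    (Linux; /Qopenmp /Qvec- on Windows)
! builds with OpenMP support while disabling auto-vectorization globally.
! Alternatively, disable vectorization for one loop with a directive:
subroutine scale(a, n)
  implicit none
  integer, intent(in)  :: n
  real,    intent(inout) :: a(n)
  integer :: i
  !DIR$ NOVECTOR            ! ask the compiler not to vectorize this loop
  do i = 1, n
     a(i) = 2.0 * a(i)
  end do
end subroutine scale
```

The per-loop directive is usually preferable to the global switch, since vectorization often helps the loops it is left on for.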
Thanks for your assistance.
Vectorization and auto-parallelization are two different things. The processor architecture includes instructions that can operate on multiple data items in one instruction; the SSE3 instructions are one class of these. These are called Single Instruction, Multiple Data (SIMD) instructions. A SIMD instruction can process 2, 4, or 8 data items at a time (or 1, 3, 5, 6, or 7 when some of the results are ignored). The number of data items depends on the size of each item (and may change with newer processor architectures). The group of data items is called a vector, hence the term vectorization for generating SIMD code (vectorization is easier to pronounce).
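As a concrete illustration (the subroutine name and shape are just an example), here is the kind of loop a vectorizing compiler maps onto SIMD instructions:

```fortran
! Sketch: with 128-bit SSE registers and 4-byte reals, one SIMD
! instruction handles 4 elements, so the loop body executes roughly
! n/4 times instead of n times.
subroutine saxpy(y, x, a, n)
  implicit none
  integer, intent(in)    :: n
  real,    intent(in)    :: x(n), a
  real,    intent(inout) :: y(n)
  integer :: i
  do i = 1, n
     y(i) = y(i) + a * x(i)   ! vectorizable: iterations are independent
  end do
end subroutine saxpy
```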
So vectorization will generally speed up even a single-threaded application, at least most of the time. SIMD instructions can impose data-alignment requirements, and application data may or may not be aligned. When the compiler cannot know the data alignment, it inserts a runtime alignment test, plus extra "peel" code to advance the loop to an aligned starting point. Because of this, vectorization will sometimes slow an application down. Compiler options and/or directives in the source code (#pragma for C++ or cDEC$ directives for Fortran) can enable or disable vectorization, and can also assert that the initial state of the data (vector) is aligned, avoiding the need for that runtime check.
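A sketch of the alignment-assertion directive, again assuming Intel Fortran (`!DIR$ VECTOR ALIGNED` is Intel-specific, and it is a promise: if the arrays are not actually aligned, the program may fault):

```fortran
! Sketch, assuming Intel Fortran: assert that a, b, c start on
! 16-byte boundaries so the compiler can skip the runtime alignment
! test and the peel loop.
subroutine add(c, a, b, n)
  implicit none
  integer, intent(in)  :: n
  real,    intent(in)  :: a(n), b(n)
  real,    intent(out) :: c(n)
  integer :: i
  !DIR$ VECTOR ALIGNED       ! caller must guarantee alignment
  do i = 1, n
     c(i) = a(i) + b(i)
  end do
end subroutine add
```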
As to why your code runs slower with auto-parallelization and/or OpenMP we cannot say without seeing a sample of your code.
If the granularity of the parallelized loop is too small, you will not get a benefit from parallelization. Send me (email@example.com) the code; I can take a look to see how you can get a performance gain on C2D.
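One standard way to handle small-granularity loops is the OpenMP IF clause, which only forks threads when the trip count is large enough to amortize the fork/join overhead. A minimal sketch (the 50000 threshold is an assumption; tune it by measurement):

```fortran
! Sketch: parallelize only when n is large enough to pay for
! the thread fork/join overhead.
subroutine sum_sq(s, a, n)
  implicit none
  integer, intent(in)  :: n
  real,    intent(in)  :: a(n)
  real,    intent(out) :: s
  integer :: i
  s = 0.0
  !$omp parallel do reduction(+:s) if(n > 50000)
  do i = 1, n
     s = s + a(i) * a(i)
  end do
  !$omp end parallel do
end subroutine sum_sq
```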
BTW, vectorization is for exploiting SSSE3, SSE3, SSE2, i.e. vector-level parallelism; parallelization is for exploiting thread-level parallelism on multicore.