Vectorization and auto-parallelization are two different things. Processor architectures provide instructions that can operate on multiple data items in one instruction; the SSE3 instructions are one class of these. These are called Single Instruction Multiple Data (SIMD) instructions. SIMD instructions can process 2, 4, or 8 data items at a time (or 1, 3, 5, 6, or 7 when some of the results are ignored). The number of data items depends on the size of each item (and may change with newer processor architectures). The group of data items is called a vector, so the term vectorization is used to mean SIMD (vectorization is easier to pronounce).
So vectorization will generally speed up even a single-threaded application, at least most of the time. SIMD instructions can impose data alignment requirements, and application data may or may not be aligned. When the compiler cannot know the data alignment, it inserts a runtime test to determine whether the data is aligned, plus extra (peel-loop) code to advance the loop to an alignment boundary. Because of this, vectorization will sometimes slow an application down. Compiler options and/or directives in the source code (#pragma for C/C++, !DEC$ or CDEC$ for Fortran) can enable or disable vectorization, or assert that the data (vector) is already aligned and thus avoid that extra alignment code.
As to why your code runs slower with auto-parallelization and/or OpenMP we cannot say without seeing a sample of your code.
If the granularity of the parallelized loop is too small, you won't get any benefit from parallelization. Send me (email@example.com) the code, and I can take a look to see how you can get a performance gain on C2D.
BTW, vectorization is for exploiting SSSE3, SSE3, SSE2, etc., i.e. vector-level parallelism;
parallelization is for exploiting thread-level parallelism on multicore.