I would like to ask a question about parallelization + vectorization:
1) Is it possible to implement parallelization and vectorization at the same time (i.e. use AVX on a Sandy Bridge processor from multiple threads running in parallel) using the OpenMP parallel and SIMD constructs?
2) If yes, is this parallelization + vectorization method (using OpenMP) applicable to all other SIMD architectures (e.g. FMA4 on AMD and NEON on the Raspberry Pi)?
I'm not sure whether this is the right forum for this question, and it needs some clarification.
Depending on your specific question, any of the following may be relevant:
The ideal way to combine OpenMP with vectorization is the outer-loop-parallel / inner-loop-vectorized model.
For the platforms targeted by Intel compilers, you also have constructs such as #pragma omp parallel for simd available, which can vectorize within parallel data chunks. Compilers for some of your targets may not actually implement this, although current GNU compilers accept the syntax.
Your question raises the suspicion that you didn't bother to open up your search tool to look for similar topics and refine your question.
Typically, parallelization and vectorization are two separate things that, in my opinion, do not affect each other.
I achieved the best results with manual vectorization, because only the developer knows the internal details.
For parallelization: I've achieved the best results with the parallel_invoke construct (and not with easier-to-use constructs such as parallel_for or #pragma omp parallel for).
It's also very interesting to look at the details of the parallelization (e.g. with VTune Amplifier): how much time task start-up and task synchronization take compared to the time consumed by the tasks themselves.
I tried parallelization with OpenMP (#pragma ...) and the result was very slow, much slower than without parallelization, but parallelization with parallel_invoke did the job right.
Also keep in mind that massive parallelization can be a problem for the cache system, especially if the memory layout is not adjusted for parallel access. Take memory alignment into account for data accesses; this can greatly improve your latency.