Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Loop Vectorization 01

Royi
Novice
373 Views

Hello,

I have a vectorization problem.
I have an array of structs, pDst, whose elements have 3 fields named 'red', 'green' and 'blue'.
The field type might be 'Char', 'Short' or 'Float'. This layout is given and cannot be altered.
We have another array, pSrc, which represents an RGB image: an array of 3 pointers, each of which points to one layer of the image. Each layer is an IPP plane-oriented image (that is, each plane is allocated independently with 'ippiMalloc_32f_C1'): http://software.intel.com/sites/products/documentation/hpc/ipp/ippi/ippi_ch3/functn_Malloc.html

We would like to copy it as described in the following code:

for(int y = 0; y < imageHeight; ++y)
{
    for(int x = 0; x < imageWidth; ++x)
    {
        pDst[x + y * pDstRowStep].red   = pSrc[0][x + y * pSrcRowStep];
        pDst[x + y * pDstRowStep].green = pSrc[1][x + y * pSrcRowStep];
        pDst[x + y * pDstRowStep].blue  = pSrc[2][x + y * pSrcRowStep];
    }
}

Yet in this form the compiler can't vectorize the code.
At first it reports: "loop was not vectorized: existence of vector dependence.".
When I add #pragma ivdep to help the compiler (since there is no dependence), I get a different message: "loop was not vectorized: dereference too complex.".

Does anyone have an idea how to allow vectorization?

I have Intel Compiler 13.0.

Thanks.

3 Replies
Royi
Novice
Hi Jennifer, you referred me to an article that explains how to enforce vectorization, whereas I want to understand why the loop isn't vectorized automatically. Maybe there is a different way to arrange the data that would yield the needed vectorization. Thank you.
Mark_S_Intel1
Employee
The compiler does not vectorize the code because it believes vectorizing it would be inefficient, mostly due to the Array of Structures (AoS) access, which requires generating gather/scatter instructions that are slow relative to linear access instructions:

struct X {
  float red, green, blue;
};

struct X *restrict pDst;
float *restrict pSrc[3];

void foo(int imageHeight, int imageWidth, int pDstRowStep, int pSrcRowStep){
  int x, y;
  for (y = 0; y < imageHeight; y++){
    for (x = 0; x < imageWidth; x++){
      pDst[x + y * pDstRowStep].red   = pSrc[0][x + y * pSrcRowStep];
      pDst[x + y * pDstRowStep].green = pSrc[1][x + y * pSrcRowStep];
      pDst[x + y * pDstRowStep].blue  = pSrc[2][x + y * pSrcRowStep];
    }
  }
}

$ icc -vec-report2 -c -restrict vec10.cpp -V
Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 13.0.0.079 Build 20120731
Copyright (C) 1985-2012 Intel Corporation. All rights reserved.

vec10.cpp(11): (col. 4) remark: loop was not vectorized: vectorization possible but seems inefficient.
vec10.cpp(10): (col. 3) remark: loop was not vectorized: not inner loop.

Section 5.3, about SoA vs AoS, at http://software.intel.com/en-us/articles/a-guide-to-auto-vectorization-with-intel-c-compilers/ might give some helpful information about this.

You can bypass the compiler's cost-benefit analysis and have it vectorize the loop by using the -vec-threshold0 option, but the code may run slowly for the default vectorization target, which is SSE2:

$ icc -vec-report2 -c -restrict vec10.cpp -V -vec-threshold0
Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 13.0.0.079 Build 20120731

vec10.cpp(11): (col. 4) remark: LOOP WAS VECTORIZED.
vec10.cpp(10): (col. 3) remark: loop was not vectorized: not inner loop.
The compiler can vectorize the code for AVX without the -vec-threshold0 option, but I am not sure it will give much speedup compared to the non-vectorized version:

$ icc -vec-report2 -c -restrict vec10.cpp -V -xAVX
Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 13.0.0.079 Build 20120731

vec10.cpp(11): (col. 4) remark: LOOP WAS VECTORIZED.
vec10.cpp(10): (col. 3) remark: loop was not vectorized: not inner loop.
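
For illustration only, here is a minimal sketch of the same planar-to-AoS copy with the row base pointers hoisted out of the inner loop, so that the inner loop indexes by x alone. The names RGB and planarToInterleaved are hypothetical and not from the thread, the float case is assumed, and row steps are assumed to be in elements as in the loop above. This only simplifies the addressing; the stores into the struct are still stride-3 accesses, and I have not checked whether any particular compiler version vectorizes it or whether it runs faster.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the fixed destination struct (float case assumed).
struct RGB { float red, green, blue; };

// Copies three independent float planes into an array of RGB structs.
// srcRowStep and dstRowStep are counted in elements, not bytes (assumption).
void planarToInterleaved(const float* const src[3], std::size_t srcRowStep,
                         RGB* dst, std::size_t dstRowStep,
                         std::size_t width, std::size_t height)
{
    // A row must be at least as long as the visible image width.
    assert(srcRowStep >= width && dstRowStep >= width);

    for (std::size_t y = 0; y < height; ++y) {
        // Hoist the per-row base pointers so the inner loop uses only x.
        const float* r = src[0] + y * srcRowStep;
        const float* g = src[1] + y * srcRowStep;
        const float* b = src[2] + y * srcRowStep;
        RGB* out = dst + y * dstRowStep;

        for (std::size_t x = 0; x < width; ++x) {
            // Source reads are unit-stride; destination writes are stride-3.
            out[x].red   = r[x];
            out[x].green = g[x];
            out[x].blue  = b[x];
        }
    }
}
```

A caller would pass the three plane pointers as an array, for example `const float* planes[3] = { pR, pG, pB };`, together with the two row steps. Compiling the result with -vec-report2, as in the commands above, shows what the vectorizer decides for the inner loop.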