Solved: Your question is a good one.

Gilad_D_Intel · ‎11-06-2015

The following code is taken from https://software.intel.com/en-us/code-samples/intel-c-compiler/application-domains/finance/averaging-filter
here is the serial code which is not hard to explain

Linear Version of the code

for(int i = 1; i < (h+1); i++) 
{ 
    int x = ((resized_width * i) + 1); 
    for(int j = x; j < (x + w); j++) 
    {
        unsigned int red = 0, green = 0, blue = 0;
        for(int k1 = (-1); k1 <= 1; k1++)
        { 
            int pos = j + (k1 * resized_width);
            for(int k2 = (-1); k2 <= 1; k2++) 
            {
                red += resized_indataset[(pos + k2)].red; 
                green += resized_indataset[(pos + k2)].green; 
                blue += resized_indataset[(pos + k2)].blue; 
            }
        } 
        resized_outdataset[j].red = red/9; 
        resized_outdataset[j].green = green/9; 
        resized_outdataset[j].blue = blue/9; 
    }
}

Array Notation

for(int i = 1; i < (h+1); i++) 
{
    int x = ((resized_width * i) + 1);
    for(int j = x; j < (x + w); j+=8) 
    {
        int row2index = 3*j - 3;
        int jump = (resized_width * 3);
        int row1index = row2index - jump;
        int row3index = row2index + jump;
        out[(row2index + 3):24] = (in[row1index:24] + in[row1index+3:24] + in[row1index+6:24] + in[row2index:24] + in[row2index+3:24] + in[row2index+6:24] + in[row3index:24] + in[row2index+3:24] + in[row2index+6:24])/9; 
    }
}

I'm trying to understand the code but I do not understand that basics assumption here.

In the following code

  for(int j = x; j < (x + w); j+=8)
        {
            int row2index = 3*j - 3;
            int jump = (resized_width * 3);
            int row1index = row2index - jump;
            int row3index = row2index + jump;
            out[(row2index + 3):24] = (in[row1index:24] + in[row1index+3:24] + in[row1index+6:24] + in[row2index:24] + in[row2index+3:24] + in[row2index+6:24] + in[row3index:24] + in[row3index+3:24] + in[row3index+6:24])/9;
        }

a.how do you decide that j needs to jump to j+=8 ? why 8?
b.why does jump = (resized_width * 3)?
c.why do we have index +0, index+3 ,index +6 ??

ARCH_R_Intel · ‎11-06-2015

Your question is a good one. This is the first time I've seen this code, and did not see any comments explaining it, so I was puzzled too.

Answer to a: The code appears to be trying to use array notation to seduce the compiler into generating vector code, by type-punning slices of 8 {r,g,b} structures onto vectors of length 24. Why 24? The vector length needs to be a multiple of 3 because the r,g,b structure has 3 elements. The widest type in each calculation is a 32-bit integer (because of C++ promotion rules). An AVX instruction can do 8 32-bit operations at once. 24 is the least common multiple of 8 and 3, and so works out well for AVX; i.e. three AVX instructions can operate on a 24-element slice. If the target machine had only SSE instructions, which can do only 4 32-bit operations at once, then maybe 12 would be a better value. 24 would still be workable, but might raise the register pressure badly, or might help by improving instruction-level parallelism. The only way to be sure is to time the variations -- modern hardware is too complex to predict. That's why when I write this sort of code, I make the 24 a symbolic constant and time the variations.

Answer to b: The jump is in units of the type-punned pointer to a color component (one of r, g, or b), not the original pointer to a struct containing a triple {r, g, b}.

Answer to c: See answer to be for why the offsets are multiples of 3. Though a more literal translation of the serial code's offsets -1, 0, 1, would rescale them to -3, 0, 3. For reasons not apparent to me, the code shifts the offsets upwards by 3, and compensates by burying a -3 in this line:

int row2index = 3*j - 3;

View solution in original post

ARCH_R_Intel · ‎11-06-2015

Your question is a good one. This is the first time I've seen this code, and did not see any comments explaining it, so I was puzzled too.

Answer to a: The code appears to be trying to use array notation to seduce the compiler into generating vector code, by type-punning slices of 8 {r,g,b} structures onto vectors of length 24. Why 24? The vector length needs to be a multiple of 3 because the r,g,b structure has 3 elements. The widest type in each calculation is a 32-bit integer (because of C++ promotion rules). An AVX instruction can do 8 32-bit operations at once. 24 is the least common multiple of 8 and 3, and so works out well for AVX; i.e. three AVX instructions can operate on a 24-element slice. If the target machine had only SSE instructions, which can do only 4 32-bit operations at once, then maybe 12 would be a better value. 24 would still be workable, but might raise the register pressure badly, or might help by improving instruction-level parallelism. The only way to be sure is to time the variations -- modern hardware is too complex to predict. That's why when I write this sort of code, I make the 24 a symbolic constant and time the variations.

Answer to b: The jump is in units of the type-punned pointer to a color component (one of r, g, or b), not the original pointer to a struct containing a triple {r, g, b}.

Answer to c: See answer to be for why the offsets are multiples of 3. Though a more literal translation of the serial code's offsets -1, 0, 1, would rescale them to -3, 0, 3. For reasons not apparent to me, the code shifts the offsets upwards by 3, and compensates by burying a -3 in this line:

int row2index = 3*j - 3;

Gilad_D_Intel · ‎11-06-2015

Thanks for the quick answer but i'm still a little puzzled.

so if you run the linear version I get in pos+k2

0 1 2
3266 3267 3268
6532 6533 6534

1 2 3
3267 3268 3269
6533 6534 6535

2 3 4
3268 3269 3270
6534 6535 6536

3 4 5
3269 3270 3271
6535 6536 6537

and so on for the next 8 runs - we said 8 because of 3*8 =24 for the AVX
each position has RGB channels we are calculating resized_outdataset where
j = the middle 2,2 index starting from 3267, 3268 for the next 3x3

so we iterate 8 times over 3 char* which are the RGB channels for each position

when I use the AN version I get totally different indices

row1index = 0
row2index = 9798
row3index = 19596

Can you please explain this? even the assignment in the end of the loop is into row2Index which is not the same index as j.

I would expect the syntax here to say

(in[row1index:24] + in[row1index+3:24] + in[row1index+6:24]

iterating over 8*3 chars

the first index was 0 1 2 for the first iteration for pos+k2 so 0 == row1index when pos+k2 =1 this means you need to skip 3 chars because of the RGB channels so we get to row1index+3 and so on for index+6 when pos+k2=2

but when pos+k2 = 3266 , 3267, 3268 for the first iteration the AN version syntax says

in[row2index:24] + in[row2index+3:24] + in[row2index+6:24]

and row2index starts from 9798.... so i'm lost here

ARCH_R_Intel · ‎11-06-2015

The index j in the serial code is an index into an array of struct, where there are 3 bytes per struct. The row...index variable is indexing on the scale of bytes. So there's a 3x ratio, i.e. 9798/3 = 3266.

Gilad_D_Intel · ‎11-07-2015

Thanks very much, now I understand the whole code. what a weird way of programming....

Image processing average filter sample code