Optimization Question

xcomponent · 07-10-2010

Hello, Friends!
Some background to start:

We have a 2D array, say like this:

unsigned char array[128][128];

and an auxiliary array of pointers to the rows of 'array', like this:

unsigned char *parray[128]; // parray[i] is initialized with &array[i][0] for i = 0..127

In C, these two local pointer initializations are supposed to be exactly equivalent:

unsigned char *a = &array[x][0];
and
unsigned char *a = parray[x];

Now for the reference a[y]. Most compilers with the /O2 optimization flag on will translate

a[y] - a reference to column 'y' of row 'x', where y is a dynamic (code flow) variable
to
BYTE PTR [edx+eax]
(the registers in use are not important; it could be edi instead, and so on), where edx contains the address of row 'x' and eax the value of 'y', i.e. the dynamic displacement value.
What puzzles me is the output of the Intel x86 compiler, which translates the reference differently depending on how the pointer 'a' is initialized:

when unsigned char *a = parray[x];
then the reference a[y] is translated to:
BYTE PTR [edx+eax]

when unsigned char *a = &array[x][0];
then the reference a[y] is translated to:
[_array + X + eax], where _array is the start address and X is the displacement.

In other words, in the first case the compiler uses Base + Index (indexed mode), in the second Base + Index + Displacement (indexed + displacement).
This happens only with the Intel x86 compiler and with none of the others I've used. So my question is whether I should use the first mode, which keeps one register busy with the address (edx in this sample), or the second one, which adds extra bytes to the instruction encoding.
Any suggestions are welcome!
Thanks in advance!

mecej4 · 07-10-2010

You may, in addition, ask whether slightly different code is produced if you replace the C statement

unsigned char *a = &array[x][0];

by the equivalent statement

unsigned char *a = array[x];

followed by an expression in which a is used.
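
A minimal sketch of the equivalence being discussed, assuming x is a valid row index (0..127) and the table is filled as described in the first post. The helper names init_rows and same_row_pointer are hypothetical and exist only for this illustration.

```c
#include <stddef.h>

unsigned char array[128][128];
unsigned char *parray[128];

/* Fill the auxiliary table: parray[x] points at the first byte of row x. */
void init_rows(void)
{
    for (size_t x = 0; x < 128; x++)
        parray[x] = &array[x][0];
}

/* Returns 1 when all three spellings yield the same address for row x. */
int same_row_pointer(size_t x)
{
    unsigned char *a = &array[x][0]; /* explicit address of element 0 of row x */
    unsigned char *b = array[x];     /* row x decays to a pointer to its first element */
    unsigned char *c = parray[x];    /* address loaded from the pointer table */
    return a == b && b == c;
}
```

The three pointers compare equal at the language level, so any difference in the generated addressing mode is a code generation choice rather than a difference in meaning.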

xcomponent · 07-10-2010

OK, thanks for the input. I will try to illustrate it better with some sample code.
I will use an array of structures, because it's closer to my code and the difference is more visible thanks to the scaled addressing mode. The array initialization is there only to avoid some listing issues and to keep code from being excluded as unused.
The most significant array index (the value 10 passed to test_1 and test_2) is picked arbitrarily. The conditional statements are there just to show the reference more clearly.

[bash]typedef unsigned char byte;
typedef struct
{  byte x;
   byte y;
}ts;

ts array[128][128];
ts *parray[128];

void init()
{
  int i,j;
  for(i = 0; i < 128; i++)
  { parray[i] = &array[i][0];
    for(j = 0; j < 128; j++)
    {  array[i][j].x = (byte)i;
       array[i][j].y = (byte)j;
    }
  }
}

int test_2(int index)
{ 
  int i;
  ts *t = parray[index];
  for(i = 0; i < 128; i++)
   if(t[i].x & 2 || t[i].y & 4)
     return 1;
  return 0;
}

int test_1(int index)
{ 
  int i;
  ts *t = &array[index][0];
  for(i = 0; i < 128; i++)
   if(t[i].x & 2 || t[i].y & 4)
     return 1;
  return 0;
}


int main()
{	
  init();
  test_1(10);
  test_2(10);
  return 0;
}
[/bash]

And the listing for the test_1 and test_2 functions:

[bash]; -- Begin  _test_2
; mark_begin;
       ALIGN     16
	PUBLIC _test_2
_test_2	PROC NEAR 
; parameter 1: 4 + esp
.B2.1:                          ; Preds .B2.0

;;; { 
;;;   int i;
;;;   ts *t = parray[index];

  00000 8b 15 28 00 00 00  mov edx, DWORD PTR [_parray+40]  

;;;   for(i = 0; i < 128; i++)
	
  00006 33 c0             xor eax, eax                           
                          ; LOE eax edx ebx ebp esi edi
.B2.2:                    ; Preds .B2.4 .B2.1

;;;    if(t[i].x & 2 || t[i].y & 4)
	
  00008 f6 04 42 02      test BYTE PTR [edx+eax*2], 2           
  0000c 75 12            jne .B2.7 ; Prob 20%                   
                                ; LOE eax edx ebx ebp esi edi
.B2.3:                          ; Preds .B2.2
  0000e f6 44 42 01 04   test BYTE PTR [1+edx+eax*2], 4         
  00013 75 0b            jne .B2.7 ; Prob 20%                  
                                ; LOE eax edx ebx ebp esi edi
.B2.4:                          ; Preds .B2.3
  00015 40               inc eax                              
  00016 3d 80 00 00 00   cmp eax, 128                         
  0001b 7c eb            jl .B2.2 ; Prob 99%                 
                                ; LOE eax edx ebx ebp esi edi
.B2.5:                          ; Preds .B2.4

;;;      return 1;
;;;   return 0;

  0001d 33 c0            xor eax, eax                 
  0001f c3               ret               
                                ; LOE
.B2.7:                          ; Preds .B2.2 .B2.3             ; Infreq
  00020 b8 01 00 00 00   mov eax, 1                          
  00025 c3               ret                               
  00026 8d 76 00 8d bc 
        27 00 00 00 00   ALIGN     16
                                ; LOE
; mark_end;
_test_2 ENDP
;_test_2	ENDS


; -- Begin  _test_1
; mark_begin;
       ALIGN     16
	PUBLIC _test_1
_test_1	PROC NEAR 
; parameter 1: 4 + esp
.B3.1:                          ; Preds .B3.0

;;; { 
;;;   int i;
;;;   ts *t = &array[index][0];
;;;   for(i = 0; i < 128; i++)

  00000 33 c0            xor eax, eax 
                                ; LOE eax ebx ebp esi edi
.B3.2:                          ; Preds .B3.4 .B3.1

;;;    if(t[i].x & 2 || t[i].y & 4)

  00002 f6 04 45 00 0a 
        00 00 02         test BYTE PTR [_array+2560+eax*2], 2   
  0000a 75 15            jne .B3.7 ; Prob 20%                  
                                ; LOE eax ebx ebp esi edi
.B3.3:                          ; Preds .B3.2
  0000c f6 04 45 01 0a 
        00 00 04         test BYTE PTR [_array+2561+eax*2], 4   
  00014 75 0b            jne .B3.7 ; Prob 20%                
                                ; LOE eax ebx ebp esi edi
.B3.4:                          ; Preds .B3.3
  00016 40               inc eax                               
  00017 3d 80 00 00 00   cmp eax, 128                         
  0001c 7c e4            jl .B3.2 ; Prob 99%                   
                                ; LOE eax ebx ebp esi edi
.B3.5:                          ; Preds .B3.4

;;;      return 1;
;;;   return 0;

  0001e 33 c0            xor eax, eax                           
  00020 c3               ret                                  
                                ; LOE
.B3.7:                          ; Preds .B3.2 .B3.3             ; Infreq
  00021 b8 01 00 00 00   mov eax, 1                            
  00026 c3               ret                                  
  00027 8b f6 8d bc 27 
        00 00 00 00      ALIGN     16
                                ; LOE
; mark_end;
_test_1 ENDP
;_test_1	ENDS[/bash]
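
The constants in the listing follow from the data layout: neither function reads its parameter, so the compiler has apparently folded the argument 10 into both of them. The sketch below reproduces the arithmetic, assuming sizeof(ts) is 2 and pointers are 4 bytes wide as on 32-bit x86. The helper names are hypothetical.

```c
#include <stddef.h>

typedef unsigned char byte;
typedef struct { byte x; byte y; } ts;

enum { COLS = 128 };

/* Byte offset of array[index][0].x from the start of array. */
size_t row_byte_offset(size_t index)
{
    return index * COLS * sizeof(ts);
}

/* Byte offset of array[index][i].y from the start of array. */
size_t y_byte_offset(size_t index, size_t i)
{
    return row_byte_offset(index) + i * sizeof(ts) + offsetof(ts, y);
}

/* Byte offset of parray[index] from the start of parray,
   for a given pointer size (assumed 4 on 32-bit x86). */
size_t slot_byte_offset(size_t index, size_t pointer_size)
{
    return index * pointer_size;
}
```

With index 10 this gives 2560 and 2561 for the two displacements in test_1 (_array+2560 and _array+2561), and 40 for the load from _parray+40 in test_2. The eax*2 scale factor in both loops is sizeof(ts).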

p.s. It seems that HTML code editor replaces the i < 128, j < 128 with "i<128" and "j <128", I apologize for that, just don't know how to post it properly I guess.

jimdempseyatthecove · 07-10-2010

Run a benchmark using each method. On iterations 2 through 128 the loop will be in the L1 instruction cache. When the array data is in memory (or in L3 cache, or maybe L2 cache), the extra few bytes in the instruction might have no effect on performance.

The performance difference will vary depending on processor architecture.

Jim Dempsey

jimdempseyatthecove · 07-10-2010

I might add: do the benchmark in the manner in which you expect to reference the data.
If your application references separate arrays of data, or uses the same array but with different values each time, then broaden your test application to include this behavior.

Jim
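
One possible shape for such a benchmark is sketched below, reusing the data layout from the posted sample. The names scan_direct, scan_table and time_scan are hypothetical, and clock() is only a coarse CPU-time measure, so treat this as a starting point under those assumptions rather than a definitive harness.

```c
#include <time.h>

typedef unsigned char byte;
typedef struct { byte x; byte y; } ts;

ts array[128][128];
ts *parray[128];

/* Same initialization as in the posted sample. */
void init(void)
{
    for (int i = 0; i < 128; i++) {
        parray[i] = &array[i][0];
        for (int j = 0; j < 128; j++) {
            array[i][j].x = (byte)i;
            array[i][j].y = (byte)j;
        }
    }
}

/* Row address computed from the array itself. */
int scan_direct(int index)
{
    ts *t = &array[index][0];
    for (int i = 0; i < 128; i++)
        if (t[i].x & 2 || t[i].y & 4)
            return 1;
    return 0;
}

/* Row address loaded from the auxiliary pointer table. */
int scan_table(int index)
{
    ts *t = parray[index];
    for (int i = 0; i < 128; i++)
        if (t[i].x & 2 || t[i].y & 4)
            return 1;
    return 0;
}

/* Runs `reps` sweeps over all 128 rows and returns elapsed CPU seconds.
   The number of rows that matched is stored in *hits so the work
   cannot be discarded by the optimizer. */
double time_scan(int (*scan)(int), int reps, long *hits)
{
    volatile long sink = 0;
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        for (int row = 0; row < 128; row++)
            sink += scan(row);
    clock_t stop = clock();
    *hits = sink;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Call init() once, then time_scan(scan_direct, reps, &hits) and time_scan(scan_table, reps, &hits) with a large reps value and compare the results. Two caveats: calling through a function pointer can change the code the compiler generates compared with a direct call, and with the sample data every scan exits within the first few elements, so real data and the real access pattern should be substituted before drawing conclusions.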

xcomponent · 07-11-2010

OK, to summarize. Loop benchmarks of both the simple code I posted and the real-world application show that using the auxiliary array of pointers is clearly faster, provided that EDX keeps holding the address loaded from the local pointer variable. Unfortunately, loop benchmarks are useless in my case, whether the code under test is the real, complex one or the simpler one I posted. So I benchmarked the real-world application with and without the pointer array, and the results show that without the auxiliary array the application is 1.5-2% faster, mainly because that part of the code is used so intensively. I tested it on 2 different CPUs with the same result.

I would like to underline that the code difference exists only with the Intel x86 compiler. The main point is that the two methods (a local variable holding an address taken from the auxiliary array of pointers to the array's rows, and a local variable holding the address of an array row directly) are supposed to be equivalent, but that is not the case here, and depending on the code and/or hardware, one of the two will be faster.
Thank you!