Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Optimization Question

xcomponent
Beginner
347 Views
Hello, Friends!
Some information for start:

We have 2d array, let's say like this:

unsigned char array[128][128];

and auxiliary array of pointers to the 'array's rows like this:

unsigned char *parray[128]; //initialized with '=&array[0....127][0];

Now... In C those 2 local pointers expressions are supposed to be absolutely equivalent:

unsigned char *a = &array[0];
and
unsigned char*a = parray;

Now for the reference to 'a[]'. Most compilers with /O2 optimization flag on will translate it as:

"a" - reference to the 'y' column of the 'x' row, where y is a dynamic (code flow) variable.
to
BYTE PTR [edx+eax]
(registers in use are not important - could be edi instead an so on) where edx contains the address of 'x' and eax - the value 'y', i.e. dynamic displacement value.
What puzzles me is the translation of the intel x86 compiler, which gives different results of translation with the different initialization of the 'a' ptr:

when unsigned char*a = parray;
then the reference a is translated to:
BYTE PTR [edx+eax]

when unsigned char *a = &array[0];
then the reference a is translated to:
[_array + X + eax], where _array - start address and X - displacement.

In other words, in the first case the compiler uses Base + Index (indexed mode), in the second - Base + Index + Displacement (indexed + displacement).
This happens only with the intel x86 compiler and none of any other I've used. So my question is whether I should use the first mode with one register busy with the address (edx in that sample), or the second one that adds extra bytes to the reference.
Any suggestions are welcomed!
Thanks in advance!
0 Kudos
5 Replies
mecej4
Honored Contributor III
347 Views
You may, in addition, ask whether slightly different code is produced if you replace the C statement

unsigned char *a = &array[0];

by the equivalent statement

unsigned char *a = array;

followed by an expression in which a is used.
0 Kudos
xcomponent
Beginner
347 Views
Ok, Thanks for the input. I will try to illustrate it better with a sample code.
I will use an array of structures, because it's closer to my code and more visible due to the usage of Scaled addressing mode. The initialization of the array is just to avoid some listing issues and excluded unusable code.
The most significant index to the array (with value of 10 passed as a test_1 and test_2 parameter) is randomly picked. The conditional statements is to just show better the reference.

[bash]typedef unsigned char byte;
typedef struct
{ byte x;
byte y;
}ts;

ts array[128][128];
ts *parray[128];

void init()
{
int i,j;
for(i = 0; i < 128; i++)
{ parray = &array[0];
for(j = 0; j < 128; j++)
{ array.x = (byte)i;
array.y = (byte)j;
}
}
}

int test_2(int index)
{
int i;
ts *t = parray[index];
for(i = 0; i < 128; i++)
if(t.x & 2 || t.y & 4)
return 1;
return 0;
}

int test_1(int index)
{
int i;
ts *t = &array[index][0];
for(i = 0; i < 128; i++)
if(t.x & 2 || t.y & 4)
return 1;
return 0;
}


int main()
{
init();
test_1(10);
test_2(10);
return 0;
}
[/bash]
And the listing for the test_1 and test_2 functions:

[bash]; -- Begin  _test_2
; mark_begin;
ALIGN 16
PUBLIC _test_2
_test_2 PROC NEAR
; parameter 1: 4 + esp
.B2.1: ; Preds .B2.0

;;; {
;;; int i;
;;; ts *t = parray[index];

00000 8b 15 28 00 00 00 mov edx, DWORD PTR [_parray+40]

;;; for(i = 0; i < 128; i++)

00006 33 c0 xor eax, eax
; LOE eax edx ebx ebp esi edi
.B2.2: ; Preds .B2.4 .B2.1

;;; if(t.x & 2 || t.y & 4)

00008 f6 04 42 02 test BYTE PTR [edx+eax*2], 2
0000c 75 12 jne .B2.7 ; Prob 20%
; LOE eax edx ebx ebp esi edi
.B2.3: ; Preds .B2.2
0000e f6 44 42 01 04 test BYTE PTR [1+edx+eax*2], 4
00013 75 0b jne .B2.7 ; Prob 20%
; LOE eax edx ebx ebp esi edi
.B2.4: ; Preds .B2.3
00015 40 inc eax
00016 3d 80 00 00 00 cmp eax, 128
0001b 7c eb jl .B2.2 ; Prob 99%
; LOE eax edx ebx ebp esi edi
.B2.5: ; Preds .B2.4

;;; return 1;
;;; return 0;

0001d 33 c0 xor eax, eax
0001f c3 ret
; LOE
.B2.7: ; Preds .B2.2 .B2.3 ; Infreq
00020 b8 01 00 00 00 mov eax, 1
00025 c3 ret
00026 8d 76 00 8d bc
27 00 00 00 00 ALIGN 16
; LOE
; mark_end;
_test_2 ENDP
;_test_2 ENDS


; -- Begin _test_1
; mark_begin;
ALIGN 16
PUBLIC _test_1
_test_1 PROC NEAR
; parameter 1: 4 + esp
.B3.1: ; Preds .B3.0

;;; {
;;; int i;
;;; ts *t = &array[index][0];
;;; for(i = 0; i < 128; i++)

00000 33 c0 xor eax, eax
; LOE eax ebx ebp esi edi
.B3.2: ; Preds .B3.4 .B3.1

;;; if(t.x & 2 || t.y & 4)

00002 f6 04 45 00 0a
00 00 02 test BYTE PTR [_array+2560+eax*2], 2
0000a 75 15 jne .B3.7 ; Prob 20%
; LOE eax ebx ebp esi edi
.B3.3: ; Preds .B3.2
0000c f6 04 45 01 0a
00 00 04 test BYTE PTR [_array+2561+eax*2], 4
00014 75 0b jne .B3.7 ; Prob 20%
; LOE eax ebx ebp esi edi
.B3.4: ; Preds .B3.3
00016 40 inc eax
00017 3d 80 00 00 00 cmp eax, 128
0001c 7c e4 jl .B3.2 ; Prob 99%
; LOE eax ebx ebp esi edi
.B3.5: ; Preds .B3.4

;;; return 1;
;;; return 0;

0001e 33 c0 xor eax, eax
00020 c3 ret
; LOE
.B3.7: ; Preds .B3.2 .B3.3 ; Infreq
00021 b8 01 00 00 00 mov eax, 1
00026 c3 ret
00027 8b f6 8d bc 27
00 00 00 00 ALIGN 16
; LOE
; mark_end;
_test_1 ENDP
;_test_1 ENDS[/bash]

p.s. It seems that HTML code editor replaces the i < 128, j < 128 with "i<128" and "j <128", I apologize for that, just don't know how to post it properly I guess.
0 Kudos
jimdempseyatthecove
Honored Contributor III
347 Views
Run a benchmark using each method. On iterations 2:128 the loop will be in the L1 instruction cache. When the array data is in memory (or L3 cach, or maybe L2 cache) the extra few bytes in the instruction might have no effect on performance.

The performance difference will vary depending on processor archetecture.

Jim Dempsey
0 Kudos
jimdempseyatthecove
Honored Contributor III
347 Views
I might add, do the benchmark in the manner in which you useexpect to reference the data.
If your application references separate arrays of data, or uses the same array but with different values each time, then broaden your test application to include this behavior.

Jim
0 Kudos
xcomponent
Beginner
347 Views
Ok, to summarize things. Running loop benchmarks for the posted simple code and the real world application shows that the usage of auxiliary array of pointers is clearly faster. Assume:EDX with the address from the value of the local ptr variable stays though. Unfortunately, Loop benchmarks in my case are useless, whether the tested code is the real and complex one, or the simpler that I've posted. So I've benchmarked the real world application with and without the ptr array, and the results show that without using the auxiliary array, the application is 1.5-2% faster, mainly due to the intesive usage of that part of the code. I've tested it on 2 different CPU's with the same result. I would like to underline, that the code difference exists only with intel x86. The main point is that the two methods - using local var, which contains an address from auxiliary array of pointers to an array's rows and using local var with the address of an array's row are supossed to be equivalent, but this is not the case here and depending on the code and/or hardware, one of those two will be faster.
Thank you!
0 Kudos
Reply