Two optimization suggestions

Andrei_P_1 · ‎10-20-2016

Hello. I want to propose two optimizations that the MSVC compiler performs but ICC does not. All tests were made on Windows x86.

1) Combining several small movs into one. Example:

struct struct_t
{
	char a, b, c, d;
};

void __declspec(noinline) test(struct_t& s)
{
	s.a = 'a';
	s.b = 'b';
	s.c = 'c';
	s.d = 'd';
}

Code generated by the current ICC with -Ox:

        mov       BYTE PTR [eax], 97
        mov       BYTE PTR [1+eax], 98
        mov       BYTE PTR [2+eax], 99
        mov       BYTE PTR [3+eax], 100
        ret

These four byte movs can be combined into a single dword mov, as MSVC does:

        mov       DWORD PTR [ecx], 1684234849		; 64636261H
        ret
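
For illustration, here is a minimal sketch of why the single dword store writes the same bytes as the four byte stores. It assumes a little-endian target (as x86 is) and that struct_t is exactly 4 bytes; the helper names store_bytes, store_dword and same_bytes are hypothetical and not part of the original test.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

struct struct_t
{
	char a, b, c, d;
};

// Four separate byte stores, as in the original test().
void store_bytes(struct_t& s)
{
	s.a = 'a';
	s.b = 'b';
	s.c = 'c';
	s.d = 'd';
}

// One 32-bit store of the combined constant.
// Assumption: little-endian layout, so the low byte 0x61 ('a') lands in s.a.
void store_dword(struct_t& s)
{
	const std::uint32_t packed = 0x64636261u; // 'd','c','b','a' from high byte to low byte
	static_assert(sizeof(struct_t) == sizeof(packed), "struct_t must be 4 bytes");
	std::memcpy(&s, &packed, sizeof(packed));
}

// Hypothetical check: both variants leave identical bytes in memory.
bool same_bytes()
{
	struct_t x{}, y{};
	store_bytes(x);
	store_dword(y);
	return std::memcmp(&x, &y, sizeof(x)) == 0;
}
```

On a big-endian target the constant would have to be byte-swapped, so a compiler has to build the merged immediate according to the target's byte order.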

2) Eliminating useless copies from volatile memory to registers. I think this is correct, and MSVC performs this optimization. Example:

#include <stdio.h>

bool isInt(int)
{
	return true;
}

bool isInt(short)
{
	return false;
}

void __declspec(noinline) test()
{
	volatile int a = 5;
	volatile short b = 2;
	printf("int = %i, short = %i\n", isInt(a), isInt(b));
}

Result:

        sub       esp, 8
        mov       eax, 2
        mov       DWORD PTR [esp], 5
        mov       WORD PTR [4+esp], ax
        mov       edx, DWORD PTR [esp] ; <- unnecessary copying
        movzx     ecx, WORD PTR [4+esp] ; <- unnecessary copying
        ; the copies from edx and ecx to non-volatile memory were here, but they were eliminated as dead code
        push      0
        push      1
        push      OFFSET "int = %i, short = %i\n"
        call      DWORD PTR [__imp__printf]
        add       esp, 12
        add       esp, 8
        ret

Here we can also see two uncombined adds before the return.

jimdempseyatthecove · ‎10-24-2016

volatile requires memory to be referenced. Imagine if the address were an I/O port, e.g. a mouse register. You wouldn't see the mouse move if you kept reading the copy held in a register.
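
A minimal sketch of the point, using a hypothetical helper sum_reads (not from the code in this thread) and an ordinary variable standing in for the port: because the pointee is volatile, every iteration has to perform a real load, even though nothing in the program changes the value in between.

```cpp
#include <cassert>

// Hypothetical helper: reads *port n times and sums the values.
// Because the pointee is volatile, the compiler has to emit one load per
// iteration; it may not read once and multiply by n.
int sum_reads(volatile int* port, int n)
{
	int total = 0;
	for (int i = 0; i < n; ++i)
		total += *port;
	return total;
}
```

With a non-volatile pointer the optimizer would be free to hoist the load out of the loop, which is exactly what would hide a changing hardware register.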

Jim Dempsey

Judith_W_Intel · ‎10-24-2016

We've entered suggestion (1) in our internal bug-tracking database as DPD200415370. There are potential alignment and store-forwarding issues to deal with, but currently, I believe we lack the fundamental capability to do this, even when we know the resulting 32-bit store will be aligned and when targeting architectures where subsequent small loads will all forward.

As for (2), we asked our language expert, and he said:

(1) Is the compiler required to load volatiles a and b in the program below due to the calls to isInt(a) & isInt(b)?

Yep. They don't strictly have to be moved to any particular registers, but they have to be loaded.

(2) Would the answer be different if the arguments to isInt were named?

Nope. The references to a and b aren't inside either of the functions named isInt.

Perhaps we're not using the same switches or compiler version, but when we look at the assembly we see that Microsoft loads them also.

Thanks for the suggestions.

Judy

McCalpinJohn · ‎10-25-2016

One of my favorite low-level optimizations comes from gcc. When I do low-level timer or performance counter work, my code is full of performance counter reads that the Intel compiler turns into code like:

RDPMC                                # read performance counters, getting low 32 bits in %eax and high 16 bits in %edx

MOVL %edx,%edx               # I guess this is to ensure that the high-order bits are clear? may be "executed" in the renamer?

MOVL %eax, %eax              # I guess this is to ensure that the high-order bits are clear? may be "executed" in the renamer?

SHLQ $32, %rdx                   # shift the high order bits

ORQ %rdx,%rax                   # combine the upper and lower bits into a single 64-bit register

MOVQ %rax,(%rsp)              # store the 64-bit value

I had to laugh out loud when I saw what gcc did with the code. Basically it replaced the shift and OR with a pair of simple 32-bit stores. Something like:

RDPMC

MOV %eax,(%rsp)

MOV %edx,4(%rsp)

This may not be any faster than the Intel-generated code -- but it was funny to have a compiler call me an idiot.
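
For reference, a minimal sketch of why the two sequences above are equivalent. It assumes a little-endian target (as x86-64 is); combine_shift_or and combine_two_stores are hypothetical names, and plain integers stand in for the RDPMC result in %eax and %edx.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Shift-and-OR form: build the 64-bit value in a register, then store it.
std::uint64_t combine_shift_or(std::uint32_t lo, std::uint32_t hi)
{
	return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

// Two-store form: write each 32-bit half directly into its place in memory.
// Assumption: little-endian layout, so the low half sits at offset 0.
std::uint64_t combine_two_stores(std::uint32_t lo, std::uint32_t hi)
{
	std::uint64_t out;
	unsigned char* p = reinterpret_cast<unsigned char*>(&out);
	std::memcpy(p, &lo, sizeof(lo));     // like MOV %eax,(%rsp)
	std::memcpy(p + 4, &hi, sizeof(hi)); // like MOV %edx,4(%rsp)
	return out;
}
```

Both functions return the same value for any inputs on a little-endian machine; the second form simply skips the register arithmetic when the result is only going to memory anyway.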