Two flaws in code generation

Iliyan_Georgiev · ‎03-06-2008

I've been trying to find out why my very time critical application runs slower when compiled with ICC 10.1.020 than with MSVC (8 and 9). So far I have found two flaws in code generation:

1) This is a minor one but impacted a part of my code significantly. The SSE2 intrinsic _mm_set1_ps(float) generates 3 instructions - 1 movss and 2 unpcklps. In comparison, MSVC generates 1 movss and 1 shufps. And on PM and Core2 architectures shufps is exactly as costly as unpcklps. So I replaced the use of _mm_set1_ps() by 1 _mm_set_ss() and 1 _mm_shuffle_ps() and that part of the code (about 40 code lines, using float replication a few times) got 25% faster on my Pentium M. I was a bit relieved.
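For readers who want to try the same replacement, here is a minimal sketch of it. The helper names broadcast_ps and check_broadcast are only illustrative, and whether a given compiler really emits one movss plus one shufps for it depends on the compiler version and options.

```cpp
#include <cassert>
#include <xmmintrin.h>  // __m128, _mm_set_ss, _mm_shuffle_ps, _mm_set1_ps

// Hypothetical helper: replicate one float into all four lanes using
// a scalar set followed by a single shuffle.
static inline __m128 broadcast_ps(float x)
{
    __m128 v = _mm_set_ss(x);                              // lane 0 = x, lanes 1..3 = 0
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));  // copy lane 0 into every lane
}

// Hypothetical self-check: the result must match _mm_set1_ps lane by lane.
static inline void check_broadcast(float x)
{
    float a[4], b[4];
    _mm_storeu_ps(a, broadcast_ps(x));
    _mm_storeu_ps(b, _mm_set1_ps(x));
    for (int i = 0; i < 4; ++i)
        assert(a[i] == b[i]);
}
```

Calling check_broadcast(1.5f) only confirms that the two forms produce the same value; the instruction count still has to be checked in the generated assembly (for example with /FA).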

But the overall speed increased only a bit, so I continued digging around.

2) With all optimizations turned on (including inlining of every critical function), I found in the generated assembly that there was still one function call, and it was to a compiler-generated function - the so-called "vector constructor iterator" of a custom class with a custom default constructor. I cannot give a concrete example, because I just found it and cannot isolate a small example out of it instantly, but the situation is the following:

struct MyStruct
{
    int a;
    MyStruct() : a(0) {}
};

// somewhere in my code
MyStruct s[3];

And at this point a call to the following function was made:

MyStruct::`vector constructor iterator':
00498630 mov edx,dword ptr [esp+4]
00498634 mov eax,ecx
00498636 test edx,edx
00498638 je MyStruct::`vector constructor iterator'+18h (498648h)
0049863A mov dword ptr [ecx],0
00498640 add ecx,4
00498643 add edx,0FFFFFFFFh
00498646 jne MyStruct::`vector constructor iterator'+0Ah (49863Ah)
00498648 ret 4
0049864B nop

It's a small piece of code, but it's in a time-critical place in my code and it's located at some completely different place in the binary, so I guess it messes up the instruction cache totally (I'm speculating here, I'm not an expert). The array is allocated on the stack, so I don't see a reason why this call should not be inlined.
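Read back into C++, the listing above appears to do roughly the following. This is only a sketch based on my reading of the disassembly (ecx holds the start of the array, [esp+4] the element count, and the start address is returned in eax); construct_array and demo_construct_array are hypothetical names, not anything the compiler exposes.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

struct MyStruct
{
    int a;
    MyStruct() : a(0) {}
};

// Hypothetical rendering of the helper: default-construct `count` objects
// in place, one after another, and return the start of the array.
static MyStruct* construct_array(void* storage, std::size_t count)
{
    MyStruct* p = static_cast<MyStruct*>(storage);
    for (std::size_t i = 0; i < count; ++i)
        new (p + i) MyStruct();  // corresponds to the store of 0 and the 4-byte step
    return p;
}

// Hypothetical usage: construct three elements in a raw buffer.
static void demo_construct_array()
{
    alignas(MyStruct) unsigned char buf[3 * sizeof(MyStruct)];
    MyStruct* s = construct_array(buf, 3);
    assert(s[0].a == 0 && s[1].a == 0 && s[2].a == 0);
}
```

For a fixed count of 3, the whole loop amounts to three stores of zero, which is why I would expect it to be expanded inline rather than called.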

Is this a known problem? Any suggestions (except for working around it by modifying much of my code)?

JenniferJ · ‎03-10-2008

We need the source code and compile options so we can tell whether there is a bug. Could you send a testcase or at least some code snippets? It would be better if you could file an issue report with PremierSupport; see the release notes for how to submit one.

JenniferJ · ‎03-10-2008

To be more clear:

I'm trying to create a testcase on the 2nd issue, but we need some source code for the 1st issue.

Iliyan_Georgiev · ‎03-10-2008

OK, I will provide a simple test in a day or two. By the way, I noticed that with the "Favor fast code" option, at many points in my program where the data doesn't fit into the registers, ICC breaks down and generates approximately 3 times more instructions than MSVC, many of which are again "unpcklps". If I remove a single line of code so that the data fits into the registers again, ICC generates better code. This is very strange behavior, and unfortunately it slows down my application by another 20% or so. I will test more and report.

JenniferJ · ‎03-11-2008

I also tried your code snippet for issue #2, but couldn't duplicate the problem at /Ox, /O2, or /O1. Could you send a testcase for this as well? It's probably a corner case.

Thanks!

levicki · ‎03-11-2008

The SSE2 intrinsic _mm_set1_ps(float) generates 3 instructions - 1 movss and 2 unpcklps.

Jennifer, I can confirm this behavior because I have seen it in my code as well (that is exactly why I still write the most critical code in assembler). It seems that the compiler avoids SHUFPS for some reason, regardless of the CPU target.

JenniferJ · ‎03-13-2008

Igor,

Is it OK to send me some code, so we can find the root cause and fix it?

Thanks in advance!

levicki · ‎03-13-2008

Here, try this:


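#include <stdio.h>      // FILE, fopen_s, fwrite, fclose
#include <stdlib.h>     // atof
#include <xmmintrin.h>  // __m128, _mm_set1_ps, _mm_load_ps1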

__m128 set1(float a)
{
	return _mm_set1_ps(a);
}

__m128 load1(float *a)
{
	return _mm_load_ps1(a);
}

int main(int argc, char *argv[])
{
	FILE	*fp;
	float	a;
	__m128	b;

	a = atof(argv[1]);

	b = set1(a);
//	b = load1(&a);

	fopen_s(&fp, "test.bin", "wb");
	fwrite(&b, 1, sizeof(__m128), fp);
	fclose(fp);

	return 0;
}

JenniferJ · ‎03-14-2008

Thanks Igor.

I saw the problem with "icl -QxW -FA -O2 -c t.cpp" and "cl -arch:SSE2 -O2 -c -FA t.cpp". Is "QxW" the option you're using?

levicki · ‎03-14-2008

I have used -QxS -O3.

levicki · ‎03-20-2008

Jennifer, have you managed to reproduce the problem with the /QxS switch? If not, try this code:


// compile with: icl /c /FA /QxS jen.c

#include <xmmintrin.h>  // __m128, _mm_set1_ps, _mm_load_ps, _mm_add_ps, _mm_mul_ps

#define ALIGN	__declspec(align(128))
#define FORCE	volatile
#define CONST	const

ALIGN	float	scale[5] = { 0.0f, 1.0f, 2.0f, 3.0f, 4.0f };

void somefunc(void)
{
	FORCE	__m128	v0, v1;
	CONST	float	f0 = 1.0f, f1 = 2.0f;

	v0 = _mm_add_ps(_mm_set1_ps(f0), _mm_mul_ps(_mm_set1_ps(f1), _mm_load_ps(scale)));
	v1 = _mm_mul_ps(_mm_set1_ps(f1), _mm_set1_ps(scale[4]));
}

This is taken from some real code; it is not just a contrived example.

JenniferJ · ‎03-21-2008

Yes, I did reproduce the problem and sent it to an engineer to fix. Thanks for checking again.

Iliyan_Georgiev · ‎03-26-2008

Thank you, Igor, for supporting me in this topic. I will try to make as small a test as possible for the vector constructor iterator problem when I have time. In the meantime, I've found two even bigger optimization problems. See the latest topic by me.