Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Two flaws in code generation

Iliyan_Georgiev
Beginner
612 Views
I've been trying to find out why my very time-critical application runs slower when compiled with ICC 10.1.020 than with MSVC (8 and 9). So far I have found two flaws in code generation:

1) This is a minor one, but it impacted a part of my code significantly. The SSE intrinsic _mm_set1_ps(float) generates 3 instructions: 1 movss and 2 unpcklps. In comparison, MSVC generates 1 movss and 1 shufps, and on the Pentium M and Core 2 architectures shufps is exactly as costly as unpcklps. So I replaced each use of _mm_set1_ps() with 1 _mm_set_ss() and 1 _mm_shuffle_ps(), and that part of the code (about 40 lines, using float replication a few times) got 25% faster on my Pentium M. I was a bit relieved.
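For readers who want to try the same replacement, a minimal sketch of it could look like the following. The helper name broadcast_ps is made up for illustration and is not from the original code. Whether a given compiler really emits one movss plus one shufps for it depends on the compiler version and flags, so the assembly listing (for example via /FA) should be checked.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_set_ss, _mm_shuffle_ps

// Hypothetical helper (name is an assumption, not from the thread):
// replicates one float into all four lanes with a scalar set followed
// by a single shuffle, instead of calling _mm_set1_ps().
static inline __m128 broadcast_ps(float x)
{
    __m128 v = _mm_set_ss(x);                              // lanes: [x, 0, 0, 0]
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));  // lanes: [x, x, x, x]
}
```

The result is meant to be a drop-in substitute wherever _mm_set1_ps(x) was used, so the change can be limited to the hot spots that showed the extra unpcklps instructions.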

But the overall speed increased only a bit, so I continued digging around.

2) With all optimizations turned on (including inlining of every critical function), I found in the generated assembly that there was still one function call, and it was to a compiler-generated function - the so-called "vector constructor iterator" of a custom class with a custom default constructor. I cannot give a concrete example, because I just found it and cannot isolate a small example out of it instantly, but the situation is the following:

struct MyStruct
{
    int a;
    MyStruct() : a(0) {}
};


// somewhere in my code
MyStruct s[3];

And at this point a call to the following function was made:

MyStruct::`vector constructor iterator':
00498630 mov edx,dword ptr [esp+4]
00498634 mov eax,ecx
00498636 test edx,edx
00498638 je MyStruct::`vector constructor iterator'+18h (498648h)
0049863A mov dword ptr [ecx],0
00498640 add ecx,4
00498643 add edx,0FFFFFFFFh
00498646 jne MyStruct::`vector constructor iterator'+0Ah (49863Ah)
00498648 ret 4
0049864B nop

It's a small piece of code, but it's called from a time-critical place in my code and it's located at a completely different place in the binary, so I guess it messes up the instruction cache totally (I'm speculating here, I'm not an expert). The array is allocated on the stack, so I don't see a reason why this call should not be inlined.
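A reduced test case for this might start from something like the sketch below. The function sum_after_construct and its bias parameter are invented here only to keep the array live so it is not optimized away. Nothing in this thread confirms that this exact form triggers the out-of-line "vector constructor iterator" call, so the generated assembly would have to be inspected.

```cpp
struct MyStruct
{
    int a;
    MyStruct() : a(0) {}  // user-provided default constructor
};

// Hypothetical function (an assumption, not from the original code):
// default-constructs a small stack array and then reads every element.
int sum_after_construct(int bias)
{
    MyStruct s[3];        // the array construction where the call was observed
    s[0].a += bias;       // depend on a runtime value to keep the array live
    return s[0].a + s[1].a + s[2].a;
}
```

Compiling this with the same options as the real project and searching the listing for a call inside sum_after_construct would show whether the constructor loop was inlined.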

Is this a known problem? Any suggestions (except for working around it by modifying much of my code)?
12 Replies
JenniferJ
Moderator

We need the source code and compile options so we can tell whether there is a bug (or bugs). Could you send a test case or at least some code snippets? It would be better if you could file an issue report with PremierSupport; see the release notes for how to submit one.

JenniferJ
Moderator

To be more clear:

I'm trying to create a testcase on the 2nd issue, but we need some source code for the 1st issue.

Iliyan_Georgiev
Beginner
OK, I will provide a simple test in a day or two. By the way, I noticed that with the "Favor fast code" option, at many points in my program where the data doesn't fit into the registers, ICC breaks down and generates approximately 3 times more instructions than MSVC, many of which are again unpcklps. If I remove a single line of code so that the data fits into the registers again, ICC generates better code. This is very strange behavior, and unfortunately it slows down my application by another 20% or so. I will test more and report.
JenniferJ
Moderator

I also tried your code snippet for issue #2, but couldn't reproduce the problem at /Ox, /O2, or /O1. Could you send a test case for this as well? It's probably a corner case.

Thanks!

levicki
Valued Contributor I
"The SSE intrinsic _mm_set1_ps(float) generates 3 instructions: 1 movss and 2 unpcklps."

Jennifer, I can confirm this behavior because I have seen it in my code as well (that is exactly why I still write the most critical code in assembler). It seems that the compiler avoids SHUFPS for some reason, regardless of the CPU target.

JenniferJ
Moderator

Igor,

Is it OK to send me some code, so we can find the root cause and fix it?

Thanks in advance!

levicki
Valued Contributor I

Here, try this:


#include <stdio.h>
#include <stdlib.h>
#include <xmmintrin.h>

__m128 set1(float a)
{
	return _mm_set1_ps(a);
}

__m128 load1(float *a)
{
	return _mm_load_ps1(a);
}

int main(int argc, char *argv[])
{
	FILE	*fp;
	float	a;
	__m128	b;

	a = atof(argv[1]);

	b = set1(a);
//	b = load1(&a);

	fopen_s(&fp, "test.bin", "wb");
	fwrite(&b, 1, sizeof(__m128), fp);
	fclose(fp);

	return 0;
}
JenniferJ
Moderator

Thanks Igor.

I saw the problem with "icl -QxW -FA -O2 -c t.cpp" and "cl -arch:SSE2 -O2 -c -FA t.cpp". Is "QxW" the option you're using?

levicki
Valued Contributor I

I have used -QxS -O3.

levicki
Valued Contributor I

Jennifer, have you managed to reproduce the problem with the /QxS switch? If not, try this code:


// compile with: icl /c /FA /QxS jen.c

#include <xmmintrin.h>

#define ALIGN	__declspec(align(128))
#define FORCE	volatile
#define CONST	const

ALIGN	float	scale[5] = { 0.0f, 1.0f, 2.0f, 3.0f, 4.0f };

void somefunc(void)
{
FORCE	__m128	v0, v1;
CONST	float	f0 = 1.0f, f1 = 2.0f;

	v0 = _mm_add_ps(_mm_set1_ps(f0), _mm_mul_ps(_mm_set1_ps(f1), _mm_load_ps(scale)));
	v1 = _mm_mul_ps(_mm_set1_ps(f1), _mm_set1_ps(scale[4]));
}

This is taken from real code; it is not just a contrived example.

JenniferJ
Moderator

Yes, I did reproduce the problem and sent it to engineering to be fixed. Thanks for checking again.

Iliyan_Georgiev
Beginner
Thank you, Igor, for supporting me in this topic. I will try to make as small a test as possible for the vector constructor iterator problem when I have time. In the meantime, I've found two even bigger optimization problems. See my latest topic.